HaXml, memory usage and segmentation fault

I have Hugs version February 2001, HaXml version 1.02 and this program:
module Main where

import XmlLib

main = processXmlWith (invoices `o` tag "invoice")

invoices = html [ hhead [ htitle [ ("Invoices"!) ] ]
                , hbody [ customers `o` children `with` tag "customer" ] ]

customers = cat [ h2 [ ("Customer"!) ]
                , contracts `o` children `with` tag "contract" ]

contracts = cat [ h3 [ ("Id:"!), ("id"?) ]
                , hpara [ ("Access:"!), keep /> txt ] `o` children `with` tag "access"
                , hpara [ ("Intl:"!),   keep /> txt ] `o` children `with` tag "inter" ]
This program can process the following file:

<?xml version='1.0'?>
<invoice>
 <customer>
  <contract id='1'>
   <access>1</access>
   <inter>1</inter>
  </contract>
  <contract id='2'>
   <access>2</access>
   <inter>2</inter>
  </contract>
 </customer>
</invoice>

(I use "runhugs translate.hs invoice.xml invoice.html".)

Now increase the number of <customer>s to 10, and the number of <contract>s within each customer to 999. After that, "runhugs -h6000000 translate.hs invoice.xml invoice.html" dumps core :(

What's the reason: a bug in Hugs, a bug in HaXml, or my own bad programming techniques?

-- Dmitry Astapov //ADEpt
E-mail: adept@umc.com.ua
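For readers who want to reproduce the failing case, here is a minimal Haskell sketch (not from the original thread) that generates an input file of the shape described above; the whitespace and element layout are illustrative:

module Main where

-- Generate the enlarged test input: 10 <customer>s, each holding
-- 999 <contract>s, matching the shape of the small example above.
main :: IO ()
main = putStr document

document :: String
document = "<?xml version='1.0'?>\n<invoice>\n"
        ++ concatMap customer [1 .. 10 :: Int]
        ++ "</invoice>\n"

customer :: Int -> String
customer _ = " <customer>\n"
          ++ concatMap contract [1 .. 999 :: Int]
          ++ " </customer>\n"

contract :: Int -> String
contract n = "  <contract id='" ++ show n ++ "'>\n"
          ++ "   <access>" ++ show n ++ "</access>\n"
          ++ "   <inter>" ++ show n ++ "</inter>\n"
          ++ "  </contract>\n"

Running it with something like "runhugs gen.hs > invoice.xml" produces an input of roughly the size Dmitry describes.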

Dmitry Astapov wrote:
I have Hugs version February 2001, HaXml version 1.02 and this program: [...] This program can process the following file:
<?xml version='1.0'?> <invoice> [... one <customer> containing two <contract>s ... ] </invoice>
Now increase the number of <customer>s to 10, and the number of <contract>s within each customer to 999. After that, "runhugs -h6000000 translate.hs invoice.xml invoice.html" dumps core :(
What's the reason: a bug in Hugs, a bug in HaXml, or my own bad programming techniques?
More an inappropriate use of Hugs -- 10 <customer>s with 999 <contract>s each is a moderately large input file, and the Hugs interpreter just isn't designed to work with large inputs. Try compiling the program instead.

The other issue is that HaXml's XML parser is insufficiently lazy (although the rest of HaXml has very nice strictness properties). For instance, there's no reason why your program shouldn't run in near-constant space, but due to the way the parser is structured it won't begin producing any output until the entire input document has been read.

Try the identity transform 'main = processXmlWith keep' on your sample document and see if that runs out of heap too. If so, there's not much you can do short of replacing the HaXml parser.

--Joe English  jenglish@flightlab.com
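Spelled out as a complete program, the identity-transform test Joe suggests is just this, using the same XmlLib interface as Dmitry's program:

module Main where

import XmlLib

-- Parse the input document and write it back out unchanged.
-- If even this exhausts the heap, the leak is in parsing (or
-- printing), not in the transformation combinators.
main = processXmlWith keep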

What's the reason: a bug in Hugs, a bug in HaXml, or my own bad programming techniques?
JE> More an inappropriate use of Hugs -- 10 <customer>s with 999
JE> <contract>s each is a moderately large input file,

Almost 6 megs.

JE> and the Hugs interpreter just isn't designed to work with large
JE> inputs. Try compiling the program instead.

Well, ghc-5.02 seems to dislike something inside XmlLib.hs - it could not find the interface defs file for module IOExts. I plan to look more deeply into it, though.

JE> The other issue is that HaXml's XML parser is insufficiently lazy
JE> (although the rest of HaXml has very nice strictness properties). For
JE> instance, there's no reason why your program shouldn't run in
JE> near-constant space, but due to the way the parser is structured it
JE> won't begin producing any output until the entire input document has
JE> been read.

I suspected it, and your comment encouraged me to look more deeply into the code, and yes - it seems that examples like mine simply do not fit in :(

JE> Try the identity transform 'main = processXmlWith keep' on your sample
JE> document and see if that runs out of heap too. If so, there's not
JE> much you can do short of replacing the HaXml parser.

I got:

runhugs98 +sgt -h5000000 translate_invoices.hs invoice.xml invoice_small.html
runhugs: Error occurred
{{Gc:4788153}}{{Gc:4619912}}{{Gc:4442164}}{{Gc:4271039}}{{Gc:4122687}}
{{Gc:3964107}}{{Gc:3827478}}{{Gc:3680235}}{{Gc:3554593}}{{Gc:3417827}}
{{Gc:3286249}}{{Gc:3175771}}{{Gc:3053698}}{{Gc:2936095}}{{Gc:2839042}}
{{Gc:2729711}}{{Gc:2624806}}{{Gc:2539770}}{{Gc:2442035}}{{Gc:2347994}}
{{Gc:2257773}}{{Gc:4077399}}{{Gc:3715115}}
(47153895 reductions, 79953374 cells, 23 garbage collections)
{{Gc:3812956}}ERROR - Control stack overflow

I tried to put several "observe" statements in the code, but they seem to be ignored in the case of "Control stack overflow".

-- Dmitry Astapov //ADEpt
E-mail: adept@umc.com.ua
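The "observe" statements refer to Hood-style tracing. A minimal sketch of the idiom, assuming the Observe module distributed with Hugs; the label and the observed expression here are illustrative, not from Dmitry's program:

module Main where

import Observe

-- observe wraps an intermediate value with a label; Hood records how
-- much of the structure was actually demanded, and runO prints the
-- collected observations when the wrapped action finishes.
main :: IO ()
main = runO (print (sum (observe "input list" [1 .. 10 :: Int])))

Since runO reports its observations only when the action completes, a run that dies in a control-stack overflow never reaches that point, which would explain the observations appearing to be ignored.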

Dmitry Astapov wrote:
JE> and the Hugs interpreter just isn't designed to work with large
JE> inputs. Try compiling the program instead.

Well, ghc-5.02 seems to dislike something inside XmlLib.hs - it could not find the interface defs file for module IOExts. I plan to look more deeply into it, though.
I got it to compile with ghc 5.02 using

    ghc --make -package lang translate.hs

The compiled version succeeds, but on a large document it uses a *lot* of memory and starts paging pretty badly.
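To see where that memory goes, GHC's heap profiler would be the natural tool. A sketch, with flag spellings from the GHC 5.x era and illustrative file names:

    ghc --make -package lang -prof -auto-all -o translate translate.hs
    ./translate invoice.xml invoice.html +RTS -hc -RTS
    hp2ps translate.hp

The last step turns the translate.hp heap log into a PostScript graph of live heap broken down by cost centre, which is presumably how the profiles discussed later in this thread were produced.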
JE> Try the identity transform 'main = processXmlWith keep' on your sample
JE> document and see if that runs out of heap too. If so, there's not
JE> much you can do short of replacing the HaXml parser.
I tried this as well, modifying your program to use an XML parser I wrote a while ago that has better laziness properties than the HaXML one. Alas, my parser also suffers from a space leak under Hugs, so this only deferred the problem. Under ghc/ghci, though, it has modest memory requirements and runs without paging.

--Joe English  jenglish@flightlab.com

JE> I got it to compile with ghc 5.02 using
JE> ghc --make -package lang translate.hs
JE> The compiled version succeeds, but on a large document it uses a *lot*
JE> of memory and starts paging pretty badly.

Exactly. A PIII-800 with 192M of RAM died on me, swapping, when I tried to run the compiled version with a 16M stack and an input file with 100000 children in one node.
JE> Try the identity transform 'main = processXmlWith keep' on your sample
JE> document and see if that runs out of heap too. If so, there's not
JE> much you can do short of replacing the HaXml parser.
I tried this with ghc 5.02, and it ran in 20M of RAM or so. It could be less, but at least it runs, and does not segfault like Hugs :)

JE> I tried this as well, modifying your program to use an XML parser I
JE> wrote a while ago that has better laziness properties than the HaXML
JE> one. Alas, my parser also suffers from a space leak under Hugs, so
JE> this only deferred the problem. Under ghc/ghci, though, it has modest
JE> memory requirements and runs without paging.

Is its distribution restricted? Is it possible to get it somewhere, use it, patch it, etc.?

-- Dmitry Astapov //ADEpt
E-mail: adept@umc.com.ua

Dmitry Astapov wrote:
JE> I tried this as well, modifying your program to use an XML parser I
JE> wrote a while ago that has better laziness properties than the HaXML
JE> one. Alas, my parser also suffers from a space leak under Hugs, so
JE> this only deferred the problem. Under ghc/ghci, though, it has modest
JE> memory requirements and runs without paging.
Is its distribution restricted? Is it possible to get it somewhere, use it, patch it, etc.?
If you don't mind a complete lack of documentation, sure :-) The code is alpha quality; there are a few missing features and a couple of things that it just gets wrong, but it's basically working. I'll package it up and put it on the Web when I get a chance. This may take a day or two...

--Joe English  jenglish@flightlab.com

An update on Dmitry's problems with HaXml memory usage:

+ Compiling HaXml and the driver program with ghc -O helps a *lot*.

+ Using the version of HaXml that comes preinstalled with GHC (-package text) helps even more. There is a slight difference in the 'Pretty' module (which is used to print the output) between the two versions.

+ I wrote an adapter that converts my parser's XML representation into HaXml's, so you can use it as a drop-in replacement. This helps some, but not enough. The heap profile using HaXml 1.02 has two large humps: the first from parsing the input, and the second from pretty-printing the output. (With the GHC version of HaXml the second hump is about half as tall as with the "official" HaXml version.) With the new parser, only the smaller hump remains.

+ Figuring that using a pretty-printer is overkill, I replaced it with a quick hack that converts the HaXml representation _back_ into my representation and feeds it to a serializer that I had previously written. This improves things some more: the identity transformation 'processXmlWith keep' now has a flat heap profile.

+ Unfortunately, Dmitry's original program still has a space leak. I suspect that the HaXml combinators (or, more likely, the HaXml internal representation) are not as space-efficient as I had originally thought, since when I rewrote Dmitry's test case to use the new parser's internal representation directly I again got a flat heap profile -- there doesn't seem to be anything wrong with the structure of the original program.

The code will be ready to release Real Soon Now; I'll keep you posted.

--Joe English  jenglish@flightlab.com
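The serializer trick in the fourth point can be illustrated with a sketch. The Element/Content types below are simplified stand-ins invented for the example, not HaXml's actual representation; the point is only that emitting output through ShowS difference lists streams lazily, whereas a layout-computing pretty-printer tends to hold on to large chunks of the document:

module Serialize where

-- Simplified stand-in types for an XML tree (illustrative only).
data Element = Elem String [(String, String)] [Content]
data Content = CElem Element | CText String

-- Serialize via ShowS (difference lists), so the output String can be
-- consumed lazily, left to right, as it is produced.
serialize :: Element -> String
serialize e = showElem e ""

showElem :: Element -> ShowS
showElem (Elem name attrs kids)
  | null kids = open . showString "/>"
  | otherwise = open . showChar '>'
              . foldr (\c rest -> showContent c . rest) id kids
              . showString "</" . showString name . showChar '>'
  where
    open = showChar '<' . showString name . showAttrs attrs

showAttrs :: [(String, String)] -> ShowS
showAttrs = foldr attr id
  where attr (n, v) rest = showChar ' ' . showString n
                         . showString "='" . showString v . showChar '\''
                         . rest

showContent :: Content -> ShowS
showContent (CElem e) = showElem e
showContent (CText s) = showString s   -- assumes text is already escaped

With this style, 'putStr (serialize doc)' starts emitting output as soon as the leftmost part of the tree is available, which is consistent with the flat heap profile Joe reports for the identity transformation.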
participants (2):

- Dmitry Astapov
- Joe English