
Hi, I've been using hxt to process XML files. Now that my files are getting a bit bigger (30 MB), I'm finding that hxt uses inordinate amounts of memory. I have 8 GB on my box, and it's running out. As far as I can tell, this memory is getting used up while parsing the text, rather than in any downstream processing by xpickle. Is this a known issue? Matthew

Matthew Pocock wrote:
I've been using hxt to process XML files. Now that my files are getting a bit bigger (30 MB), I'm finding that hxt uses inordinate amounts of memory. I have 8 GB on my box, and it's running out. As far as I can tell, this memory is getting used up while parsing the text, rather than in any downstream processing by xpickle.
Is this a known issue?
Yes, hxt calls parsec, which is not incremental. haxml offers the choice of non-incremental parsers and incremental parsers. The incremental parsers offer finer control (and therefore also require finer control).

On Thursday 24 January 2008, Albert Y. C. Lai wrote:
Matthew Pocock wrote:
I've been using hxt to process XML files. Now that my files are getting a bit bigger (30 MB), I'm finding that hxt uses inordinate amounts of memory. I have 8 GB on my box, and it's running out. As far as I can tell, this memory is getting used up while parsing the text, rather than in any downstream processing by xpickle.
Is this a known issue?
Yes, hxt calls parsec, which is not incremental.
haxml offers the choice of non-incremental parsers and incremental parsers. The incremental parsers offer finer control (and therefore also require finer control).
I've got a load of code using xpickle, which taken together is quite an investment in hxt. Moving to haxml may not be very practical, as I'd have to find some equivalent of xpickle for haxml and port thousands of lines of code over. Is there likely to be a low-cost solution for convincing hxt to be incremental that would get me out of this mess? Matthew

Would a bytestring-backed implementation of parsec solve my problems? Is there such a beast out there? Matthew

It well might. I believe you can do this yourself. Jeremy Shaw wrote a
parser for Debian control files, which was useless on the really large
package index files. He switched it over to using bytestrings and that
solved the problem. You can find the code in a darcs repository at:
http://src.seereason.com/haskell-debian. It may reference other libraries
hosted at that URL, which you can browse from the base URL.
On Jan 24, 2008 1:34 PM, Matthew Pocock wrote:
Would a bytestring-backed implementation of parsec solve my problems? Is there such a beast out there?
Matthew

On Jan 24, 2008 10:34 PM, Matthew Pocock wrote:
Would a bytestring-backed implementation of parsec solve my problems? Is there such a beast out there?
I'm working on one as part of another project. It's not incremental and needs some optimising (I've focused on correctness so far). I don't think it solves your problem, though, as you'll only save a constant factor of memory by using ByteStrings over Strings, and it may not be enough. I also don't know how to integrate it with HXT. In case you want to have a look anyway, you can find it here: http://darcs.johantibell.com/hyena/Hyena/ByteStringParser.hs -- Johan

"Matthew Pocock"
On Thursday 24 January 2008, Albert Y. C. Lai wrote:
Matthew Pocock wrote:
I've been using hxt to process XML files. Now that my files are getting a bit bigger (30 MB), I'm finding that hxt uses inordinate amounts of memory. I have 8 GB on my box, and it's running out. As far as I can tell, this memory is getting used up while parsing the text, rather than in any downstream processing by xpickle.
Is this a known issue?
Yes, hxt calls parsec, which is not incremental.
haxml offers the choice of non-incremental parsers and incremental parsers. The incremental parsers offer finer control (and therefore also require finer control).
I've got a load of code using xpickle, which taken together is quite an investment in hxt. Moving to haxml may not be very practical, as I'd have to find some equivalent of xpickle for haxml and port thousands of lines of code over. Is there likely to be a low-cost solution for convincing hxt to be incremental that would get me out of this mess?
Matthew
I don't think so. Even if you replace parsec, HXT is itself not incremental. (It stores the whole XML document in memory as a tree, and the tree is not memory efficient.) Still, I am a bit surprised that you can't parse 30 MB with 8 GB of memory. This was discussed here before, and I think someone benchmarked HXT as using roughly 50 bytes of memory per byte of input, i.e. HXT would then be using about 1.5 GB of memory for your 30 MB file. Rene.

Hello Rene, Friday, January 25, 2008, 10:49:53 PM, you wrote:
Still, I am a bit surprised that you can't parse 30 MB with 8 GB of memory.
This was discussed here before, and I think someone benchmarked HXT as using roughly 50 bytes of memory per byte of input, i.e. HXT would then be using about 1.5 GB of memory for your 30 MB file.
To be exact, it's GHC itself that uses so much memory (on 64-bit systems). The calculation is simple:
- one word (8 bytes) for the Char itself
- one word for the tail pointer
- one word for the laziness thunk
That's already 24 bytes, and depending on the GC used this amount increases either 2x (with the compacting GC) or 3x (with the default, copying GC). HXT may add even more overhead, keeping several copies of the data or building lazy thunks on each char. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
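As a back-of-the-envelope check, the calculation works out like this (a sketch using Bulat's estimated figures from above, not measured numbers):

    -- Rough heap cost per Char of a lazy String on a 64-bit GHC,
    -- following the estimate above: Char box + cons tail pointer + thunk.
    bytesPerChar :: Int
    bytesPerChar = 8 + 8 + 8                  -- 24 bytes per character

    -- Assume the default copying GC roughly triples the live footprint.
    stringCost :: Int -> Int
    stringCost inputBytes = inputBytes * bytesPerChar * 3

    main :: IO ()
    main = print (stringCost (30 * 1024 * 1024))
    -- prints 2264924160, i.e. roughly 2.1 GB for a 30 MB file,
    -- before HXT's own tree and any extra copies are counted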

Rene de Visser wrote:
"Matthew Pocock"
wrote in message news:200801241917.33281.matthew.pocock@ncl.ac.uk... On Thursday 24 January 2008, Albert Y. C. Lai wrote:
Matthew Pocock wrote:
I've been using hxt to process XML files. Now that my files are getting a bit bigger (30 MB), I'm finding that hxt uses inordinate amounts of memory. I have 8 GB on my box, and it's running out. As far as I can tell, this memory is getting used up while parsing the text, rather than in any downstream processing by xpickle.
Is this a known issue?
Yes, hxt calls parsec, which is not incremental.
haxml offers the choice of non-incremental parsers and incremental parsers. The incremental parsers offer finer control (and therefore also require finer control).
I've got a load of code using xpickle, which taken together is quite an investment in hxt. Moving to haxml may not be very practical, as I'd have to find some equivalent of xpickle for haxml and port thousands of lines of code over. Is there likely to be a low-cost solution for convincing hxt to be incremental that would get me out of this mess?
Matthew
I don't think so. Even if you replace parsec, HXT is itself not incremental. (It stores the whole XML document in memory as a tree, and the tree is not memory efficient.)
This statement isn't true in general: HXT itself can be incremental if there is no need to traverse the whole XML tree. When processing a document containing a DTD there is indeed a need for traversal, even when no validation is required, because of entity substitution. Technically it's not a big deal to write a very simple and lazy parser, or to take the tagsoup or haxml lazy parsers and adapt them to the hxt DOM structure. Combining the parser with the ByteString lib raises a small problem, the handling of Unicode chars, so there is a need for a lazy Word8 to Unicode (Char) conversion, but that's already in HXT (thanks to Henning Thielemann). So the problem is not a technical one; it's just a matter of time and resources. If someone has such a lightweight lazy XML parser, I will help to integrate it into hxt. Uwe

"Uwe Schmidt"
This statement isn't true in general: HXT itself can be incremental if there is no need to traverse the whole XML tree. When processing a document containing a DTD there is indeed a need for traversal, even when no validation is required, because of entity substitution.
It would be nice if HXT was incremental even when you are processing the whole tree. If I remember correctly, the data type of the tree in HXT is something like

    data Tree = Tree NodeData [Tree]

which means that already processed parts of the tree can't be garbage collected, because the parent node is holding onto them. If instead it was

    data Tree = Tree NodeData (IORef [Tree])

we could remove each subtree as it is processed (well, just before would probably be necessary, and we would need to rely on blackholing to remove the reference on the stack). This would perhaps allow already processed subtrees to be garbage collected. Together with lazy evaluation this could lead to quite good memory usage. Rene.
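A minimal sketch of the destructive traversal such a type would enable (NodeData and the traversal order here are illustrative assumptions, not HXT's actual types):

    import Data.IORef

    data NodeData = NodeData String            -- placeholder payload
    data Tree = Tree NodeData (IORef [Tree])

    -- Visit the node, then process the children one at a time, removing
    -- each child from the parent before descending, so that processed
    -- subtrees are no longer reachable and can be garbage collected.
    processTree :: (NodeData -> IO ()) -> Tree -> IO ()
    processTree visit (Tree d ref) = do
        visit d
        let loop = do
              kids <- readIORef ref
              case kids of
                []     -> return ()
                k : ks -> do
                  writeIORef ref ks            -- drop the reference to k
                  processTree visit k
                  loop
        loop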

On Monday 28 January 2008, Rene de Visser wrote:
It would be nice if HXT was incremental even when you are processing the whole tree.
If I remember correctly, the data type of the tree in HXT is something like
data Tree = Tree NodeData [Tree]
which means that already processed parts of the tree can't be garbage collected because the parent node is holding onto them.
If instead it was
data Tree = Tree NodeData (IORef [Tree])
We could remove each subtree as it is processed (well, just before would probably be necessary, and we would need to rely on blackholing to remove the reference on the stack). This would perhaps allow already processed subtrees to be garbage collected. Together with lazy evaluation this could lead to quite good memory usage.
Rene.
Not so sure about this. For streaming processing, it would be nicer to have something like StAX, with a stack of already entered elements kept around as book-keeping (the tags + attribute sets back to the root). Let's face it: if you sign up to a document model, you are signing up to a document, and shouldn't be surprised when it sits in memory. I think the 'right' solution, at least in part, goes with the problem to be solved. I'd be upset if we moved to something more complex where my code breaks because something accidentally garbage collected data that I need to back-track to. Matthew

Hi
Not so sure about this. For streaming processing, it would be nicer to have something like StAX, with a stack of already entered elements kept around as book-keeping (the tags + attribute sets back to the root). Let's face it: if you sign up to a document model, you are signing up to a document, and shouldn't be surprised when it sits in memory.
The code is so simple I'm not sure it even needs to go in the tagsoup library :-)

    -- Keep a stack of the currently open tags while streaming.
    process stack (t@(TagOpen name attribs) : rest) = process (t : stack) rest
    process (TagOpen name1 _ : stack) (TagClose name2 : rest)
        | name1 == name2 = process stack rest
    process stack rest = ... your code here ...

You are probably right that every application needs to tailor this pattern to its own needs, but it's sufficiently simple that I can't see that being a problem. Thanks Neil

Rene de Visser wrote:
If I remember correctly, the data type of the tree in HXT is something like
data Tree = Tree NodeData [Tree]
which means that already processed parts of the tree can't be garbage collected because the parent node is holding onto them.
This statement only holds if you still need the root node after processing. When the tree is processed top down, the root and all trees to the left of the currently processed tree can be garbage collected. This is precisely the type of computation needed when, e.g., working with picklers to convert XML into native Haskell data structures. BTW: I've taken the tagsoup lib and written a small parser to build a tree out of the stream of tags. It's about 100 lines of code. This DOM parser does not need to read until the closing tag to build an element node, so it should be as lazy as possible. A first version for HTML already runs on my box, but it still needs a bit of testing. Uwe
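The following is not Uwe's parser, just a minimal sketch of the lazy tree-building idea, assuming the current TagSoup Tag type: each Element node is produced before its closing tag has been read, so consumers can start work immediately.

    import Text.HTML.TagSoup (Tag(..), parseTags)

    data Node = Element String [Node]       -- children are built lazily
              | Text String
        deriving Show

    -- Build a forest from a tag stream, returning the unconsumed rest.
    -- The Element constructor is available after matching only the open
    -- tag; its children (and closing tag) are read on demand.
    forest :: [Tag String] -> ([Node], [Tag String])
    forest (TagOpen name _ : ts) =
        let (kids, ts')  = forest ts        -- up to our closing tag
            (sibs, ts'') = forest ts'       -- after our closing tag
        in (Element name kids : sibs, ts'')
    forest (TagClose _ : ts) = ([], ts)
    forest (TagText s  : ts) =
        let (sibs, ts') = forest ts
        in (Text s : sibs, ts')
    forest (_ : ts) = forest ts             -- skip comments etc.
    forest []       = ([], [])

    parseForest :: String -> [Node]
    parseForest = fst . forest . parseTags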

Hi Uwe,
BTW: I've taken the tagsoup lib and written a small parser to build a tree out of the stream of tags. It's about 100 lines of code. This DOM parser does not need to read until the closing tag to build an element node, so it should be as lazy as possible. A first version for HTML already runs on my box, but it still needs a bit of testing.
Please send a patch with whatever you come up with, so others can make use of it. I've already added Data.HTML.TagSoup.Tree to the latest darcs version, which does as well as it can with tag matching, but is entirely strict. Having a lazy version would be great. I've been talking to the author of the Java tagsoup (http://tagsoup.info), which does very clever processing of HTML to make it as structured and normalised as possible. He said:
The schema that describes HTML can be found at src/definitions/html.tssl in the source archive; I'll be glad to explain any obscurities in it.
There are also some slides on his website (at the bottom) which detail the Java TagSoup approach to reconstructing HTML, and they have obviously had a lot of thought put into them! Thanks Neil

Hi Neil,
Please send a patch with whatever you come up with, so others can make use of it. I've already added Data.HTML.TagSoup.Tree to the latest darcs version, which does as well as it can with tag matching, but is entirely strict. Having a lazy version would be great.
It's too early for a new release; testing, especially performance testing, is not yet done, but the first version is in the darcs repository "http://darcs.fh-wedel.de/hxt/" (the version number is still 7.4). Those who urgently need a lazier XML parser may try that one. Usage: call readDocument as usual, but with an extra option:

    readDocument [..., (a_tagsoup, "1")]
I've been talking to the author of the Java tagsoup (http://tagsoup.info), which does very clever processing of HTML to make it as structured and normalised as possible. He said:
The schema that describes HTML can be found at src/definitions/html.tssl in the source archive; I'll be glad to explain any obscurities in it.
There are also some slides on his website (at the bottom) which detail the Java TagSoup approach to reconstructing HTML, and they have obviously had a lot of thought put into them!
I will have a look into that. Currently the strategy for repairing lousy HTML is the same as in the parsec HTML parser, and that's equivalent to what is done in HaXml. Uwe

"Rene de Visser"
Even if you replace parsec, HXT is itself not incremental. (It stores the whole XML document in memory as a tree, and the tree is not memory efficient.)
If the usage pattern of the tree is search-and-discard, then only enough of the tree to satisfy the search needs to be stored in memory at once. Everything from the root to the first node of interest can easily be pruned by the garbage collector. A paper describing the lazy parsing technique, and using XML-parsing as its motivating example, is available at http://www.cs.york.ac.uk/~malcolm/partialparse.html
haxml offers the choice of non-incremental parsers and incremental parsers.
Indeed. This lazy incremental parser for XML is available in the development version of HaXml: http://www.cs.york.ac.uk/fp/HaXml-devel The source code for partial parsing is available in a separate package: http://www.cs.york.ac.uk/fp/polyparse These lazy parser combinators are roughly between 2x - 5x faster than Parsec on large inputs (although the strict variation is about 2x slower than Parsec). Regards, Malcolm
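A small illustration of the search-and-discard pattern, using a generic rose tree rather than HaXml's actual Content type:

    data Tree a = Node a [Tree a]

    -- Enumerate nodes lazily in document order. If the tree itself is
    -- built lazily by the parser, everything before the first match can
    -- be reclaimed by the GC as the search moves past it.
    nodes :: Tree a -> [a]
    nodes (Node x kids) = x : concatMap nodes kids

    -- Search and discard: find the first node of interest.
    firstMatch :: (a -> Bool) -> Tree a -> Maybe a
    firstMatch p t = case filter p (nodes t) of
        (x : _) -> Just x
        []      -> Nothing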

Matthew Pocock wrote:
I've been using hxt to process XML files. Now that my files are getting a bit bigger (30 MB) I'm finding that hxt uses inordinate amounts of memory. : Is this a known issue?
Yes. I parse what I suppose are rather large XML files (the largest so far is 26GB), and ended up replacing HXT code with TagSoup. I also needed to use concurrency[1]. XML parsing is still slow, typically consuming 90% of the CPU time, but at least it works without blowing the heap. While I haven't tried HaXML, there is IMO a market opportunity for a fast and small XML library, and I'd happily trade away features like namespace support or arrows interfaces for that. -k [1] http://www.mail-archive.com/haskell-cafe@haskell.org/msg31862.html -- If I haven't seen further, it is by standing in the footprints of giants

Hi

One of the problems with XML parsing is nesting. Consider this fragment:

    <foo>lots of text</foo>

The parser will naturally want to track all the way down to the closing </foo> in order to check the document is well formed, so it can put it in a tree. The problem is that means keeping "lots of text" in memory - often the entire document. TagSoup works in a lazy streaming manner, so would parse the above as:

    [TagOpen "foo" [], TagText "lots of text", TagClose "foo"]

i.e. it hasn't matched the foo's, and can return the TagOpen before even looking at the text.
XML parsing is still slow, typically consuming 90% of the CPU time, but at least it works without blowing the heap.
I'd love TagSoup to go faster while retaining its laziness. Basic profiling doesn't suggest anything obvious, but I may have missed something. It's more likely that it would be necessary to prod at the Core level, or move to supporting both (Lazy)ByteString and [Char]. Thanks Neil
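To make the streaming point concrete, here is a small sketch of consuming a TagSoup stream in constant memory (assuming TagSoup's parseTags and lazy readFile; the file name is made up):

    import Text.HTML.TagSoup

    -- Count <foo> elements. parseTags is lazy, so tags are produced and
    -- consumed one at a time; the whole document never sits in memory.
    countFoos :: String -> Int
    countFoos = length . filter isFooOpen . parseTags
      where
        isFooOpen (TagOpen "foo" _) = True
        isFooOpen _                 = False

    main :: IO ()
    main = readFile "big.xml" >>= print . countFoos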

ketil+haskell:
Matthew Pocock
writes: I've been using hxt to process XML files. Now that my files are getting a bit bigger (30 MB) I'm finding that hxt uses inordinate amounts of memory. : Is this a known issue?
Yes. I parse what I suppose are rather large XML files (the largest so far is 26GB), and ended up replacing HXT code with TagSoup. I also needed to use concurrency[1]. XML parsing is still slow, typically consuming 90% of the CPU time, but at least it works without blowing the heap.
While I haven't tried HaXML, there is IMO a market opportunity for a fast and small XML library, and I'd happily trade away features like namespace support or arrows interfaces for that.
So this is a request for an xml-light based on lazy bytestrings, designed for speed at all costs? -- Don

Don Stewart wrote:
So this is a request for an xml-light based on lazy bytestrings, designed for speed at all costs?
Yes, I suppose it is. (For certain values of "all costs".) For "industrial" use, I think it is important to have better performance, ideally approaching disk and network speeds, and support large documents without excessive memory consumption. This probably means that strict, whole-document trees (DOM-like) must be optional. I think a good approach would be a TagSoup-like (SAX-like) lazy ByteString parser, with more advanced features (checking for well-formedness, building a tree structure, validation, namespace support..) layered on top. These days, there is a lot of XML around, so solid and performant XML processing could be another step in missing our stated mission goal of avoiding success at all costs. -k -- If I haven't seen further, it is by standing in the footprints of giants
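A rough sketch of what such layering might look like; every name here is hypothetical, a design illustration rather than an existing library:

    import qualified Data.ByteString.Lazy as L

    -- Layer 1: a lazy, SAX-like event stream over lazy bytestrings.
    data Event = StartElement L.ByteString [(L.ByteString, L.ByteString)]
               | EndElement   L.ByteString
               | Characters   L.ByteString
               | ParseError   String

    -- Layers above are ordinary stream transformers: well-formedness
    -- checking, namespace resolution, validation.
    type EventFilter = [Event] -> [Event]

    -- The optional DOM layer, built on demand from the event stream.
    data Node = Element L.ByteString [(L.ByteString, L.ByteString)] [Node]
              | TextNode L.ByteString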

On 1/26/08 3:43 AM, Ketil Malde wrote:
I think a good approach would be a TagSoup-like (SAX-like) lazy ByteString parser, with more advanced features (checking for well-formedness, building a tree structure, validation, namespace support..) layered on top.
Perhaps a more modern approach would be StAX[1]-like rather than SAX-like? In either case, streaming, non-DOM. I am concerned by the number of people expressing willingness to abandon namespace support, but perhaps I'm being too much of a purist....
These days, there is a lot of XML around, so solid and performant XML processing could be another step in missing our stated mission goal of avoiding success at all costs.
Agreed. Keith

1. http://stax.codehaus.org/
   http://www.xml.com/pub/a/2003/09/17/stax.html

On Saturday 26 January 2008, Keith Fahlgren wrote:
Perhaps a more modern approach would be StAX[1]-like rather than SAX-like? In either case, streaming, non-DOM.
I am concerned by the number of people expressing willingness to abandon namespace support, but perhaps I'm being too much of a purist....
Keith
StAX is fine for a very wide range of applications, including web services. In the web-service domain, namespaces, entity expansion and XSD are not optional extras, but these can be layered on top of StAX rather than on a more DOM-like structure. Just as a reality check: we regularly stream XML messages between web services in Java where the message bodies are many gigabytes in size, using StAX, and neither the client nor the server needs anything other than a constant memory overhead, as the portions of the message are generated and consumed in a streaming manner. It would be very nice if we could do similar things in Haskell. Matthew

matthew.pocock:
On Saturday 26 January 2008, Keith Fahlgren wrote:
Perhaps a more modern approach would be StAX[1]-like rather than SAX-like? In either case, streaming, non-DOM.
I am concerned by the number of people expressing willingness to abandon namespace support, but perhaps I'm being too much of a purist....
Keith
StAX is fine for a very wide range of applications, including web services. In the web-service domain, namespaces, entity expansion and XSD are not optional extras, but these can be layered on top of StAX rather than on a more DOM-like structure.
Just as a reality check: we regularly stream XML messages between web services in Java where the message bodies are many gigabytes in size, using StAX, and neither the client nor the server needs anything other than a constant memory overhead, as the portions of the message are generated and consumed in a streaming manner. It would be very nice if we could do similar things in Haskell.
Lazy evaluation FTW. :) -- Don

Hi
Perhaps a more modern approach would be StAX[1]-like rather than SAX-like? In either case, streaming, non-DOM.
Remember, Haskell has lazy evaluation. TagSoup is basically a SAX approach, without the massive pain of the Java version and the API being back to front compared to what you want. If you assume that your XML is well formed, then I think TagSoup is already StAX as well. TagSoup has been carefully engineered to run in constant memory.
I am concerned by the number of people expressing willingness to abandon namespace support, but perhaps I'm being too much of a purist....
TagSoup has both no namespace support and full namespace support: it will happily read tags with namespaces in them, and it's trivial in your application to "deal" with them as you wish. Thanks Neil
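For example, a minimal sketch of that application-side handling, splitting a qualified name into prefix and local part (splitQName is an illustrative name, not part of TagSoup):

    -- Split "soap:Envelope" into (Just "soap", "Envelope");
    -- names without a prefix map to (Nothing, name).
    splitQName :: String -> (Maybe String, String)
    splitQName name = case break (== ':') name of
        (local, "")         -> (Nothing, local)
        (prefix, _ : local) -> (Just prefix, local)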
participants (13)
- Albert Y. C. Lai
- Bulat Ziganshin
- Clifford Beshers
- Don Stewart
- Johan Tibell
- Keith Fahlgren
- Ketil Malde
- Malcolm Wallace
- Matthew Pocock
- Neil Mitchell
- Rene de Visser
- Steve Lihn
- Uwe Schmidt