
Hi,

I'm struggling to get my HXT-based parser to parse a largish file (<300MB), even after breaking it into reasonably-sized chunks. The culprit appears to be parsing one element comprising 25K lines of text, which apparently requires more memory than the 2GB my computer is equipped with.

I'm wondering what approach others use for non-toy XML data. Is the problem due to some error I have made, or should I just ignore the XML and parse it "manually" by dissecting bytestrings, or will another XML library serve better?

-k
--
If I haven't seen further, it is by standing in the footprints of giants

Hi Ketil,
I'm struggling to get my HXT-based parser to parse a largish file (<300MB), even after breaking into reasonably-sized chunks. The culprit appears to be parsing one element comprising 25K lines of text, which apparently requires more memory than the 2Gb my computer is equipped with.
You can try TagSoup (http://www-users.cs.york.ac.uk/~ndm/tagsoup/), which isn't really a complete XML parser, but may do what you want. The other option is HaXml, to which Malcolm has been adding lazy parsing - I'm not sure if that is in a released variant or not.

Thanks,
Neil
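For illustration, here is a minimal sketch of streaming over a document with TagSoup. This assumes the current parseTags/TagOpen interface (the 2007 release may differ slightly), and countOpenTags is just a made-up example:

    import Text.HTML.TagSoup (Tag(..), parseTags)

    -- Count opening tags without ever holding the whole tag list in
    -- memory: parseTags produces its output lazily, so this runs in
    -- roughly constant space over a large input string.
    countOpenTags :: String -> Int
    countOpenTags = length . filter isOpen . parseTags
      where
        isOpen (TagOpen _ _) = True
        isOpen _             = False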

On Mon, 22 Oct 2007, Ketil Malde wrote:
I'm wondering what approach others use for non-toy XML data. Is the problem due to some error I have made, or should I just ignore the XML, and just parse it "manually" by dissecting bytestrings, or will another XML library serve better?
HXT uses Parsec, which is strict.

Henning Thielemann wrote:
HXT uses Parsec, which is strict.
Is it strict to the extent that it cannot produce any output at all until it has read the entire XML document? That would make HXT (and Parsec, for that matter) useless for a large percentage of tasks. Or is it just too strict in certain cases - like the case described by Ketil, where a certain element with a huge amount of CDATA content blows up the stack? That could probably be worked around.

Thanks,
Yitz

"Yitzchak Gale"
Henning Thielemann wrote:
HXT uses Parsec, which is strict.

I had a look at using HXT a while ago. Parsec is the least of the problems. HXT stores the XML as an explicit tree in memory, where the head has explicit references to the children. This means that the whole XML tree is stored in memory until the last child is processed. Also, this tree is stored inefficiently: everything is kept as non-shared Haskell strings. My experience is that a 30MB file (which is quite small for an XML file) can NOT be processed with 2GB of memory.
Rene.

On 10/22/07, Rene de Visser wrote:
I had a look at using HXT a while ago. Parsec is the least of the problems. HXT stores the XML as an explicit tree in memory, where the head has explicit references to the children.
What did you end up using? I've started building an app based on HXT, and that worries me. The fact that it produces a 10MB library file is also worrisome.

Justin

On Mon, 22 Oct 2007, Yitzchak Gale wrote:
Henning Thielemann wrote:
HXT uses Parsec, which is strict.
Is it strict to the extent that it cannot produce any output at all until it has read the entire XML document?
Yep. You might like to check my XML wrapper library, which interfaces to HaXml, HXT and TagSoup (which provides the laziest XML parser we have). It uses its own structure and cannot map all features of HaXml and HXT, but feel free to extend it as you need. I wrote it for HTML processing, so it may not be optimal for other XML applications.

http://darcs.haskell.org/wraxml/

"Yitzchak Gale"
Henning Thielemann wrote:
HXT uses Parsec, which is strict.
Is it strict to the extent that it cannot produce any output at all until it has read the entire XML document? That would make HXT (and Parsec, for that matter) useless for a large percentage of tasks.
Yes, and yes.

By contrast, the Utrecht parser combinator library gives "online" results, meaning that it delivers as much as it can without ambiguity. It is a bit like laziness, but it analyses the grammar to determine when it is safe to commit to a value, essentially once no error has been seen in a prefix of the input.

And the polyparse library has several variations of properly lazy parsers, which only return results on demand (but there might be parse errors hidden inside the returned values, as exceptions). The user (grammar-writer) decides where the results should be lazy or strict.

HaXml now uses the polyparse library, and you can choose whether you want well-formedness checking with the original strict parser, or lazy space-efficient on-demand parsing. Initial performance results show that lazy parsing is always more than 2x as fast as strict parsing, at half the peak memory usage. In some usage patterns, it can reduce the cost of processing from linear in the size of the document to a constant (the distance into the document to find a particular element).

I have just made fresh releases of development versions of these libraries, for those who would like to experiment:
    http://www.cs.york.ac.uk/fp/polyparse
    http://www.cs.york.ac.uk/fp/HaXml-devel
They are also available on hackage.haskell.org.

Regards,
Malcolm
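As a toy illustration of the on-demand idea (deliberately not HaXml or polyparse code; parseInts is made up for the example): results are produced lazily, and a parse error only surfaces if the offending value is actually forced.

    -- Parse a list of tokens into Ints.  Errors hide inside the
    -- returned values and are only raised when those values are forced.
    parseInts :: [String] -> [Int]
    parseInts = map parseOne
      where
        parseOne s = case reads s of
                       [(n, "")] -> n
                       _         -> error ("parse error near " ++ show s)

    -- take 2 (parseInts ["1","2","oops","4"])  yields  [1,2]
    -- the bad token "oops" is never forced, so no error is raised.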

Malcolm Wallace wrote:
HaXml now uses the polyparse library, and you can choose whether you want well-formedness checking with the original strict parser, or lazy space-efficient on-demand parsing. Initial performance results show that parsing XML lazily is always better than 2x as fast, and 1/2x peak memory usage of strict parsing.
Hey, that's great news!
In some usage patterns, it can reduce the cost of processing from linear in the size of the document, to a constant (the distance into the document to find a particular element).
Oh oh - does that mean that Ketil's original case (an element containing a large quantity of CDATA) could still be a problem? It ought to be possible to delay parsing the CDATA if needed.

Thanks,
Yitz

"Yitzchak Gale"
In some usage patterns, it can reduce the cost of processing from linear in the size of the document, to a constant (the distance into the document to find a particular element).
Oh oh - does that mean that Ketil's original case (an element containing a large quantity of CDATA) could still be a problem?
Not necessarily. If the CDATA is not actually needed, it is possible that it would simply be discarded automatically by the lazy demand pattern. That depends very much on how the consumer is written, however.

HaXml will still require a rather large amount of space to _lex_ the 25K-line text element into a single token, of course (I estimate no bigger than about 3MB, though). I have been considering moving the lexer to use ByteString instead of String, which would neatly solve that problem too.

Regards,
Malcolm

"Yitzchak Gale"
In some usage patterns, it can reduce the cost of processing from linear in the size of the document, to a constant (the distance into the document to find a particular element).
Oh oh - does that mean that Ketil's original case (an element containing a large quantity of CDATA) could still be a problem?
Don't know how interested you all are, but it's not just (P)CDATA, but a subsection (enclosed by a specific tag) that I need to process as a whole. HaXml is on my list after TagSoup, which I'm about to get to work, I think (got distracted a bit ATM).

-k
--
If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde writes:
HaXml on my list after TagSoup, which I'm about to get to work, I think (got distracted a bit ATM).
As it is, I managed to parse my document using TagSoup. One major obstacle was the need to process a sizeable partition of the file. Using 'partitions' from TagSoup (which is implemented using the 'groupBy (const (not . p))' trick) didn't work, as it requires space proportional to the partition size. My solution (and please forgive me, it is getting late at night here) was to replace it with (slightly different semantics alert):

    import Control.Parallel (par)

    -- Split the list at elements satisfying p, sparking evaluation of
    -- the rest of the list while the current chunk is consumed.
    breaks :: (a -> Bool) -> [a] -> [[a]]
    breaks _ []     = []
    breaks p (x:xs) =
        let first = x : takeWhile (not . p) xs
            rest  = dropWhile (not . p) xs
        in rest `par` first : if null rest then [] else breaks p rest

I have no idea how reliable this is, and I suspect it isn't very, but on the plus side it does seem to work, as long as I compile with -smp. Parsing 300MB of XML and outputting the information in 305K records takes approximately 5 minutes, and works with less than 1GB of heap. This is fast and small enough for my purposes.

Thanks for listening, and good night!

-k
--
If I haven't seen further, it is by standing in the footprints of giants
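For what it's worth, a hypothetical use of the breaks function above together with TagSoup (parseTags and isTagOpenName exist in current TagSoup releases; the record tag name "entry" is made up for the example):

    import Text.HTML.TagSoup (Tag, parseTags, isTagOpenName)

    -- Chop the lazily produced tag stream into per-record chunks, so
    -- that each chunk can be processed and then dropped.
    recordChunks :: String -> [[Tag String]]
    recordChunks = breaks (isTagOpenName "entry") . parseTags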

Another question about HaXml and HXT - what is the level of XML spec compliance?

The many who have tried to implement compliant XML parsers in various languages - and the few who have succeeded - all agree that this is much harder than it seems at first. Most of the time, the final result is an FFI call to expat.

Thanks,
Yitz

Yitzchak Gale wrote:
Another question about HaXml and HXT - what is the level of XML spec compliance?
Implementing the XML 1.0 standard was one of the goals of HXT when starting the project. This includes full support of DTD processing, which turned out to be the hardest part of the whole parsing and validating stuff. Another goal was to stay as near as possible to the XML spec; that meant separating the tasks in a clean way and doing it step by step: reading, decoding, parsing, substituting entities, implementing the include mechanism with external references, validating and normalizing.

In a second step we added an HTML parser, again with Parsec, to be able to process non-standard XML. Again, we didn't have "very large" XML documents in mind.

There is no technical reason against adding a third parser (or a fourth) accepting something like XML, perhaps without the DTD stuff, which works lazily. The only reason for not yet having done this is lack of time and manpower. So, dear Haskellers, feel free to contribute to HXT by developing a lazy parser, and we will integrate it into HXT.

This still does not solve the processing of "very very large" XML documents. I doubt whether we can do this with a DOM-like approach, as in HXT or HaXml. Lazy input does not solve all problems. A SAX-like parser could be a more useful choice for very large documents.

Uwe
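As a toy illustration of what a SAX-like, event-based interface buys you (the SaxEvent type below is made up, not HXT code): events can be consumed one at a time and garbage-collected as you go, so no tree for the whole document is ever built.

    -- A made-up event type for a SAX-style parse.
    data SaxEvent
      = StartTag String [(String, String)]   -- name, attributes
      | EndTag   String
      | Text     String

    -- Collect all character data without building a tree; only the
    -- current event needs to be live at any moment.
    allText :: [SaxEvent] -> String
    allText events = concat [s | Text s <- events]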

"Uwe Schmidt"
This still does not solve the processing of "very very large" XML documents. I doubt whether we can do this with a DOM-like approach, as in HXT or HaXml. Lazy input does not solve all problems. A SAX-like parser could be a more useful choice for very large documents.
Uwe
I think a step towards supporting medium-sized documents in HXT would be to store the tags and content more efficiently.

If I understand the coding correctly, every tag is stored as a separate Haskell string. As each character of a string under GHC takes 12 bytes, this alone leads to high memory usage. Tags tend to repeat, so you could store them uniquely using a hash table. Content could be stored in compressed byte strings.

As I mentioned in an earlier post, 2GB of memory is not enough to process a 30MB XML document in HXT, as we have 30 x 2 x 12 = 720 MB for starters just to store the string data (once in the parser and once in the DOM). (Well, a machine with 2GB of memory - I guess I had somewhere around 1GB free for the program; other overheads most likely used up the remaining 300 MB.)

Rene.

"Rene de Visser"
If I understand the coding correctly, every tag is stored as a separate Haskell string. As each character of a string under GHC takes 12 bytes, this alone leads to high memory usage.
Not that it detracts from your point, but I guess that is 24 bytes per character on 64-bit machinery? :-)

-k
--
If I haven't seen further, it is by standing in the footprints of giants

Rene de Visser wrote:
I think a step towards supporting medium-sized documents in HXT would be to store the tags and content more efficiently. If I understand the coding correctly, every tag is stored as a separate Haskell string. As each character of a string under GHC takes 12 bytes, this alone leads to high memory usage. Tags tend to repeat, so you could store them uniquely using a hash table. Content could be stored in compressed byte strings.
Yes, storing element and attribute names in a packed format, something similar to ByteString but for Unicode values, would reduce the amount of storage. A perhaps small shortcoming of that approach is the conversions between String and the internal representation when processing names.

The hashtable approach would of course reduce memory usage, but this would require a global change of the processing model: a document would then no longer consist of a single tree, it would always consist of a pair of a tree and a map.

By the way, the amount of memory used for strings ([Char] values) in Haskell is a problem for ALL text processing tasks. It is not limited to HXT, nor is it special to XML. For me, the efficiency problems with strings as lists of chars, and a possible solution, e.g. implementing String data transparently more efficiently than other lists, are an issue for Haskell' (or Haskell'') and/or a challenge for the language implementors.

Uwe
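To make that tradeoff concrete, a rough sketch of the two models (all type and constructor names here are invented for illustration, not HXT's actual types):

    import qualified Data.Map as Map

    -- Current model: every node carries its name inline as a String.
    data XmlTree  = XmlNode String [XmlTree]

    -- Interning model: nodes carry small keys, and one shared table
    -- maps keys back to the full names - a pair of a tree and a map.
    data XmlTree' = XmlNode' Int [XmlTree']
    type NameTab  = Map.Map Int String
    type Doc      = (NameTab, XmlTree')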

Uwe Schmidt wrote:
The hashtable approach would of course reduce memory usage,
Note that hashtables are evil :) I'm all for tries instead.
but this would require a global change of the processing model: a document would then no longer consist of a single tree, it would always consist of a pair of a tree and a map.
Ah! I got struck by a trick: it's possible to store every tag name in full only once, but still present a simple tree with full tag names to the user. You simply share all the tag names. For instance, in

    let x = "mytagname" in Tree (Tag x) [Tree (Tag x) [Text "foobar"]]

the string "mytagname" is stored in memory only once, although there are two tags with this name.

When parsing an XML file, you look up every parsed tag name in a finite map (i.e. the trie). If it's already in, you don't store the parsed tag name in the XML tree but the one stored at the leaf in the trie. Of course, these two strings are equal - but they're not (pointer-)identical! After parsing, all equal tag names are pointers to one and the same leaf in the finite map. You can drop the finite map structure afterwards; the pointers to the leaves will remain.

That would be quite the memory reduction for large XML trees, which are likely to have many, many identical tag names.

Regards,
apfelmus
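A minimal sketch of that interning step, using Data.Map in place of a trie (the function name intern is made up for the example):

    import qualified Data.Map as Map

    -- Look a tag name up in a map whose keys and values are the names
    -- themselves.  Equal names come back as the single shared copy, so
    -- the tree built from the interned names shares its strings.
    intern :: Ord a => a -> Map.Map a a -> (a, Map.Map a a)
    intern x m = case Map.lookup x m of
                   Just shared -> (shared, m)
                   Nothing     -> (x, Map.insert x x m)

Threading the map through the parse and discarding it afterwards is enough; the sharing survives in the finished tree.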

apfelmus wrote:
Ah! I got struck by a trick: it's possible to store every tag name in full only once but still present a simple tree with full tag names to the user. You simply share all the tag names. For instance, in
let x = "mytagname" in Tree (Tag x) [Tree (Tag x) [Text "foobar"]]
the string "mytagname" is stored in memory only once although there are two tags with this name.
When parsing an XML file, you look up every parsed tag name in a finite map (i.e. the trie). If it's already in, you don't store the parsed tag name in the XML tree but the one stored at the leaf in the trie. Of course, these two strings are equal - but they're not (pointer-)identical! After parsing, all equal tag names are pointers to one and the same leaf in the finite map. You can drop the finite map structure afterwards; the pointers to the leaves will remain.
That would be quite the memory reduction for large XML trees which are likely to have many many identical tag names.
I've also thought about this approach. It sounds a bit weird to build a map (or a trie) for the identity function, but it would solve part of the space problem, at least when building XML documents by parsing. So I guess there is a new project: a simple, small and lazy parser (no Parsec), at least for the content part of XML.

Uwe

"Yitzchak Gale"
Another question about HaXml and HXT - what is the level of XML spec compliance?
In HaXml, I certainly tried pretty hard to match the (draft) XML 1.0 spec, since the library was originally developed for a commercial entity. But that was back in 1999, and the spec has changed a little since then. You're right that it is a difficult target to hit, though. The spec is unnecessarily complex.

The major area of non-compliance in HaXml is that it doesn't do anything with input encodings. That is partly a limitation of Haskell implementations themselves, which have only recently gained libraries for the purpose. Otherwise, Dimitry Astapov has recently been sending me patches to fix some minor compliance problems, along with the QuickCheck properties that discovered them. His additions will probably appear in the next release of HaXml.

Let me emphasise that these non-compliances are very minor. Larger issues that remain open do not fall into the compliance spectrum, but are more about usability in practice. For instance: effective support for external catalogs of references; techniques to handle XML namespaces in a sensible fashion, etc.

Regards,
Malcolm