
Hi, an idea recently crossed my mind that describes a (possibly?) useful way of accessing XML data from Haskell that is very memory efficient – at least as long as you only read the XML; for XML processing and generation, other existing libraries are probably well suited. Given a (possibly very large) XML file as a ByteString, a „normal“ XML parser would have an algebraic data type for node etc., parsing the ByteString into a tree of haskell values. Simplified a bit, the data structure could look like this:
data Node = Node String [Attrib] [Node] | CData String
The in-memory representation of such a a data structure is presumably very large, compared to the XML ByteString. Additionally, all the content is copied in memory and I evaluated a part of the Node (e.g. the tag name), this is retained by the constructor as long as the Node, and thus the whole document tree, is retained. My idea is to copy as little as possible, and refer to the existing ByteString as much as possible. Remember that a substring of a ByteString points at the same memory region, just with a different offset and length. So, in my case, the datatype looks like
newtype Node = Node ByteString newtype CData = CData ByteString newtype Attrib = Attrib ByteString where the constructors would not be exported. Instead, accessor methods would be provided:
tagName :: Node -> String tagAttribs :: Node -> [Attrib] tagContent :: Node -> [Either Node CData] cDataContent :: CData -> ByteString
So when searching for a certain tag, the library would only have to generate (and retain) pointers to the ByteString, but the tag name themselves do not have to be retained after the comparision. If it turns out to be too expensive to parse the fragment that is referenced by Node on every invocation of tagContent, this could be made a field of the constructor. Due to lazy evaluation, it will only be evaluated when needed, but then kept in memory:
data Node = Node ByteString [Either Node CData]
About cDataContent: This method cannot just return the fragment of the document unaltered, unfortunately, as XML entities will have to be replaced. But if it returns a lazy ByteString, everything between the entities can be a chunk refering to the original data, again avoiding the creation of a copy. Now this is just a quick idea I wanted to share. I have not tried it yet, so I might be overlooking something which prevents the whole thing from working. It might also be the case that, due to lazy evaluation, the “usual” approach is actually well enough (at least if it uses ByteString instead of String). I’m curious what you think, Joachim Breitner -- Joachim Breitner e-Mail: mail@joachim-breitner.de Homepage: http://www.joachim-breitner.de ICQ#: 74513189 Jabber-ID: nomeata@joachim-breitner.de