[Haskell-cafe] No copy XML parser (rough idea only)

14 May 2010

      Hi,

an idea recently crossed my mind that describes a (possibly?) useful way
of accessing XML data from Haskell that is very memory efficient – at
least as long as you only read the XML; for XML processing and
generation, other existing libraries are probably well suited.

Given a (possibly very large) XML file as a ByteString, a „normal“ XML
parser would have an algebraic data type for node etc., parsing the
ByteString into a tree of haskell values. Simplified a bit, the data
structure could look like this:
...
data Node = Node String [Attrib] [Node] | CData String
The in-memory representation of such a a data structure is presumably
very large, compared to the XML ByteString. Additionally, all the
content is copied in memory and I evaluated a part of the Node (e.g. the
tag name), this is retained by the constructor as long as the Node, and
thus the whole document tree, is retained.

My idea is to copy as little as possible, and refer to the existing
ByteString as much as possible. Remember that a substring of a
ByteString points at the same memory region, just with a different
offset and length.

So, in my case, the datatype looks like
...
newtype Node = Node ByteString
newtype CData = CData ByteString
newtype Attrib = Attrib ByteString
where the constructors would not be exported. Instead, accessor methods
would be provided:
...
tagName :: Node -> String
tagAttribs :: Node -> [Attrib]
tagContent :: Node -> [Either Node CData]
cDataContent :: CData -> ByteString
So when searching for a certain tag, the library would only have to
generate (and retain) pointers to the ByteString, but the tag name
themselves do not have to be retained after the comparision.

If it turns out to be too expensive to parse the fragment that is
referenced by Node on every invocation of tagContent, this could be made
a field of the constructor. Due to lazy evaluation, it will only be
evaluated when needed, but then kept in memory:
...
data Node = Node ByteString [Either Node CData]
About cDataContent: This method cannot just return the fragment of the
document unaltered, unfortunately, as XML entities will have to be
replaced. But if it returns a lazy ByteString, everything between the
entities can be a chunk refering to the original data, again avoiding
the creation of a copy.

Now this is just a quick idea I wanted to share. I have not tried it
yet, so I might be overlooking something which prevents the whole thing
from working. It might also be the case that, due to lazy evaluation,
the “usual” approach is actually well enough (at least if it uses
ByteString instead of String). I’m curious what you think,

Joachim Breitner

-- 
Joachim Breitner
  e-Mail: mail@joachim-breitner.de
  Homepage: http://www.joachim-breitner.de
  ICQ#: 74513189
  Jabber-ID: nomeata@joachim-breitner.de