No copy XML parser (rough idea only)

Hi, an idea recently crossed my mind that describes a (possibly?) useful way of accessing XML data from Haskell that is very memory efficient – at least as long as you only read the XML; for XML processing and generation, other existing libraries are probably well suited. Given a (possibly very large) XML file as a ByteString, a „normal“ XML parser would have an algebraic data type for node etc., parsing the ByteString into a tree of haskell values. Simplified a bit, the data structure could look like this:
data Node = Node String [Attrib] [Node] | CData String
The in-memory representation of such a a data structure is presumably very large, compared to the XML ByteString. Additionally, all the content is copied in memory and I evaluated a part of the Node (e.g. the tag name), this is retained by the constructor as long as the Node, and thus the whole document tree, is retained. My idea is to copy as little as possible, and refer to the existing ByteString as much as possible. Remember that a substring of a ByteString points at the same memory region, just with a different offset and length. So, in my case, the datatype looks like
newtype Node = Node ByteString newtype CData = CData ByteString newtype Attrib = Attrib ByteString where the constructors would not be exported. Instead, accessor methods would be provided:
tagName :: Node -> String tagAttribs :: Node -> [Attrib] tagContent :: Node -> [Either Node CData] cDataContent :: CData -> ByteString
So when searching for a certain tag, the library would only have to generate (and retain) pointers to the ByteString, but the tag name themselves do not have to be retained after the comparision. If it turns out to be too expensive to parse the fragment that is referenced by Node on every invocation of tagContent, this could be made a field of the constructor. Due to lazy evaluation, it will only be evaluated when needed, but then kept in memory:
data Node = Node ByteString [Either Node CData]
About cDataContent: This method cannot just return the fragment of the document unaltered, unfortunately, as XML entities will have to be replaced. But if it returns a lazy ByteString, everything between the entities can be a chunk refering to the original data, again avoiding the creation of a copy. Now this is just a quick idea I wanted to share. I have not tried it yet, so I might be overlooking something which prevents the whole thing from working. It might also be the case that, due to lazy evaluation, the “usual” approach is actually well enough (at least if it uses ByteString instead of String). I’m curious what you think, Joachim Breitner -- Joachim Breitner e-Mail: mail@joachim-breitner.de Homepage: http://www.joachim-breitner.de ICQ#: 74513189 Jabber-ID: nomeata@joachim-breitner.de

The primary problem I see with this is that XML content is fundamentally text, not bytes. Using your types, two XML documents with identical content but different encodings will have different Haskell values (and thus be incorrect regarding Eq, Ord, etc). Additionally, since the original bytestring is shared in your types, potentially very large buffers could be locked in memory due to references held by only a small portion of the document. Chopping a document up into events or nodes creates some overhead due to the extra pointers, but allows unneeded portions to be freed. If you'd like memory-efficient text storage, using Bryan O'Sullivan's "text" package[1] is probably the best option. It uses packed Word16 buffers to store text as UTF-16. Probably not as efficient as a type backed by UTF-8, but it's much much better than String. I know the data types you specified are just examples, but they're leaving out some important XML features -- namespaces, entity references, etc. Consider either reading the XML spec, or perhaps use my package "xml-types"[2] as a starting point for designing your type hierarchy. [1] http://hackage.haskell.org/package/text [2] http://hackage.haskell.org/package/xml-types

On Fri, May 14, 2010 at 08:57:42AM -0700, John Millikin wrote:
Additionally, since the original bytestring is shared in your types, potentially very large buffers could be locked in memory due to references held by only a small portion of the document. Chopping a document up into events or nodes creates some overhead due to the extra pointers, but allows unneeded portions to be freed.
However, if your bytestring comes from mmap'ed memory this drawback wouldn't apply :D. -- Felipe.

Hi, Am Freitag, den 14.05.2010, 15:31 -0300 schrieb Felipe Lessa:
On Fri, May 14, 2010 at 08:57:42AM -0700, John Millikin wrote:
Additionally, since the original bytestring is shared in your types, potentially very large buffers could be locked in memory due to references held by only a small portion of the document. Chopping a document up into events or nodes creates some overhead due to the extra pointers, but allows unneeded portions to be freed.
However, if your bytestring comes from mmap'ed memory this drawback wouldn't apply :D.
exactly. Of course such a library would not be a general-purpose tool, but in cases where you know that you need most of the document for most of the time, e.g. when doing statistics on it, this would be acceptable. Also note that even after chopping into nodes, if you don’t make sure you drop the reference to root in a timely manner, the same thing would happen. Am Freitag, den 14.05.2010, 08:57 -0700 schrieb John Millikin:
The primary problem I see with this is that XML content is fundamentally text, not bytes. Using your types, two XML documents with identical content but different encodings will have different Haskell values (and thus be incorrect regarding Eq, Ord, etc).
The instances could be adapted... but this will be expensive, of course. One could also convert documents that are not utf-8 encoded as a whole and then work on that.
If you'd like memory-efficient text storage, using Bryan O'Sullivan's "text" package[1] is probably the best option. It uses packed Word16 buffers to store text as UTF-16. Probably not as efficient as a type backed by UTF-8, but it's much much better than String.
Right. For arbtt, I tried to switch from String to text, and it actually got slower. The reason (I think) was that besides passing strings around, it mainly runs pcre-light on them – which wants utf8-encoded bytestrings. I ended up creating a newtype¹ around utf8-encoded ByteStrings and the result was quite satisfying, both memory- and runtime-wise. I wish we had a package providing a standard type for this type that would become similarly popular. There is at least one more packages on hackage that defines this type: http://hackage.haskell.org/packages/archive/regex-tdfa-utf8/1.0/doc/html/Tex... Greetings, Joachim ¹ http://darcs.nomeata.de/arbtt/src/Data/MyText.hs -- Joachim Breitner e-Mail: mail@joachim-breitner.de Homepage: http://www.joachim-breitner.de ICQ#: 74513189 Jabber-ID: nomeata@joachim-breitner.de

Hi Joachim,
I have been playing around with this idea myself in TagSoup
(http://community.haskell.org/~ndm/tagsoup). The largest conceptual
problem I came across was that TagSoup decodes entities (i.e. >
becomes >). However, I think that's a minor issue, and entity
resolution can be turned off in TagSoup to guarantee copy-free
parsing.
The practical problem is that writing an HTML parser is hard, and that
writing it in a way that works on both String/ByteString/LBS and gets
the copy-free behaviour in the right places is tricky. I am still
working on program optimisation techniques that I think will make this
feasible (based around the idea of supercompilation, to expose the
underlying structure of the parser, without writing it in too painful
a manner). So with any luck your copy-free XML parser will happen
sooner or later.
Thanks, Neil
On Fri, May 14, 2010 at 8:20 PM, Joachim Breitner
Hi,
Am Freitag, den 14.05.2010, 15:31 -0300 schrieb Felipe Lessa:
On Fri, May 14, 2010 at 08:57:42AM -0700, John Millikin wrote:
Additionally, since the original bytestring is shared in your types, potentially very large buffers could be locked in memory due to references held by only a small portion of the document. Chopping a document up into events or nodes creates some overhead due to the extra pointers, but allows unneeded portions to be freed.
However, if your bytestring comes from mmap'ed memory this drawback wouldn't apply :D.
exactly. Of course such a library would not be a general-purpose tool, but in cases where you know that you need most of the document for most of the time, e.g. when doing statistics on it, this would be acceptable.
Also note that even after chopping into nodes, if you don’t make sure you drop the reference to root in a timely manner, the same thing would happen.
Am Freitag, den 14.05.2010, 08:57 -0700 schrieb John Millikin:
The primary problem I see with this is that XML content is fundamentally text, not bytes. Using your types, two XML documents with identical content but different encodings will have different Haskell values (and thus be incorrect regarding Eq, Ord, etc).
The instances could be adapted... but this will be expensive, of course.
One could also convert documents that are not utf-8 encoded as a whole and then work on that.
If you'd like memory-efficient text storage, using Bryan O'Sullivan's "text" package[1] is probably the best option. It uses packed Word16 buffers to store text as UTF-16. Probably not as efficient as a type backed by UTF-8, but it's much much better than String.
Right. For arbtt, I tried to switch from String to text, and it actually got slower. The reason (I think) was that besides passing strings around, it mainly runs pcre-light on them – which wants utf8-encoded bytestrings.
I ended up creating a newtype¹ around utf8-encoded ByteStrings and the result was quite satisfying, both memory- and runtime-wise. I wish we had a package providing a standard type for this type that would become similarly popular. There is at least one more packages on hackage that defines this type: http://hackage.haskell.org/packages/archive/regex-tdfa-utf8/1.0/doc/html/Tex...
Greetings, Joachim
¹ http://darcs.nomeata.de/arbtt/src/Data/MyText.hs
-- Joachim Breitner e-Mail: mail@joachim-breitner.de Homepage: http://www.joachim-breitner.de ICQ#: 74513189 Jabber-ID: nomeata@joachim-breitner.de
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (4)
-
Felipe Lessa
-
Joachim Breitner
-
John Millikin
-
Neil Mitchell