Re: haskell xml parsing for larger files?

Have you looked at tagsoup?
On Feb 20, 2014 3:30 AM, "Christian Maeder" <Christian.Maeder@dfki.de> wrote:

I've just tried:

import Text.HTML.TagSoup
import Text.HTML.TagSoup.Tree

main :: IO ()
main = getContents >>= putStr . renderTags . flattenTree . tagTree . parseTags

which also ends with the getMBlock error. Only "renderTags . parseTags" works fine (like the hexpat SAX parser).

Why should tagsoup be better suited for building trees from large files?

C.

On 20.02.2014 15:30, Chris Smith wrote:
Have you looked at tagsoup?
On Feb 20, 2014 3:30 AM, "Christian Maeder" <Christian.Maeder@dfki.de> wrote:

Hi,
I've got some difficulties parsing "large" xml files (> 100MB). A plain SAX parser, as provided by hexpat, is fine. However, constructing a tree consumes too much memory on a 32bit machine.
see http://trac.informatik.uni-bremen.de:8080/hets/ticket/1248
I suspect that sharing strings when constructing trees might greatly reduce memory requirements. What are suitable libraries for string pools?
Before trying to implement something myself, I'd like to ask who else has tried to process large xml files (and met similar memory problems)?
I have not yet investigated xml-conduit and hxt for our purpose. (These look scary.)
In fact, I've basically used the content trees from "The (simple) xml package", and switching to another tree type is no fun, in particular if it does not gain much.
Thanks Christian

_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
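As an aside on the string-pool question above: sharing could be prototyped with nothing more than Data.Map from containers. The sketch below is not taken from hexpat, tagsoup, or any existing pooling library; Pool, intern and internAll are hypothetical names invented for this example. The idea is that every tag or attribute name is looked up in the pool before it goes into the tree, so each distinct string is kept on the heap only once.

import qualified Data.Map.Strict as Map
import Data.List (mapAccumL)

-- Hypothetical string pool: maps every string seen so far to its first (shared) copy.
type Pool = Map.Map String String

-- Look a string up in the pool; on first sight insert it, afterwards reuse the stored copy.
intern :: Pool -> String -> (Pool, String)
intern pool s = case Map.lookup s pool of
  Just shared -> (pool, shared)
  Nothing     -> (Map.insert s s pool, s)

-- Thread the pool through many strings, e.g. all tag and attribute names of a document.
internAll :: Pool -> [String] -> (Pool, [String])
internAll = mapAccumL intern

Whether this alone makes a full tree of a >100MB document fit into a 32-bit heap is another question, but repeated names then cost one copy instead of one per occurrence.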

Ah, I'd misunderstood your question, and thought you were looking for a SAX-like alternative.
On Feb 20, 2014 6:57 AM, "Christian Maeder" <Christian.Maeder@dfki.de> wrote:
I've just tried:
import Text.HTML.TagSoup
import Text.HTML.TagSoup.Tree

main :: IO ()
main = getContents >>= putStr . renderTags . flattenTree . tagTree . parseTags
which also ends with the getMBlock error. Only "renderTags . parseTags" works fine (like the hexpat SAX parser).
Why should tagsoup be better suited for building trees from large files?
C.
On 20.02.2014 15:30, Chris Smith wrote:
Have you looked at tagsoup?
On Feb 20, 2014 3:30 AM, "Christian Maeder" <Christian.Maeder@dfki.de> wrote:

Hi,
I've got some difficulties parsing "large" xml files (> 100MB). A plain SAX parser, as provided by hexpat, is fine. However, constructing a tree consumes too much memory on a 32bit machine.
see http://trac.informatik.uni-bremen.de:8080/hets/ticket/1248
I suspect that sharing strings when constructing trees might greatly reduce memory requirements. What are suitable libraries for string pools?
Before trying to implement something myself, I'd like to ask who else has tried to process large xml files (and met similar memory problems)?
I have not yet investigated xml-conduit and hxt for our purpose. (These look scary.)
In fact, I've basically used the content trees from "The (simple) xml package", and switching to another tree type is no fun, in particular if it does not gain much.
Thanks Christian
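For reference, the pipeline Christian reports as working, "renderTags . parseTags", can be written as a complete program; this is only a minimal sketch of that streaming variant, assuming the document arrives on stdin as in his tree-building version. Because both functions are lazy and no TagTree is ever built, the tags flow through as a stream and memory use stays roughly constant regardless of file size.

import Text.HTML.TagSoup (parseTags, renderTags)

-- Read stdin lazily, tokenise it into tags, and render the tags straight back out.
-- Nothing is accumulated, so this copes with files far larger than the available heap.
main :: IO ()
main = interact (renderTags . parseTags)

Any per-tag processing that does not need the whole document at once (counting elements, extracting text, filtering) can be slotted in between parseTags and renderTags in the same style; tagsoup can also parse ByteString or Text input, which should reduce the per-character overhead of String further.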
participants (2)
- Chris Smith
- Christian Maeder