
Hello David,
I tried to use HXT's readDocument with its tagsoup option for my application, but I couldn't find a way to construct the operation so that it didn't run out of memory. I'll attach some code using HaXml's saxParse so you can see what I want. Is that easy to do in HXT? I simply want the text of the <PMID> and <AbstractText> elements.
Here's an example that reads the input lazily. I ran it in ghci on a file containing 2^20 XML elements, about 18 MB in size. A normal parse with the standard Parsec parser ran out of memory on my 1 GB box. This version used about 200 MB at most within ghci.

------------------------------------
module Main where

import Text.XML.HXT.Arrow
import System.Environment (getArgs)

main :: IO ()
main = do
  (name:_) <- getArgs
  runX ( readDoc name
         >>>
         fromLA ( deep ( hasName "PMID"           -- select the nodes
                         <+>
                         hasName "AbstractText"
                       )
                  >>>
                  getChildren                     -- get the text
                  >>>
                  getText
                )
         >>>
         arrIO putStrLn
       )
  putStrLn "main finished"

readDoc = readDocument [ (a_tagsoup, v_1)
                       , (a_parse_xml, v_1)
                       , (a_remove_whitespace, v_1)
                       , (a_encoding, isoLatin1)
                       , (a_issue_warnings, v_0)
                       , (a_trace, "1")
                       ]
---------------------

Cheers,
Uwe Schmidt

--
Uwe Schmidt
Web: http://www.fh-wedel.de/~si/