[Haskell-cafe] Re: [Haskell] Re: Current XML libraries status

27 Oct 2008


Hello David,
...
I tried to use HXT's readDocument with its tagsoup option for my
application. I couldn't find a way to construct the operation that
didn't run out of memory. I'll attach some code using HaXml's
saxParse so you can see what I want. Is that easy to do in HXT?
I simply want the text of <PMID> and <AbstractText> elements.

Here's an example that reads the input lazily. I ran it in ghci
on a file containing 2^20 XML elements; the file was about 18 MB
in size. A normal parse with the standard Parsec parser ran out of
memory on my 1 GB box. This version used at most about 200 MB
within ghci.

------------------------------------

module Main where

import Text.XML.HXT.Arrow
import System.Environment (getArgs)

main
    = do
      (name:_) <- getArgs
      runX ( readDoc name
             >>>
             fromLA (deep (hasName "PMID"          -- select the nodes
                           <+>
                           hasName "AbstractText"
                          )
                     >>>
                     getChildren                   -- get the text
                     >>>
                     getText
                    )
             >>>
             arrIO putStrLn
           )
      putStrLn "main finished"

readDoc
    = readDocument [ (a_tagsoup, v_1)
                   , (a_parse_xml, v_1)
                   , (a_remove_whitespace, v_1)
                   , (a_encoding, isoLatin1)
                   , (a_issue_warnings, v_0)
                   , (a_trace, "1")
                   ]

---------------------
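
To try the program above on a large input, a test file can be generated with something like the following. This is only a sketch: the <Articles> and <Article> wrapper elements, the record contents and the command-line arguments are assumptions made for illustration, not the layout of the file used in the measurement above. It uses only the base library.

```haskell
module Main where

import System.Environment (getArgs)

-- One hypothetical record. The <PMID> and <AbstractText> names match the
-- query in the program above; the <Article> wrapper is an assumption.
record :: Int -> String
record i =
  "<Article><PMID>" ++ show i ++ "</PMID>"
    ++ "<AbstractText>abstract " ++ show i ++ "</AbstractText></Article>\n"

-- A document with n records under a single (hypothetical) root element.
-- The result is a lazy String, so it is produced piece by piece.
document :: Int -> String
document n =
  "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<Articles>\n"
    ++ concatMap record [1 .. n]
    ++ "</Articles>\n"

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path, count] -> writeFile path (document (read count))
    _             -> putStrLn "usage: gen FILE COUNT"
```

For example, "runghc Gen.hs test.xml 100000" (Gen.hs being a hypothetical file name) writes 100000 records to test.xml, which can then be passed as the argument to the program above.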
    

Cheers,

  Uwe Schmidt

-- 

Uwe Schmidt
Web: http://www.fh-wedel.de/~si/