Hexpat: Lazy I/O problem with huge input files

13 Oct 2010

Hello Haskell Cafe,

I really hope this is the right list for this sort of question. I've
bugged the folks in #haskell, and they told me to ask here, so I'm
turning to you.

I want to use Hexpat to read in some humongous XML files (linguistic
corpora), since it's the only Haskell XML library I could find that
takes ByteStrings as input. I stumbled on a problem when using one of
the examples from the docs of Text.XML.Expat.Tree. The "cookbook
recipe" there suggests *first* processing the data, and only then
inspecting the parser's error value to see whether the parse failed.
As I understand it, this should let the tree be consumed lazily
instead of being fully evaluated before use. Unfortunately, that is
not what happens on my system (GHC 6.12.1, if that matters).

This is the code from the docs, modified to read its input from a file:
...
import Text.XML.Expat.Tree
import System.Environment (getArgs)
import Control.Monad (liftM)
import qualified Data.ByteString.Lazy as C

-- This is the recommended way to handle errors in lazy parses
main = do
    f <- liftM head getArgs >>= C.readFile
    let (tree, mError) = parse defaultParseOptions f
    print (tree :: UNode String)

    -- Note: We check the error _after_ we have finished our processing
    -- on the tree.
    case mError of
        Just err -> putStrLn $ "It failed : "++show err
        Nothing -> putStrLn "Success!"

Given a 42 MB test file, an invocation like this:

% ghc --make -O2 Hexpat.hs
% ./Hexpat input.xml > dump.xml

will gobble up at least 2 GB of RAM. (I usually kill it before it
starts thrashing the swap space, since that almost crashes my entire
machine.) If I remove the last three lines (the case expression on
mError):
...
import Text.XML.Expat.Tree
import System.Environment (getArgs)
import Control.Monad (liftM)
import qualified Data.ByteString.Lazy as C

main = do
    f <- liftM head getArgs >>= C.readFile
    let (tree, mError) = parse defaultParseOptions f
    print (tree :: UNode String)

the same invocation and input file barely uses a megabyte or two of
RAM and finishes really quickly.

Why is that? Is this a mistake in the Hexpat docs, or am I doing
something wrong? Lazy IO has always been a little bit of a mystery to
me, and just when I thought I had it...
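
For what it's worth, here is a Hexpat-free sketch of one thing that might be going on. It is only a guess, not something confirmed against the library's source. It uses nothing but base, and lazyParse is a hypothetical stand-in for a lazy parser, not part of Hexpat's API. The idea: if the error value is computed from the whole output, then keeping the error around for later also keeps the output reachable while it is being consumed.

```haskell
import Data.List (foldl')

-- Hypothetical stand-in for a lazy parser (NOT Hexpat's API): it yields
-- its output lazily, plus an error value that can only be known once the
-- whole input has been examined.
lazyParse :: [Int] -> ([Int], Maybe String)
lazyParse input = (output, err)
  where
    output = map (* 2) input
    -- err depends on the entire output, so the unevaluated err thunk
    -- holds a reference to the head of the output list.
    err
        | any (< 0) output = Just "negative element"
        | otherwise        = Nothing

main :: IO ()
main = do
    let (output, err) = lazyParse [1 .. 5000000]
    -- Consume the output once, in a streaming fashion.
    print (foldl' (+) 0 output)
    -- err is still needed here, so the cells of output that the fold has
    -- already visited stay reachable through err and cannot be collected.
    case err of
        Just e  -> putStrLn ("It failed: " ++ e)
        Nothing -> putStrLn "Success!"
```

In this sketch, dropping the final case expression means err is never demanded, so the list should be collectable as the fold walks it. That would match the behaviour I see with Hexpat, but I have not verified that the same thing happens inside the library.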

Thanks for any help on the matter!
Aleks

Aleksandar Dimitrov
