Hexpat: Lazy I/O problem with huge input files

Hello Haskell Cafe,

I really hope this is the right list for this sort of question. I've bugged the folks in #haskell, they say go here, so I'm turning to you.

I want to use Hexpat to read in some humongous XML files (linguistic corpora), since it's the only Haskell XML library (I could find) that takes ByteStrings as input. I stumbled on a problem when using one of the examples from the docs of Text.XML.Expat.Tree. The "cookbook recipe" there suggests *first* processing the data, and only then looking into the parser error to see if there has been an error. I understand this should prevent the parse tree from being fully evaluated before use. Unfortunately, that is not what happens on my system (ghc 6.12.1, if that's of importance).

This is the code from the docs, which I modified to read files:

import Text.XML.Expat.Tree
import System.Environment (getArgs)
import Control.Monad (liftM)
import qualified Data.ByteString.Lazy as C

-- This is the recommended way to handle errors in lazy parses
main = do
    f <- liftM head getArgs >>= C.readFile
    let (tree, mError) = parse defaultParseOptions f
    print (tree :: UNode String)

    -- Note: We check the error _after_ we have finished our processing
    -- on the tree.
    case mError of
        Just err -> putStrLn $ "It failed : "++show err
        Nothing -> putStrLn "Success!"

Given a 42M test file, an invocation like this:

% ghc --make -O2 Hexpat.hs
% ./Hexpat input.xml > dump.xml

will gobble up some 2 GB of RAM (at least; I usually kill it before it starts thrashing the swap space, since that almost crashes my entire machine).

If I remove the last 3 lines:

import Text.XML.Expat.Tree
import System.Environment (getArgs)
import Control.Monad (liftM)
import qualified Data.ByteString.Lazy as C

main = do
    f <- liftM head getArgs >>= C.readFile
    let (tree, mError) = parse defaultParseOptions f
    print (tree :: UNode String)

the same invocation and input file barely uses a megabyte or two of RAM and finishes really quickly.

Why is that? Is this a mistake in the Hexpat docs, or am I doing something wrong? Lazy IO has always been a little bit of a mystery to me, and just when I thought I had it...

Thanks for any help on the matter!
Aleks

On Wednesday 13 October 2010 23:06:04, Aleksandar Dimitrov wrote:
> Hello Haskell Cafe,
>
> I really hope this is the right list for this sort of question. I've bugged the folks in #haskell, they say go here, so I'm turning to you.
>
> I want to use Hexpat to read in some humongous XML files (linguistic corpora,) since it's the only Haskell XML library (I could find) that takes ByteStrings as input. I stumbled on a problem when using one of the examples from the docs of Text.XML.Expat.Tree. The "cookbook recipe" there suggests *first* processing the data, and only then looking into the parser error to see if there has been an error. I understand this should prevent the parse tree from being fully evaluated before use. Unfortunately, that is not what happens on my system (ghc 6.12.1, if that's of importance.)
>
> This is the code from the docs, that I modified to read files:
>
> import Text.XML.Expat.Tree
> import System.Environment (getArgs)
> import Control.Monad (liftM)
> import qualified Data.ByteString.Lazy as C
>
> -- This is the recommended way to handle errors in lazy parses
> main = do
>     f <- liftM head getArgs >>= C.readFile
>     let (tree, mError) = parse defaultParseOptions f
>     print (tree :: UNode String)
>
>     -- Note: We check the error _after_ we have finished our processing
>     -- on the tree.
>     case mError of
>         Just err -> putStrLn $ "It failed : "++show err
>         Nothing -> putStrLn "Success!"
>
> Given a 42M test file, an invocation like this:
>
> % ghc --make -O2 Hexpat.hs
> % ./Hexpat input.xml > dump.xml
>
> will gobble up some 2Gigs of RAM (at least. I usually kill it before it starts thrashing the swap space, since that almost crashes my entire machine.)

I don't know Hexpat at all, so I can only guess. Perhaps due to the laziness of let-bindings, mError keeps a reference to the entire tuple, thus preventing tree from being garbage collected as it is consumed by print.

Try

main = do
    f <- liftM head getArgs >>= C.readFile
    case parse defaultParseOptions f of
        (tree, mError) -> do
            print (tree :: UNode String)
            case mError of
                Just err -> putStrLn $ "It failed: " ++ show err
                Nothing -> putStrLn "Success!"

It may fix the leak, change nothing, or make it worse.
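
The retention hypothesis above can be modelled without Hexpat. What follows is a minimal sketch under stated assumptions, not Hexpat's API and not a diagnosis of its internals: `parseLazy`, `firstError`, `twoPass`, `onePass` and `mkInput` are hypothetical names, and negative numbers stand in for malformed input. In `twoPass`, the unevaluated error thunk still refers to the start of the input while the items are being consumed, so the consumed part may not be collectable. In `onePass`, items and error are handled in a single traversal, so nothing needs to hold on to the start of the input.

```haskell
module Main where

import Data.List (foldl')

-- Hypothetical model of a lazy parser (NOT Hexpat's API): it returns the
-- items lazily, plus an error that is only known after scanning the input.
parseLazy :: [Int] -> ([Int], Maybe String)
parseLazy input = (takeWhile (>= 0) input, firstError input)

-- Scan for the first negative number, our stand-in for a parse error.
firstError :: [Int] -> Maybe String
firstError xs = case dropWhile (>= 0) xs of
  []      -> Nothing
  (x : _) -> Just ("bad item: " ++ show x)

-- Leak-prone shape: consume the items first, inspect the error afterwards.
-- The pending `firstError input` thunk references the head of `input`
-- for the whole time `items` is being summed.
twoPass :: [Int] -> (Int, Maybe String)
twoPass input =
  let (items, mError) = parseLazy input
      total = foldl' (+) 0 items
  in total `seq` (total, mError)

-- Single traversal: process items and detect the error in the same pass.
onePass :: [Int] -> (Int, Maybe String)
onePass = go 0
  where
    go acc [] = (acc, Nothing)
    go acc (x : xs)
      | x < 0     = (acc, Just ("bad item: " ++ show x))
      | otherwise = let acc' = acc + x in acc' `seq` go acc' xs

-- Hypothetical input generator: n good items followed by one bad one.
mkInput :: Int -> [Int]
mkInput n = [1 .. n] ++ [-1]

main :: IO ()
main = do
  print (onePass (mkInput 1000000))
  print (twoPass (mkInput 1000000))
```

Both functions compute the same result; only the shape of the traversal differs. Whether a given compiler version and optimisation level actually shows a memory difference is something to measure rather than assume.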

Hello Daniel,
> I don't know Hexpat at all, so I can only guess.
> Perhaps due to the laziness of let-bindings, mError keeps a reference to the entire tuple, thus preventing tree from being garbage collected as it is consumed by print.

Thanks for your input. I think you are right: the parse tree isn't freed as the parse proceeds if mError is forced later on in the program (anywhere). I don't think it has anything to do with the tuple constructor or 'let' itself, but I'm also not very proficient at figuring these kinds of things out, so I may be very wrong. I did do the following test to support my hypothesis:

import Text.XML.Expat.Tree
import System.Environment (getArgs)
import Control.Monad (liftM)
import qualified Data.ByteString.Lazy as C

-- This is the recommended way to handle errors in lazy parses
main = do
    f <- liftM head getArgs >>= C.readFile
    let (_, mError) = parse defaultParseOptions f :: (UNode String, Maybe XMLParseError)

    case mError of
        Just err -> putStrLn $ "It failed : "++show err
        Nothing -> putStrLn "Success!"

I.e., keeping the parse tree is forced by the evaluation of mError. There is not a single reference to the parse tree within the program itself (unless I'm not noticing some sort of do-notation magic in the whole thing here...) It is interesting (and rather unfortunate) that just evaluating a potential error seems to block garbage collection.

If I'm correct (and I hope I'm not!) this seems to prevent using lazy I/O in Hexpat if you want to know if there's a parse error (and if so, what that would be). I'll contact the author, maybe it's a genuine bug?

Thanks again :-)
Aleks
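
Memory comparisons like the ones in this thread can be made reproducible by reading the runtime's own statistics from inside the program. The sketch below assumes a modern GHC whose `base` provides `GHC.Stats.getRTSStats` (this API postdates the ghc 6.12.1 mentioned above) and a program started with `+RTS -T`, for example by linking with `-with-rtsopts=-T`. `reportPeak` is a hypothetical helper name, and the two list folds are stand-ins for a streaming and a retaining workload, not Hexpat code.

```haskell
module Main where

import Control.Exception (evaluate)
import Data.List (foldl')
import GHC.Stats (getRTSStats, getRTSStatsEnabled, max_live_bytes)

-- Hypothetical helper: run an action, then print the peak live heap the
-- RTS has recorded so far. The figure is cumulative for the process and
-- is refreshed by the RTS at major garbage collections.
reportPeak :: String -> IO a -> IO a
reportPeak label action = do
  result  <- action
  enabled <- getRTSStatsEnabled
  if enabled
    then do
      stats <- getRTSStats
      putStrLn (label ++ ": peak live bytes = " ++ show (max_live_bytes stats))
    else putStrLn (label ++ ": run with +RTS -T to collect statistics")
  return result

main :: IO ()
main = do
  -- Streaming workload: the list is consumed once and not referenced again.
  _ <- reportPeak "streaming"
         (evaluate (foldl' (+) 0 [1 .. 2000000 :: Int]))
  -- Retaining workload: the list is needed again after the fold, so it
  -- has to stay alive while the fold runs.
  let xs = [1 .. 2000000 :: Int]
  _ <- reportPeak "retaining"
         (evaluate (foldl' (+) 0 xs + length xs))
  return ()
```

Wrapping the tree-processing step and the error check of the test program in such a helper would show at which point the live heap grows, without having to watch the process from outside.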

Hi Aleks,

Did you (or anyone) ever resolve this? I'm having precisely the same problem.

Eric

participants (3)
- Aleksandar Dimitrov
- Daniel Fischer
- thinkingeric