On Mon, Apr 19, 2010 at 3:01 AM, Daniil Elovkov <daniil.elovkov@googlemail.com> wrote:
Hello haskellers!

I'm trying to process an XML file with as small a memory footprint as possible. SAX is fine for my case, and I think it's the lightest approach available. So, I'm looking at HaXml.SAX

I'm surprised to see that it takes about 56-60 MB of RAM. That figure stays roughly constant with respect to the XML file size, which is expected; it only grows slightly as I recursively traverse the list of SAX events. But it still seems like too much.

For me these sorts of problems always call for an investigation into the root cause.  I'm just not good enough at predicting what is causing the memory consumption.  Thankfully, GHC has great tools for this sort of investigative work.  The book Real World Haskell documents how to use them:
http://book.realworldhaskell.org/read/profiling-and-optimization.html

If you haven't already, I highly recommend looking at the profiling graphs.  See if you can figure out if your program has any space leaks.
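
As a quick recipe (using the profiling flags in current GHC 6.x releases, and assuming your program is Main.hs and your input file is big.xml), you can build with profiling enabled and ask the runtime for a heap profile broken down by cost centre:

  $ ghc -O2 --make -prof -auto-all Main.hs
  $ ./Main big.xml +RTS -hc -p -RTS
  $ hp2ps -c Main.hp

That renders the heap profile as a colour graph in Main.ps; the +RTS -hy (by type) and -hd (by closure description) breakdowns are also useful views when hunting a space leak.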
 

The size of the file ranges from 1 MB to 20 MB.

The code is something like this:

import System.Environment (getArgs)
import System.IO
import Text.XML.HaXml.SAX (saxParse)

main :: IO ()
main = do
   (fn:_) <- getArgs
   h <- openFile fn ReadMode
   c <- hGetContents h
   -- 'proc' (elided here) folds the list of SaxElements down to a String
   let out = proc $ fst $ saxParse fn c
   putStrLn out
   _ <- getChar
   return ()

For such a simple program you won't run into any problems with lazy IO, but as your program grows in complexity it will very likely come back to bite you.  If you're not familiar with lazy IO, I'm referring to the hGetContents call.  Some example problems:
1) If you opened many files this way, you could run out of file handles: lazy IO closes handles at unpredictable times, and file handles are a scarce resource.  The safe-io package on hackage can help you avoid this particular pitfall.
2) Reading of the file will happen during your pure code.  This means that IO exceptions can be raised from within your pure code, and that in some ways you'll be able to observe side effects there (see the small example after this list).
3) If you were to reference 'c' from two places in main, the GC could not collect any of it until both references were dead; to avoid that space leak, you'd have to load the data twice.
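
To make the side-effect point concrete, here's a minimal sketch (the file name is made up) showing how the timing of the actual read can surprise you: the handle is closed before the lazy string is ever forced, so the program prints an empty line instead of the file's contents.

import System.IO

main :: IO ()
main = do
   h <- openFile "input.txt" ReadMode
   c <- hGetContents h
   hClose h      -- closes the handle before 'c' has been forced
   putStrLn c    -- too late: the stream was truncated at the close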

I'm sure there are other things that can go wrong that I've missed.

I think iteratees are slowly catching on as an alternative to lazy IO.  Basically, the iteratee approach uses a left-fold style to stream the data and process it in chunks, with some support for exception handling.  Unfortunately, I think it may also require a special SAX parser geared specifically towards iteratee use.  Having an iteratee-based SAX parser would make processing large XML streams very convenient in Haskell.  Hint, hint, if you want to write a library :)  (Or maybe it exists; I admit I haven't checked.)
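
To make the left-fold idea concrete, here's a minimal sketch of the chunked-streaming style (my own illustration, not the iteratee package's actual API; 'foldChunks' and "big.xml" are made up for the example).  Only one strict chunk is resident at a time, and the accumulator is forced at every step, so the whole file is never in memory at once:

import qualified Data.ByteString as B
import System.IO

-- Fold a step function over fixed-size strict chunks of a file.
foldChunks :: (a -> B.ByteString -> a) -> a -> Handle -> IO a
foldChunks step acc h = do
   chunk <- B.hGet h 4096
   if B.null chunk
      then return acc
      else let acc' = step acc chunk
           in acc' `seq` foldChunks step acc' h

-- Example: count the bytes in a file in constant space.
main :: IO ()
main = do
   h <- openFile "big.xml" ReadMode
   n <- foldChunks (\acc c -> acc + B.length c) (0 :: Int) h
   hClose h
   print n

An iteratee library packages the same pattern up with early termination and error handling, but the core is just this strict left fold over chunks.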

I hope that helps,
Jason