Re: [Haskell-cafe] memory needed for SAX parsing XML

Message: 8
Date: Tue, 20 Apr 2010 12:08:36 +0400
From: Daniil Elovkov
Subject: Re: [Haskell-cafe] memory needed for SAX parsing XML
To: Haskell-Cafe
Message-ID: <4BCD6104.50508@googlemail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Jason Dagit wrote:
On Mon, Apr 19, 2010 at 3:01 AM, Daniil Elovkov <daniil.elovkov@googlemail.com> wrote:

I think iteratees are slowly catching on as an alternative to lazy IO. Basically, the iteratee approach uses a left-fold style to stream the data and process it in chunks, including some exception handling. Unfortunately, I think it may also require a special SAX parser that is specifically geared towards iteratee use. Having an iteratee-based SAX parser would make processing large XML streams very convenient in Haskell. Hint, hint, if you want to write a library :) (Or maybe it exists; I admit that I haven't checked.)
Iteratees seem like a natural thing if we want to completely avoid unsafeInterleaveIO. But the presence of the latter is so good for modularity.
We can glue IO code and pure code of the type String -> a so seamlessly. In case of iteratees, as I understand, pure functions of the type String -> a would no longer be usable with IO String result. The signature (and the code itself) would have to be changed to be left fold.
To some extent, yes, although the amount of work required can vary. The general rule of thumb is that functions can be used directly with iteratees if they can work strictly. Functions that rely on laziness need some adaptation, although how much varies. If you've built your string handling functions out of a parser combinator library, e.g. parsec-3 or attoparsec, you can just lift the parser into an iteratee and use all your existing functions.

Since you're using HaXml, this should work. The only missing part is a polyparse-iteratee converter. I haven't used polyparse, but it looks like the converter would be similar to the one used for attoparsec in the attoparsec-iteratee package.

That said, it's not how I would do it. Since SAX is a stream processor, an iteratee-based SAX implementation would be a good fit. I would write a lexer and parser (using a parser combinator library) and then use those with Data.Iteratee.convStream. That's how I would write an iteratee-based SAX parser. HaXml already includes a suitable lexer and parser, but unfortunately they're not exposed.
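To make the "left fold" point concrete, here is a minimal sketch using only base, not the iteratee package. It contrasts a pure String -> a consumer with the same computation written as a step function driven over a handle. The names countLinesLazy, foldLines and countLinesFold are made up for illustration, and a real iteratee library would also handle chunking, early termination and exceptions.

```haskell
import System.IO (Handle, IOMode (ReadMode), hGetLine, hIsEOF, withFile)

-- A pure consumer of the usual lazy-IO shape, String -> a.
countLinesLazy :: String -> Int
countLinesLazy = length . lines

-- Hypothetical driver: feeds a handle to a left fold one line at a time.
-- The accumulator is forced at each step, so only one line is live at once.
foldLines :: (acc -> String -> acc) -> acc -> Handle -> IO acc
foldLines step = go
  where
    go acc h = do
      eof <- hIsEOF h
      if eof
        then return acc
        else do
          line <- hGetLine h
          let acc' = step acc line
          acc' `seq` go acc' h

-- The same line count expressed as a step function plus a seed value.
countLinesFold :: FilePath -> IO Int
countLinesFold path = withFile path ReadMode (foldLines (\n _ -> n + 1) 0)
```

The pure function had to be rewritten as a step function, which is the cost discussed above. In exchange, the file handle is closed deterministically when withFile returns.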
Another (additional) approach would be to encapsulate unsafeInterleaveIO within some routine and not let it go out into the wild.
lazilyDoWithIO :: IO a -> (a -> b) -> IO b
It would use unsafeInterleaveIO internally but catch all IO errors within itself.
I wonder if this is a reasonable idea? Maybe it's already done? So the topic is shifting...
doWithIO :: NFData b => IO a -> (a -> b) -> IO b
doWithIO m f = liftM (\a -> let b = f a in b `deepseq` b) m

It works (just stick it in a "try" block for error handling), but you need to write a lot of NFData instances. You also need to be careful that b is some sort of reduced structure, or you can end up forcing the whole file (or other data) into memory.

It also doesn't help with other IO effects, e.g. writing output. I consider this one of the nicest features of iteratee-based processing.

John
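A runnable variant of the doWithIO idea, as a sketch only. One caveat, as far as I can tell: because return in IO is lazy, the liftM version hands back an unevaluated thunk, so the deepseq only fires when the caller later demands the result, outside any surrounding "try". The version below forces the result inside the IO action with evaluate and force. The names doWithIO', tryDoWithIO and wordCount are illustrative, not from the thread.

```haskell
import Control.DeepSeq (NFData, force)
import Control.Exception (IOException, evaluate, try)

-- Variant of doWithIO that fully evaluates the result while still in IO,
-- so lazy reads finish (and IO errors surface) before the action returns.
doWithIO' :: NFData b => IO a -> (a -> b) -> IO b
doWithIO' m f = m >>= \a -> evaluate (force (f a))

-- Hypothetical wrapper: report IO errors raised during forcing as a Left.
tryDoWithIO :: NFData b => IO a -> (a -> b) -> IO (Either IOException b)
tryDoWithIO m f = try (doWithIO' m f)

-- Example: count the words of a lazily read file. The result is a single
-- Int (a "reduced structure"), so forcing it does not retain the contents.
wordCount :: FilePath -> IO (Either IOException Int)
wordCount path = tryDoWithIO (readFile path) (length . words)
```

If f returned the whole String instead of an Int, force would pull the entire file into memory, which is the hazard described above.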

John Lato wrote:
Another (additional) approach would be to encapsulate unsafeInterleaveIO within some routine and not let it go out into the wild.
lazilyDoWithIO :: IO a -> (a -> b) -> IO b
It would use unsafeInterleaveIO internally but catch all IO errors within itself.
I wonder if this is a reasonable idea? Maybe it's already done? So the topic is shifting...
doWithIO :: NFData b => IO a -> (a -> b) -> IO b
doWithIO m f = liftM (\a -> let b = f a in b `deepseq` b) m
It works (just stick it in a "try" block for error handling), but you need to write a lot of NFData instances. You also need to be careful that b is some sort of reduced structure, or you can end up forcing the whole file (or other data) into memory.
I meant a different thing. In your example there is no unsafeInterleaveIO at all. I think you mean that the 'm' argument is supposed to be an unsafeInterleaved IO action, like getContents, and deepseq'ing saves us from it hanging somewhere for a long time. OK.

But I meant to have a routine that lets us use ordinary IO actions in a lazy way, while restricting that 'hanging' to within the bounds of this routine.

But having thought about it a little more, I understood that it's impossible. Lazy IO works now because unsafeInterleaveIO is stuck into getContents itself, and is called repeatedly (via recursion). Or at least I can think of this implementation; I haven't looked into it.

I realised that calling unsafeInterleaveIO for an ordinary IO action will not make it run lazily. It will still run all at once, just not now, but later.

So, to cope with it, I can think of exposing a little structure of an IO action. Normally it's completely opaque. But if we knew where its recursion point lies, then we could control its course of execution.

So, if we had something like

    type RecIO a = IO a -> IO a

and IO actions were like

    getContents :: RecIO [Char]
    getContents rec = do
        c <- readOneChar
        rest <- rec
        return (c:rest)

then we could either run them

    normally :: RecIO a -> IO a
    normally r = r (normally r)

or

    lazily :: RecIO a -> IO a
    lazily r = unsafeInterleaveIO $ r (lazily r)

And

    lazilyDoWithIO :: RecIO a -> (a -> b) -> IO b
    lazilyDoWithIO m f = do
        a <- lazily m
        return $ f a

Hmm, but then, we would have to take special care to not let it out of this function anyway... So here we come to the deepSeq'ing you proposed.

And anyway, instead of re-writing pure functions to become iteratees, we will have to re-write IO functions to adopt continuation-passing style. Initially it looked better to me :)

But with this approach we can run lazily any IO action that has the form of RecIO. Also, we can interleave normally and lazily based on the time of day and other conditions :)
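The RecIO idea can be tried out with base alone. This is a sketch under a few assumptions: charsFrom stands in for the getContents in the message, it reads from an explicit Handle, and it adds an end-of-file check that the original outline leaves out. firstChars is a made-up usage example.

```haskell
import System.IO (Handle, IOMode (ReadMode), hGetChar, hIsEOF, withFile)
import System.IO.Unsafe (unsafeInterleaveIO)

-- An IO action with its recursion point exposed as an argument.
type RecIO a = IO a -> IO a

-- Read characters from a handle; `next` stands for "the rest of the input".
charsFrom :: Handle -> RecIO String
charsFrom h next = do
  eof <- hIsEOF h
  if eof
    then return []
    else do
      c <- hGetChar h
      rest <- next
      return (c : rest)

-- Tie the knot strictly: the whole input is read before returning.
normally :: RecIO a -> IO a
normally r = r (normally r)

-- Tie the knot lazily: each recursive step is deferred until demanded.
lazily :: RecIO a -> IO a
lazily r = unsafeInterleaveIO (r (lazily r))

lazilyDoWithIO :: RecIO a -> (a -> b) -> IO b
lazilyDoWithIO m f = do
  a <- lazily m
  return (f a)

-- Usage: read only as many characters as `take n` demands.
firstChars :: Int -> FilePath -> IO String
firstChars n path = withFile path ReadMode $ \h -> do
  s <- lazilyDoWithIO (charsFrom h) (take n)
  length s `seq` return s  -- force the result before the handle is closed
```

The explicit forcing in firstChars is the "special care" mentioned above: without it, the deferred reads would run after withFile has closed the handle.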
It also doesn't help with other IO effects, e.g. writing output. I consider this one of the nicest features of iteratee-based processing.
Can you clarify what's the problem with writing? I think I just haven't switched from the topic of gluing code. Because as for gluing code, the type signature for IO writing, d -> IO (), is perfectly sufficient.

--
Daniil Elovkov
participants (2): Daniil Elovkov, John Lato