Mixing Unboxed Mutable Vectors and Parsers

CC: Maintainers of STMonadTrans, Vector, and JuicyPixels Hello, I am writing a Haskell Attoparsec parser which will modify 2-d arrays of small values (Word8, Int8, etc.). My first idea was to simply parse all the deltas, and later apply them to the input list. However, I can't do that because the value of the deltas depend on the value they're modifying. My first pass at this program used a function of the form: p :: [[Word8]] -> Parser [[Word8]] This approach works, however, the program uses far too much memory. Some quick testing shows that lists of Word8s are ~52.6x larger than unboxed vectors of Word8s, and boxed vectors of Word8s are ~7.5x larger than unboxed vectors of Word8s. A better approach would be to use Data.Vector.Unboxed.Mutable and do the mutations inline with the parser. Because mutable vectors require a monad in PrimMonad to do the mutations inside of, I'd have to use a monad transformer to combine Parser and something in PrimMonad. Attoparsec doesn't support being used as a monad transformer, so I can't say something like p :: (PrimMonad m, UM.Unbox a) => M.MVector (PrimState m) (UM.MVector (PrimState m) a) -> ParserT m () I can't use Parsec (instead of Attoparsec) because I require streaming semantics -- eventually I'm going to hook this up to Data.Conduit and parse directly from the net. There is STT (in the package STMonadTrans), however, so I could potentially make the function result in STT Parser (). However, STT doesn't work with Data.Vector.Mutable or Data.Vector.Unboxed.Mutable, because STT isn't a member of the PrimMonad class (as far as I can tell). STT itself doesn't define unboxed mutable vectors (only boxed mutable vectors), but I feel that giving up unboxing isn't really an option because of the memory footprint. As a general observation, it seems silly to have two different mutable vector implementations, one for STT and the other for PrimMonad. So here are my questions: 1. Is it possible to implement PrimMonad with STT? I looked around for a little while, but couldn't find anything that did this. 2. Otherwise, is it reasonable to try to implement unboxed mutable vectors in STT? I feel this is probably going down the wrong path. 3. Are there any parsers that support streaming semantics and being used as a monad transformer? This would require rewriting my whole program to use this new parser, but if that's what I have to do, then so be it. Thanks, Myles C. Maxfield

Hi Myles It seems odd to mix parsing (consuming input) with mutation. What problem are you trying to solve and are you sure you can't get better phase separation than this paragraph suggests?
My first idea was to simply parse all the deltas, and later apply them to the input list. However, I can't do that because the value of the deltas depend on the value they're modifying.

It's a JPEG parser.
Progressive JPEG is set up where there's a vector Word8s, and some of
the entries in the vector may be 0. The JPEG has a stream of bits, and
the decoder is supposed to shift in one bit to each successive element
in the vector, skipping over 0s, and stop when it reaches some
specified number of 0s.
So if your partially decoded vector is
<2, 8, 0, 12, 0, 10, 6, 0, 2, 10>
and the jpeg has this bit stream
<1, 1, 0, 1, 0, 0, 1, 0, ...>
and the jpeg says "shift in until the 3rd zero is found"
that would result in the partially decoded vector being
<3, 9, 0, 12, 0, 11, 6, 0, 2, 10>
with the leftover part of the stream being
<0, 1, 0, ...>
The JPEG parser has to keep track of where it is in the partially
decoded vector to know how many bits to shift in, and where they
belong, so the next iteration is aligned to the right place. It would
be possible to keep track of this stuff throughout the parsing, and
have the result of the parse be a second "delta" framebuffer and apply
it to the original after each scan is parsed, but that's fairly ugly
and I'd like to avoid doing that.
If that's what I have to do, though, I guess I have to do it. Isn't
there a better way?
--Myles
On Sat, Apr 7, 2012 at 11:56 PM, Stephen Tetley
Hi Myles
It seems odd to mix parsing (consuming input) with mutation.
What problem are you trying to solve and are you sure you can't get better phase separation than this paragraph suggests?
My first idea was to simply parse all the deltas, and later apply them to the input list. However, I can't do that because the value of the deltas depend on the value they're modifying.

On 8 April 2012 19:17, Myles C. Maxfield
It's a JPEG parser. ....
Ahem, I'm a bit of of my depth then, but one thing you should consider is that writing your own parser monad is trivial, especially for well defined binary formats. Well defined binary formats should be deterministic and don't need backtracking, hence you won't need to worry about defining Alternative / MonadPlus instances which is where the complication lies. As you should be able to live without backtracking for JPEG it shouldn't be hard to make the parser efficient if you stick closely to the API of the Vector or Array library you are using - e.g. you might want to move a cursor forward through the Array rather than "unconsing" [*] at the head which would create work for the garbage collector. This should still be amenable to streaming - you just work at a larger chunk size. Presumably you've seen Jeroen Fokker's paper on JPEG decoding with Gofer? (Gofer is ~= Haskell). [*] Parsec etc. work by unconsing the head of the input stream.

On 12-04-07 05:35 PM, Myles C. Maxfield wrote:
So here are my questions: ... 3. Are there any parsers that support streaming semantics and being used as a monad transformer? This would require rewriting my whole program to use this new parser, but if that's what I have to do, then so be it.
Have a look at the incremental-parser package. It's not a monad transformer, only a monad, but it's written with streaming in mind. In particular, it solves the problem of mismatch between the input chunk boundaries and the boundaries of the structures you're trying to parse. The current version supports ByteString, Text, and list inputs out of the box, but support for Vector and arrays can be added as outside instances.

On 2012-04-07 23:35, Myles C. Maxfield wrote:
CC: Maintainers of STMonadTrans, Vector, and JuicyPixels
Hello, I am writing a Haskell Attoparsec parser which will modify 2-d arrays of small values (Word8, Int8, etc.).
My first idea was to simply parse all the deltas, and later apply them to the input list. However, I can't do that because the value of the deltas depend on the value they're modifying.
My first pass at this program used a function of the form:
p :: [[Word8]] -> Parser [[Word8]]
This approach works, however, the program uses far too much memory.
Does the parser really need the input to determine what to do? Or is the parse tree the same regardless? In the latter case, you could perhaps rewrite it to p :: Parser ([[Word8]] -> [[Word8]]) or when working with mutable vectors p :: MVector s Word8 -> Parser (ST s ()) So instead of explicit deltas, the deltas can just be the function that applies them. Twan
participants (4)
-
Mario Blažević
-
Myles C. Maxfield
-
Stephen Tetley
-
Twan van Laarhoven