Re: [Haskell-cafe] Using Get monad to efficiently parse byte-stuffed data

25 Mar 2010

      On 10-03-24 12:44 PM, Paul Johnson wrote:
...
On 24/03/10 04:36, Pom Monico wrote:
...
Hello all,
I'm struggling to use the Get monad to efficiently parse the some
binary data of the format below. I simply can't find a way to use the
monad to parse it efficiently.
Binary data is terminated by a 0xFF marker. If the data itself
contains an 0xFF, a 0x00 is byte-stuffed after it to indicate it is
part of the data. An 0xFF followed by a non-zero byte indicates that
it is a real marker and terminates the data and the 0xFF is not part
of the data.
So the only way to find out if you have reached the end of the data is 
to read beyond it?  Yuck!
I haven't used it myself, but I suggest taking a look at Attoparsec, 
which is like Parsec but for bytestrings.  That might be able to 
handle this kind of non-determinism.
Don't know who thought it up but I have to deal with it.
The best I've gotten so far is the code below. In the Get monad, I do a 
"lookAhead getRemainingLazyByteString", call "go s 0 []" to get the 
byte-unstuffed data and #bytes consumed, and then do a "skip consumed" 
to move forward in the Get monad.

It is ugly and slow. I would like to implement this using the Get monad 
without dropping into the ByteString, and also so speed it up. In order 
of most time consuming, the sections are: go, takeToFF, ECS5-4, ECS5-2. 
But I'm also out of ideas without resorting to mutable constructs, 
esoteric tricks, or things that look like it belongs in a .c file.

Pom.

ff00 = {-# SCC "ECS5-8" #-} L.pack [0xff, 0x00]
ff   = {-# SCC "ECS5-9" #-} L.singleton 0xff
takeToFF s = case L.elemIndex 0xff s of
     Nothing -> (s, L.empty, L.length s) -- Return entire string
     Just x  -> let (h, t) = L.splitAt x s in (h, t, x) -- Split before 0xff
go s n acc
     | L.null s            = {-# SCC "ECS5-1" #-} (L.concat $ reverse 
acc, n)
     | L.isPrefixOf ff00 s = {-# SCC "ECS5-2" #-} go (L.drop 2 s) (n+2) 
(ff:acc)
     | L.isPrefixOf ff s   = {-# SCC "ECS5-3" #-} (L.concat $ reverse 
acc, n)
     | otherwise           = {-# SCC "ECS5-4" #-} let (h, t, x) = 
takeToFF s in go t (n+x) (h:acc)