Data.Binary get reading beyond end of input bytestring?

Hi,

I am reading data from a file as strict bytestrings and processing them in
an iteratee. As the parsing code uses Data.Binary, the strict bytestrings
are then converted to lazy bytestrings (using fromWrap, which Gregory
Collins posted here in January):

    -- | wrapped bytestring -> lazy bytestring
    fromWrap :: I.WrappedByteString Word8 -> L.ByteString
    fromWrap = L.fromChunks . (:[]) . I.unWrap

The parsing is then done with the library function Data.Binary.Get.runGetState:

    -- | Run the Get monad applies a 'get'-based parser on the input
    -- ByteString. Additional to the result of get it returns the number
    -- of consumed bytes and the rest of the input.
    runGetState :: Get a -> L.ByteString -> Int64 -> (a, L.ByteString, Int64)

The issue I am seeing is that runGetState consumes more bytes than the
length of the input bytestring, while reporting an apparently successful
get (i.e. it does not call error/fail). I was able to work around this by
checking if the bytes consumed > input length, and if so to ignore the
result of get and simply prepend the input bytestring to the next chunk in
the continuation.

However I am curious as to why this apparent lack of bounds checking
happens. My guess is that Get does not check the length of the input
bytestring, perhaps to avoid forcing lazy bytestring inputs; does that
make sense?

Would a better long-term solution be to use a strict-bytestring binary
parser (like cereal)? So far I've avoided that, as there is not yet a
corresponding ieee754 parser.

Conrad.
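[A minimal sketch of the workaround described above; the name parseChunk
and the Either-based interface are illustrative assumptions, not Conrad's
actual code:]

    import qualified Data.ByteString.Lazy as L
    import Data.Binary.Get (Get, runGetState)
    import Data.Int (Int64)

    -- If the reported consumption exceeds the input length, the get
    -- overran the chunk: discard its result and hand the input back so
    -- it can be prepended to the next chunk in the continuation.
    parseChunk :: Get a -> L.ByteString -> Either L.ByteString (a, L.ByteString, Int64)
    parseChunk g bs
      | consumed > L.length bs = Left bs
      | otherwise              = Right (x, rest, consumed)
      where
        (x, rest, consumed) = runGetState g bs 0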

Conrad Parker writes:

> the strict bytestrings are then converted to lazy bytestrings (using
> fromWrap which Gregory Collins posted here in January):
>
>     fromWrap :: I.WrappedByteString Word8 -> L.ByteString
>     fromWrap = L.fromChunks . (:[]) . I.unWrap

This just makes a 1-chunk lazy bytestring:

    (L.fromChunks . (:[])) :: S.ByteString -> L.ByteString

> The issue I am seeing is that runGetState consumes more bytes than the
> length of the input bytestring, while reporting an apparently successful
> get (i.e. it does not call error/fail).

Something smells fishy here. I have a hard time believing that binary is
reading more input than is available? Could you post more code please?

> Would a better long-term solution be to use a strict-bytestring binary
> parser (like cereal)? So far I've avoided that as there is not yet a
> corresponding ieee754 parser.
If you're using iteratees you could try attoparsec + attoparsec-iteratee,
which would be a more natural way to bolt parsers together. The
attoparsec-iteratee package exports:

    parserToIteratee :: (Monad m)
                     => Parser a
                     -> IterateeG WrappedByteString Word8 m a
Attoparsec is an incremental parser so this technique allows you to
parse a stream in constant space (i.e. without necessarily having to
retain all of the input). It also hides the details of the annoying
buffering/bytestring twiddling you would be forced to do otherwise.
Cheers,
G
--
Gregory Collins
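[As a rough illustration of the buffering this hides, here is a hand-rolled
incremental driver written against attoparsec's Data.Attoparsec.ByteString
interface; hGetSome and the 4096-byte chunk size are arbitrary choices,
and attoparsec-iteratee does this plumbing for you:]

    import qualified Data.Attoparsec.ByteString as A
    import qualified Data.ByteString as S
    import System.IO (Handle)

    -- Feed the parser one strict chunk at a time; memory use is bounded
    -- by the chunk size plus whatever the parser itself retains.
    parseFromHandle :: A.Parser a -> Handle -> IO (Either String a)
    parseFromHandle p h = go . A.parse p =<< S.hGetSome h 4096
      where
        go (A.Done _ x)     = return (Right x)
        go (A.Fail _ _ err) = return (Left err)
        -- an empty chunk from hGetSome at EOF tells attoparsec the
        -- input has ended, so the loop terminates
        go (A.Partial k)    = go . k =<< S.hGetSome h 4096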

I have a similar issue, I think. The problem with attoparsec is it only
covers the unmarshalling side; writing data to disk still requires
manually marshalling values into ByteStrings. Data.Binary with Data.Derive
provides a clean, proven (encode . decode == id) way of doing this. If
there's a way to accomplish this with attoparsec, I'd love to know.

Max

On Jul 28, 2010, at 10:32 PM, Gregory Collins wrote:
> If you're using iteratees you could try attoparsec + attoparsec-iteratee
> which would be a more natural way to bolt parsers together. [...]

On Wed, Jul 28, 2010 at 8:38 PM, Max Cantor wrote:

> I have a similar issue, I think. The problem with attoparsec is it only
> covers the unmarshalling side [...] If there's a way to accomplish this
> with attoparsec, I'd love to know.

Sorry, attoparsec is just a parsing library :-)

"Bryan O'Sullivan"
On Wed, Jul 28, 2010 at 8:38 PM, Max Cantor
wrote: I have a similar issue, I think. The problem with attoparsec is it only covers the unmarshalling side, writing data to disk still requires manually marshalling values into ByteStrings. Data.Binary with Data.Derive provide a clean, proven (encode . decode == id) way of doing this.
If there's a way to accomplish this with attoparsec, I'd love to know.
Sorry, attoparsec is just a parsing library :-)
:o You mean it isn't the solution to all of the world's problems? -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On 28 July 2010 23:32, Gregory Collins wrote:

> Something smells fishy here. I have a hard time believing that binary is
> reading more input than is available? Could you post more code please?
The issue seems to just be the return value for "bytes consumed" from
getLazyByteString. Here's a small example.

conrad@hunter:~/src/haskell/binary-overrun$ cat overrun.hs
{-# LANGUAGE OverloadedStrings #-}

import Data.Binary
import Data.Binary.Get
import qualified Data.ByteString.Lazy.Char8 as C

data TenChars = TenChars C.ByteString
  deriving (Show)

instance Binary TenChars where
  get = getLazyByteString 10 >>= return . TenChars
  put = undefined

consume bs = do
  let (ret, rem, len) = runGetState (get :: Get TenChars) bs 0
  putStrLn $ "Input: " ++ show bs ++ ", length " ++ (show $ C.length bs)
  putStrLn $ "  consumed " ++ (show len) ++ " bytes without error."
  putStrLn $ "  Output: " ++ show ret
  putStrLn $ "  Remain: " ++ show rem

main = do
  consume "1234567890ABCDE"
  consume "1234567890"
  consume "12345"

conrad@hunter:~/src/haskell/binary-overrun$ ./overrun
Input: Chunk "1234567890ABCDE" Empty, length 15
  consumed 10 bytes without error.
  Output: TenChars (Chunk "1234567890" Empty)
  Remain: Chunk "ABCDE" Empty
Input: Chunk "1234567890" Empty, length 10
  consumed 10 bytes without error.
  Output: TenChars (Chunk "1234567890" Empty)
  Remain: Empty
Input: Chunk "12345" Empty, length 5
  consumed 10 bytes without error.
  Output: TenChars (Chunk "12345" Empty)
  Remain: Empty

Here, the third example claims to have consumed 10 bytes out of the
available 5, and does not fail. The issue is that this return value cannot
be used for maintaining offsets. It is documented that it will not fail,
but the returned len value seems to be incorrect.

I've now added a check that fails if the returned bytestring is shorter
than required.
> If you're using iteratees you could try attoparsec + attoparsec-iteratee
> which would be a more natural way to bolt parsers together. [...]
Thanks for the pointer :)

Conrad.

On 29 July 2010 07:53, Conrad Parker wrote:

>> Something smells fishy here. I have a hard time believing that binary
>> is reading more input than is available? Could you post more code
>> please?
>
> The issue seems to just be the return value for "bytes consumed" from
> getLazyByteString. Here's a small example.
http://hackage.haskell.org/packages/archive/binary/0.5.0.2/doc/html/Data-Bin...

    getLazyByteString :: Int64 -> Get ByteString
      An efficient get method for lazy ByteStrings. Does not fail if fewer
      than n bytes are left in the input.

Because it does it lazily, it cannot check if it's gone past the end of
the input. Arguably this is crazy and the function should not exist.

Duncan
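[For reference, the check Conrad mentions adding might look like this;
getLazyByteStringChecked is a hypothetical name, not part of binary, and
forcing the length gives up exactly the laziness Duncan describes:]

    import qualified Data.ByteString.Lazy as L
    import Data.Binary.Get (Get, getLazyByteString)
    import Data.Int (Int64)

    -- Force the length of the returned bytestring and fail if the input
    -- ran out early. This is safe but strict: it defeats the lazy
    -- behaviour getLazyByteString was written to provide.
    getLazyByteStringChecked :: Int64 -> Get L.ByteString
    getLazyByteStringChecked n = do
      s <- getLazyByteString n
      if L.length s < n
        then fail ("expected " ++ show n ++ " bytes, got " ++ show (L.length s))
        else return s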

On 29 July 2010 17:46, Duncan Coutts wrote:

> Because it does it lazily it cannot check if it's gone past the end of
> the input. Arguably this is crazy and the function should not exist.
Cheers Duncan, that confirms my guess about the reason. Would you accept a
patch quoting you on that last point to the comment? ;-)

Conrad.

On Thu, 2010-07-29 at 19:01 +0900, Conrad Parker wrote:
> Cheers Duncan, that confirms my guess about the reason. Would you accept
> a patch quoting you on that last point to the comment? ;-)
The consensus plan amongst the binary hackers is to eliminate lazy
lookahead functions and to rebuild binary on top of a continuation style
using strict chunks (then with lazy decoding built on top).

Duncan

On 29 July 2010 19:13, Duncan Coutts wrote:

> The consensus plan amongst the binary hackers is to eliminate lazy
> lookahead functions and to rebuild binary on top of a continuation style
> using strict chunks (then with lazy decoding built on top).
I'll take that as a no on the patch.

How would that plan differ from having an iteratee version of Data.Binary?
i.e. something that is easily compatible with WrappedByteString, as the
existing Data.Binary is easily compatible with Data.ByteString.Lazy?

Conrad.

On Thu, 2010-07-29 at 19:17 +0900, Conrad Parker wrote:
> I'll take that as a no on the patch.
Oh, sorry, documentation patch is fine.
> How would that plan differ from having an iteratee version of
> Data.Binary? i.e. something that is easily compatible with
> WrappedByteString, as the existing Data.Binary is easily compatible
> with Data.ByteString.Lazy?
No idea what WrappedByteString is.

It would look like attoparsec's resumable parser:

    data Result a = Fail !ByteString
                  | Partial (ByteString -> Result a)
                  | Done !ByteString a

    runGet :: Get a -> ByteString -> Result a

Point is you feed it strict bytestring chunks. Then decoding a lazy
bytestring can be implemented on top easily, as can decoding a sequence
lazily. I imagine you could fairly easily interface it with iteratee too.

Duncan
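[To make "feed it strict bytestring chunks" concrete, here is a sketch of
a driver over Duncan's hypothetical Result type; given the proposed
runGet, you would call it as runGetLazy (runGet g). The
empty-chunk-as-EOF convention is borrowed from attoparsec, not specified
by Duncan:]

    import qualified Data.ByteString as S
    import qualified Data.ByteString.Lazy as L

    data Result a = Fail !S.ByteString
                  | Partial (S.ByteString -> Result a)
                  | Done !S.ByteString a

    -- Decode a lazy bytestring on top of the strict-chunk interface by
    -- resuming the continuation on each chunk in turn.
    runGetLazy :: (S.ByteString -> Result a) -> L.ByteString -> Either String a
    runGetLazy k0 = go k0 . L.toChunks
      where
        go k (c:cs) = case k c of
                        Done _ x   -> Right x
                        Fail _     -> Left "parse failure"
                        Partial k' -> go k' cs
        go k []     = case k S.empty of   -- empty chunk signals EOF
                        Done _ x   -> Right x
                        Fail _     -> Left "parse failure"
                        Partial _  -> Left "unexpected end of input"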

On Thu, Jul 29, 2010 at 6:15 AM, Duncan Coutts wrote:

> No idea what WrappedByteString is.

WrappedByteString is a newtype wrapper around ByteString that has a
phantom type. This allows instances to be written such that ByteString can
be used with the iteratee library. You can see the source here if you're
interested:

http://hackage.haskell.org/packages/archive/iteratee/0.3.5/doc/html/src/Data...

> It would look like attoparsec's resumable parser:
>
>     data Result a = Fail !ByteString
>                   | Partial (ByteString -> Result a)
>                   | Done !ByteString a
>
>     runGet :: Get a -> ByteString -> Result a
>
> Point is you feed it strict bytestring chunks. Then decoding a lazy
> bytestring can be implemented on top easily, as can decoding a sequence
> lazily.

Like attoparsec you'll probably want to write some other utility functions
for working with Results. Attoparsec defines feed, parseWith, maybeResult,
and eitherResult. I think you'll want something similar here.

> I imagine you could fairly easily interface it with iteratee too.

Yes, that should be easy given the above API. See for example the
attoparsec-iteratee package.

Once that work is done, how will binary differ from cereal? How will I
know which one to pick?

Thanks,
Jason
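[For comparison, attoparsec's parseWith packages up the refill loop that
those helpers imply; a one-line sketch, where hGetSome and the 4096-byte
chunk size are again arbitrary choices:]

    import qualified Data.Attoparsec.ByteString as A
    import qualified Data.ByteString as S
    import System.IO (Handle)

    -- parseWith calls the supplied action whenever the parser demands
    -- more input; eitherResult collapses the final Result.
    parseHandle :: A.Parser a -> Handle -> IO (Either String a)
    parseHandle p h = A.eitherResult <$> A.parseWith (S.hGetSome h 4096) p S.empty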

On Thu, Jul 29, 2010 at 10:35 AM, Jason Dagit wrote:

> Like attoparsec you'll probably want to write some other utility
> functions for working with Results. Attoparsec defines feed, parseWith,
> maybeResult, and eitherResult. I think you'll want something similar
> here.
And I forgot to mention: given those constructors for Result, how will you
decode a sequence lazily? I agree you can consume the input lazily, but
decoding won't produce any values until you hit Done.

This is something I noticed with attoparsec as well. You can feed it
incrementally, but it doesn't produce output until it reaches a Done
state. In one project, where the input was line-based records, I used
ByteString's hGetLine to read individual lines and then parse them
individually. That gave me a Result for each line, and I could build
incremental processing of records on top of that.

Perhaps I was using attoparsec incorrectly, but as far as I can tell it
doesn't have a way to incrementally produce results. I think what needs to
be added is a way to 'yield' and 'resume'. What I can't figure out is how
to meaningfully return 'partial results' in general. In cases where the
top-level combinator is something like manyTill it makes sense, but in
other cases it's unclear to me. And I think you'd need to add a
constructor above that is some variation of "PartiallyDone a ByteString
(ByteString -> Result a)".

Jason
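[A sketch of the extension Jason is gesturing at, mirroring his suggested
constructor; this is purely hypothetical:]

    import qualified Data.ByteString as S

    -- Result extended with incremental output: PartiallyDone yields one
    -- completed element, the input remaining at that point, and a
    -- continuation for the rest of the stream.
    data Result a = Fail !S.ByteString
                  | Partial (S.ByteString -> Result a)
                  | PartiallyDone a S.ByteString (S.ByteString -> Result a)
                  | Done !S.ByteString a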

On Thu, Jul 29, 2010 at 10:55 AM, Jason Dagit wrote:

> Given those constructors for Result, how will you decode a sequence
> lazily?
I deliberately left incremental results out of the attoparsec API, because
it's a burrito-filled spacesuit of worms. The problem is that parsers can
nest and fail and backtrack and be, in general, arbitrarily perverse. For
instance, if there was a "yield" action, I could yield you a 1, backtrack,
then yield you a 10 from some completely different branch of the parser
that neither I nor you could foresee.

Now, lazily parsing a sequence of data is a very common need, and it's
also highly constrained, and hence easy to predict. It's also something
that's easy to write without any access to the attoparsec internals. It
would take just a few moments to write this:

    parseSequence :: Parser a -> Lazy.ByteString -> [a]

I haven't added this to attoparsec because it's (a) trivial and (b) just
one possible approach. Using iteratees would be another; running with a
monadic action that can refill the parser on a Partial result would be yet
another.
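[Bryan leaves the body as an exercise; a minimal sketch under his stated
signature might look like the following, assuming the element parser
always consumes input. Laziness comes from producing each element before
parsing the rest:]

    import qualified Data.Attoparsec.ByteString as A
    import qualified Data.ByteString as S
    import qualified Data.ByteString.Lazy as L

    -- Run the element parser repeatedly over the chunks of a lazy
    -- bytestring, emitting elements lazily; stops at end of input or on
    -- the first parse failure.
    parseSequence :: A.Parser a -> L.ByteString -> [a]
    parseSequence p = go . L.toChunks
      where
        go []     = []
        go (c:cs) = step (A.parse p c) cs

        step (A.Done rest x) cs
          | S.null rest && null cs = [x]
          | otherwise              = x : go (rest : cs)
        step (A.Partial k) (c:cs)  = step (k c) cs
        step (A.Partial k) []      = step (k S.empty) []  -- signal EOF
        step A.Fail{}      _       = []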
participants (7)

- Bryan O'Sullivan
- Conrad Parker
- Duncan Coutts
- Gregory Collins
- Ivan Lazar Miljenovic
- Jason Dagit
- Max Cantor