A fancier Get monad or two (a la binary and binary-strict)

Summary: I have two new Get-like monads for binary data (byte-aligned) that also (*) Suspend parsing and request more import when reading past the end of data. It is possible to respond with Nothing to indicate the hard end of data. (*) Return failures instead of calling error (*) Offer lookAhead,lookAheadM,lookAheadE like Data.Binary.Get (*) Are BinaryParser instances from Data.Binary.Strict.Class (*) Are Monad Transformers (and thus MonadIO) (*) Are MonadError,Plus,Reader,Writer,State and Applicative,Alternative (*) They differ because one is also MonadCont while the other is simpler (*) Simplified Non-transformer versions (applied to Identity) are defined Note: Making the suspend/resume-with-more-input work efficiently with MonadError/Alternative is one improvement to binary-strict. Make callCC work with sane semantics at the same time was non-trivial. Long version and where to find the source: In the process of writing an implementation of google's protocol buffers, I used binary's Put monad and binary-strict's BinaryParser class (i.e. binary-strict's Incremental Get monad). The Get in the binary package works on Lazy ByteString (good for me) but tends to call "error" when unhappy, which was not ideal. The Incremental Get from the binary-strict package does not want to work on Lazy bytestrings (it can be adapted), but the internals hold onto data too long and obstruct garbage collection (this is how it appears to me when I read the code). So I have made a fancier Get monad that offers a large collection of features. I am announcing this in case other people might want to further adapt my solution for their own ends (or just use it outright). Removing feature from my code ought to be much easier than adding them was. The code itself is currently living in a few files inside the protocol-buffers project at http://darcs.haskell.org/packages/protocol-buffers/Text/ProtocolBuffers/ The files are "MyGet.hs" or "MyGetW.hs", there are two flavors. (The "testDepth" file is annotated test output from MyGetW's testDepth function). They depend on binary-strict, but this would be easy to comment out. They both contain code "lifted" from binary and "inspired" by binary-strict. MyGetW.hs is the version with a MonadCont instance. MyGet.hs has the (forall b.) to simplify the types and thus has no MonadCont instance. I have tested the code, but it is still quite new. I doubt I will have time to make a proper cabal release. I may see if the maintainers of the binary-strict or binary packages are interested in a fancier Get monad. Cheers, Chris K

Silly typo in the first bullet point... (*) Suspend parsing and request more input when reading past the end of data. ^^^^^ I do not have to wonder why "import" is so easy to type. -- Chris

On Wed, 2008-07-30 at 02:23 +0100, Chris Kuklewicz wrote:
Summary: I have two new Get-like monads for binary data (byte-aligned) that also (*) Suspend parsing and request more import when reading past the end of data. It is possible to respond with Nothing to indicate the hard end of data. (*) Return failures instead of calling error (*) Offer lookAhead,lookAheadM,lookAheadE like Data.Binary.Get (*) Are BinaryParser instances from Data.Binary.Strict.Class (*) Are Monad Transformers (and thus MonadIO) (*) Are MonadError,Plus,Reader,Writer,State and Applicative,Alternative (*) They differ because one is also MonadCont while the other is simpler (*) Simplified Non-transformer versions (applied to Identity) are defined
[..]
I have tested the code, but it is still quite new. I doubt I will have time to make a proper cabal release. I may see if the maintainers of the binary-strict or binary packages are interested in a fancier Get monad.
Yep, definitely interested. Sounds like we could make something that would satisfy the needs of existing users of the binary and binary-strict packages. We'll have to look closely at the performance costs of the new features but my intuition is that a non-transformer but continuation based version that has error handling (and plus/alternative) and can request more input should have minimal cost. Duncan

Summary: I have made the simplified version Duncan speculates about... Duncan Coutts wrote:
Yep, definitely interested. Sounds like we could make something that would satisfy the needs of existing users of the binary and binary-strict packages.
I see this as incremental development of incremental get. The comments in binary-strict's code point to one additional useful feature: When the parser suspends and asks for more input it could potentially also return a list or Sequence of the output-so-far. I believe this is possible to add to my existing code (even the simplified version below), so there will be an even fancier version eventually. This would make the parser a controllable part of a pipeline.
We'll have to look closely at the performance costs of the new features but my intuition is that a non-transformer but continuation based version that has error handling (and plus/alternative) and can request more input should have minimal cost.
Duncan
To modify down "MyGet.hs" to produce that type is a matter of using the delete key (which I have done, see below). I only have my Apple powerpc/G4 laptop to run Haskell, this and the lack of either need or available time means I will not be making performance measurements. I have written for the language shootout before and I had binary's Get.hs to look at, and so I claim my code has no show-stopping performance killers in it. A bit of !strictness and even more INLINE pragmas should be all it needs. And so I just took such a knife to MyGet.hs to make MyGetSimplified.hs. It is at http://darcs.haskell.org/packages/protocol-buffers/Text/ProtocolBuffers/ with the other files. The simplified form of the data definitions mostly fits in this email: newtype Get a = Get { unGet :: forall b. -- the forall hides the CPS style Success b a -- main continuation -> S -- parser state -> FrameStack b -- error handler stack -> Result b -- operation } type Success b a = (a -> S -> FrameStack b -> Result b) data S = S { top :: {-# UNPACK #-} !S.ByteString , current :: {-# UNPACK #-} !L.ByteString , consumed :: {-# UNPACK #-} !Int64 } deriving Show data FrameStack b = [snip details] data Result a = Failed {-# UNPACK #-} !Int64 String -- the Int64 is amount consumed successfully | Finished {-# UNPACK #-} !L.ByteString {-# UNPACK #-} !Int64 a -- the bytestring is the unconsumed part, the Int64 is amount consumed | Partial (Maybe L.ByteString -> Result a) -- passing Nothing indicates that there will never be more input This could be streamlined further by making (FrameStack b) a field of S. The MonadError/Plus/Alternative should still work. It will still suspend and resume. A few more short functions and this will have the same exported signatures as binary's Get.hs. Cheers, Chris

On Wed, Jul 30, 2008 at 9:26 AM, Duncan Coutts
On Wed, 2008-07-30 at 02:23 +0100, Chris Kuklewicz wrote:
Summary: I have two new Get-like monads for binary data (byte-aligned) that also (*) Suspend parsing and request more import when reading past the end of data. It is possible to respond with Nothing to indicate the hard end of data. (*) Return failures instead of calling error (*) Offer lookAhead,lookAheadM,lookAheadE like Data.Binary.Get (*) Are BinaryParser instances from Data.Binary.Strict.Class (*) Are Monad Transformers (and thus MonadIO) (*) Are MonadError,Plus,Reader,Writer,State and Applicative,Alternative (*) They differ because one is also MonadCont while the other is simpler (*) Simplified Non-transformer versions (applied to Identity) are defined
[..]
We'll have to look closely at the performance costs of the new features but my intuition is that a non-transformer but continuation based version that has error handling (and plus/alternative) and can request more input should have minimal cost.
I've written what I believe to be a similar, continuation-based parser. I haven't uploaded my latest patches (basically faster combinators) but the idea can be seen in the file here: http://www.johantibell.com/cgi-bin/gitweb.cgi?p=hyena.git;a=blob;f=Hyena/Par... The use case is parsing HTTP without resorting to lazy I/O. Cheers, Johan

Johan Tibell wrote:>
I've written what I believe to be a similar, continuation-based parser. I haven't uploaded my latest patches (basically faster combinators) but the idea can be seen in the file here:
http://www.johantibell.com/cgi-bin/gitweb.cgi?p=hyena.git;a=blob;f=Hyena/Par...
The use case is parsing HTTP without resorting to lazy I/O.
I have just read through your code. It is quite similar. The differences: Your error handling is via Alternative, and if the first branch advances (consumes input) then the second branch is not attempted. The state can only go forward (by 1 byte) or remain in place (there is no look-ahead). If reading past the end, then either (*) the new first byte works (*) the new first fails and the Alternative is ready to try it. In MyGet/MyGetW/MyGetSimplified the MonadError semantics are different. If the first alternative fails then the parser state is rolled back and the second alternative is tried from the same starting point as the first was tried. If the first alternative trigged more input from IPartial then this input is still visible to the second alternative. The management of saved state on the stack of pending operations is much simpler with your commit-if-advance semantics and much more complicated with my rollback semantics. Oddly, it seems your committed operations do not immediately release the pending handlers so they can be garbage collected. This same kind of issue motivated me to improve the implementation of binary-strict's incremental get. On a different note: Hmmm....In the MyGetW implementation I could add a fancier throwError/Alternative command that allows the user to "commit" to the current branch and immediately release/discard the pending handler/second branch. Something like:
d = mplus (mplus a b) c where a = do comThing <- getCommitter x <- getWord32be catchError (commitTo comThing >> throwError "WTF") (\errMsg -> liftIO (print errMsg)) b = liftIO (print "b") c = liftIO (print "c")
In the above pseudo code the "commitTo" will cause the throwError to bypass both the (\errMsg -> ...) handler and the "b" alternative. It will go to "c" instead. The "comThing" is an opaque value that is the unique ID of the current error handler frame, in the above case it is the frame for "mplus a b". So the commitTo causes the system to immediately abandon (and allow garbage collection) of the (errMsg -> ..) and "b" code. Hmmm.... -- Chris

Hi Chris,
Thanks for providing feedback. It's much appreciated.
On Wed, Jul 30, 2008 at 11:34 PM, Chris Kuklewicz
The differences: Your error handling is via Alternative, and if the first branch advances (consumes input) then the second branch is not attempted. The state can only go forward (by 1 byte) or remain in place (there is no look-ahead). If reading past the end, then either (*) the new first byte works (*) the new first fails and the Alternative is ready to try it.
This is by design. This is enough for parsing many network protocols like HTTP and avoids having to deal with backtracking and the potential space leaks and bookkeeping overhead that might come with it.
The management of saved state on the stack of pending operations is much simpler with your commit-if-advance semantics and much more complicated with my rollback semantics. Oddly, it seems your committed operations do not immediately release the pending handlers so they can be garbage collected. This same kind of issue motivated me to improve the implementation of binary-strict's incremental get.
I'm not sure what you mean by releasing pending handlers. It's probably something I've not considered. Could you please elaborate? Cheers, Johan

Johan Tibell wrote:
Hi Chris,
Thanks for providing feedback. It's much appreciated.
On Wed, Jul 30, 2008 at 11:34 PM, Chris Kuklewicz
wrote: The differences: Your error handling is via Alternative, and if the first branch advances (consumes input) then the second branch is not attempted. The state can only go forward (by 1 byte) or remain in place (there is no look-ahead). If reading past the end, then either (*) the new first byte works (*) the new first fails and the Alternative is ready to try it.
This is by design. This is enough for parsing many network protocols like HTTP and avoids having to deal with backtracking and the potential space leaks and bookkeeping overhead that might come with it.
I agree that the design space of get and parsing monads is quite broad. That is why my small endeavor now has 3 monads: MyGetW with MonadCont, MyGet without MonadCont, and MyGetSimplified without MonadCont,Reader,Writer,State,Trans.
The management of saved state on the stack of pending operations is much simpler with your commit-if-advance semantics and much more complicated with my rollback semantics. Oddly, it seems your committed operations do not immediately release the pending handlers so they can be garbage collected. This same kind of issue motivated me to improve the implementation of binary-strict's incremental get.
I'm not sure what you mean by releasing pending handlers. It's probably something I've not considered. Could you please elaborate?
Absolutely. It does not affect the semantics of the computation, but it does affect the space performance. Your Hyena.Parser has a state, a success continuation, and a failure continuation. The state has a count of the number of bytes consumed. The Alterantive <|> operation is the interesting one:
data S = S S.ByteString Int64 Bool
newtype Parser a = Parser { unParser :: forall r. S -> (a -> S -> IResult r) -> (S -> IResult r) -> IResult r }
instance Alternative Parser where empty = Parser $ \s _ fail -> fail s
p <|> p' = Parser $ \s@(S _ pos _) succ fail -> unParser p s succ $ \s'@(S _ pos' _) -> if pos == pos' then unParser p' s' succ fail else fail s'
I will rewrite this for clarity, and define the function fail' as
p <|> p' = Parser $ \s@(S _ pos _) succ fail -> let fail' s'@(S _ pos' _) = if pos == pos' then unParser p' s' succ fail else fail s' in unParser p s succ fail'
It is now clear that <|> works by allocating fail' which hold a reference to the original pos and succ and fail. Happily, it does not hold a reference to the original S.ByteString because your handlers have commit-if-advance semantics. The commit-if-advance semantics come from (if pos == pos') which checks to see whether p has consumed any of the input. In practice pos <= pos' will always hold, since p cannot have gone backwards. When does fail' get released? It gets released after p fails by calling it or after p and thus <|> succeeds. But fail' reverts in behavior to fail as soon as p advances pos. So Parser keeps fail' longer than needed. For example: if p uses bytes then even though bytes advances for a while the fail' is not released until it hits an error or p finishes. The core problem is the error handlers in Hyena.Parser are written in the form of nested scopes and this does not model their semantics. They behave like the Bool in the state and thus 'fail' can be moved into 'S' ... which I have done in the attached file. The meat of the difference is in 'satisfy'. When pos is incemented to (pos+1) the (Maybe fail) is set to Nothing which signifies that the top-level failure continuation 'failed' is in effect. The 'fail' functions, which are allocated in <|> as before, are thus immediately released when the parser advances. Whether this makes it better or not depends on how deep the nested stack <|> calls become. A very deep stack had many fail' waiting in vain under the original Hyena.Parser and are released much much sooner under my version. [ Also, I deleted IPartial. ] Cheers, Chris K {-# LANGUAGE Rank2Types #-} -- modified by Chris Kuklewicz -- * remove IResult -- * move fail continuation to state S, inside Maybe [changes S to (S r)] -- * clear fail to Nothing when advancing -- * use new 'callFailed' to invoke failure continuation ------------------------------------------------------------------------ -- | -- Module : Hyena.Parser -- Copyright : (c) Johan Tibell 2008 -- License : BSD3-style (see LICENSE) -- -- Maintainer : johan.tibell@gmail.com -- Stability : experimental -- Portability : portable -- -- An incremental LL(1) parser combinator library. -- ------------------------------------------------------------------------ module Hyena.Parser ( -- * The Parser type Parser, Result(..), runParser, -- * Primitive parsers satisfy, byte, bytes, module Control.Applicative ) where import Control.Applicative import qualified Data.ByteString as S import Data.Int (Int64) import Data.Word (Word8) import Prelude hiding (fail, succ) import Text.Show.Functions import Debug.Trace -- --------------------------------------------------------------------- -- The Parser type -- | The parse state. data S r = S {-# UNPACK #-} !S.ByteString {-# UNPACK #-} !Int64 {-# UNPACK #-} !Bool {-# UNPACK #-} !(Maybe (S r -> Result r)) -- error hander moved to state deriving Show setFail :: S r -> Maybe (S r -> Result r) -> S r setFail (S a b c _) mfail = (S a b c mfail) getFail :: S r -> Maybe (S r -> Result r) getFail (S _ _ _ mfail) = mfail clearFail :: S r -> S r clearFail (S a b c _) = S a b c Nothing -- | A parse either succeeds, fails or returns a suspension with which -- the parsing can be resumed. data Result a = Finished a S.ByteString -- ^ Parsing succeeded and produced a value of type -- @a@. The returned 'S.ByteString' is the remaining -- unconsumed input. | Failed Int64 -- ^ Parsing failed at the given position. Either -- because the parser didn't match the input or because -- an unexpected end of input was reached during -- parsing. | Partial (Maybe S.ByteString -> Result a) -- ^ The parsing needs more input to continue. Pass in -- @Just input@ to continue parsing and @Nothing@ to -- signal end of input. If @Nothing@ is passed the -- 'Result' is either 'Finished' or 'Failed'. deriving Show -- | A parser takes a parse state, a success continuation and returns a 'Result'. newtype Parser a = Parser { unParser :: forall r. S r -> (a -> S r -> Result r) -> Result r } -- --------------------------------------------------------------------- -- Instances instance Functor Parser where fmap f p = Parser $ \s succ -> unParser p s (succ . f) {-# INLINE fmap #-} instance Applicative Parser where pure a = Parser $ \s succ -> succ a s {-# INLINE pure #-} p <*> p' = Parser $ \s succ -> let newSucc f s' = unParser p' s' (succ . f) in unParser p s newSucc {-# INLINE (<*>) #-} instance Alternative Parser where empty = Parser $ \s _ -> callFailed s {-# INLINE empty #-} p <|> p' = Parser $ \s@(S _ _ _ mfail) succ -> let newS = setFail s (Just (\s' -> unParser p' (setFail s' mfail) succ)) in unParser p newS succ {-# INLINE (<|>) #-} -- --------------------------------------------------------------------- -- Running a parser initState :: S.ByteString -> S r initState bs = S bs 0 False Nothing {-# INLINE initState #-} -- | This is the final continuation that turns a successful parse into -- a 'Result'. finished :: r -> S r -> Result r finished v (S bs _ _ _) = Finished v bs -- | This is the final continuation that turns an unsuccessful parse -- into a 'Result'. failed :: S r -> Result a failed (S _ pos _ _) = Failed pos -- | TODO: Write documentation. runParser :: Parser a -> S.ByteString -> Result a runParser p bs = unParser p (initState bs) finished callFailed :: S r -> Result r callFailed s@(S _ _ _ Nothing) = failed s callFailed s@(S _ _ _ (Just fail)) = fail s -- --------------------------------------------------------------------- -- Primitive parsers -- | The parser @satisfy p@ succeeds for any byte for which the -- supplied function @p@ returns 'True'. Returns the byte that is -- actually parsed. satisfy :: (Word8 -> Bool) -> Parser Word8 satisfy p = Parser $ \s@(S bs pos eof mfail) succ -> case S.uncons bs of Just (b, bs') -> if p b then succ b (S bs' (pos + 1) eof Nothing) -- clear (Maybe fail) on advance else callFailed s Nothing -> if eof then callFailed s else Partial $ \x -> case x of Just bs' -> retry (S bs' pos eof mfail) Nothing -> callFailed (S bs pos True mfail) where retry s' = unParser (satisfy p) s' succ -- | @byte b@ parses a single byte @b@. Returns the parsed byte -- (i.e. @b@). byte :: Word8 -> Parser Word8 byte b = satisfy (== b) -- | @bytes bs@ parses a sequence of bytes @bs@. Returns the parsed -- bytes (i.e. @bs@). bytes :: S.ByteString -> Parser S.ByteString bytes = fmap S.pack . go where go bs = case S.uncons bs of Just (b, bs') -> liftA2 (:) (byte b) (go bs') Nothing -> pure []

On Thu, Jul 31, 2008 at 2:00 PM, Chris Kuklewicz
[terrific explanation of the problem]
Thanks a lot for the explanation. It's perfectly clear to me now. I changed by implementation along the lines with the one you attached except I didn't use Maybe to wrap the error handler in the parse state but instead set the failure handler to `failed' instead of Nothing on commit. Cheers, Johan

And now MyGetW.hs [1] is even more capable: During the parsing one can use a "yieldItem y" command to queue y for delivery to the code running the Get monad. The length of this queue can be examined with "pendingItems" which returns an Int, which will be non-negative. To immediately pause parsing and return a Data.Sequence of yielded items to the user one calls "flushItems" which returns the sequence and a lazy value which is the future of the paused parsing. The pending queue of items is also returned every time the parser needs more input, sees an error, or completes. Thus the parser can be a lazy participant in a chain of processing. The queued items are undisturbed by fail/throwError/mzero all lookAhead* functions and any use of callCC and the continuations it returns. Thus "yieldItem" is for life and cannot be undone. The continuation returned by "callCC" was improved internally to not hold onto the old input state of the BinaryParser or the old user state. It does not use these things and should not keep them alive. Frankly, I cannot think of any more features to add. Once Haddock can be used with a released Cabal I may make a more proper release of these files. Cheers, Chris [1] MyGetW.hs and MyGet.hs and MyGetSimplified.hs are still at http://darcs.haskell.org/packages/protocol-buffers/Text/ProtocolBuffers/

Chris Kuklewicz
The length of this queue can be examined with "pendingItems" which returns an Int, which will be non-negative. Wouldn't it be more in the spirit of Haskell to use a data type that has "non-negative" already built in, like Data.Word.Word?
Regards, Michael Karcher
participants (4)
-
Chris Kuklewicz
-
Duncan Coutts
-
Johan Tibell
-
usenet@mkarcher.dialup.fu-berlin.de