
Hello, The recent thread on binary parsing got me to thinking about more general network protocol parsing with parsec. A lot of network protocols these days are text-oriented, so seem a good fit for parsec. However, the difficulty I come up time and again is: parsec normally expects to parse as much as possible at once. With networking, you must be careful not to attempt to read more data than the server hands back, or else you'll block. I've had some success with hGetContents on a socket and feeding it into extremely carefully-crafted parsers, but that is error-prone and ugly. Here's the problem. With a protocol such as IMAP, there is no way to know until a server response is being parsed, how many lines (or bytes) of data to read. Ideally, I would be able to slrup in more data as I go, but that doesn't seem to be very practical in Parsec either. Suggestions?

John Goerzen writes:
With networking, you must be careful not to attempt to read more data than the server hands back, or else you'll block. [...] With a protocol such as IMAP, there is no way to know until a server response is being parsed, how many lines (or bytes) of data to read.
The approach I recommend is to run a scanner (tokenizer) before the actual parser. IMAP, like most other RFC protocols, is line-based; so you can use a very simple scanner to read a CRLF-terminated line efficiently (using non-blocking I/O, for example), which you can then feed into the parser just fine because you know that it has to contain a complete request (response) that you can handle. Peter

On 2005-09-15, Peter Simons
The approach I recommend is to run a scanner (tokenizer) before the actual parser.
IMAP, like most other RFC protocols, is line-based; so you can use a very simple scanner to read a CRLF-terminated line efficiently (using non-blocking I/O, for example), which you can then feed into the parser just fine because you know that it has to contain a complete request (response) that you can handle.
I thought of that, but that isn't really true for IMAP. IMAP responses can span many, many lines (for instance, it can return a list of all matching messages in a folder, or multiple bits of status results). Or they can use only one line. Not only that, but IMAP has a way where you can embed, say {305} instead of a string. That means, "after you finish reading this line, read exactly 305 bytes, and consider that to be used here." But if you see "{305}" (the double quotes indicating a string), this is just a string containing the text {305}. So, to make that approach work, I would really need to do a lot of work outside of Parsec -- the stuff that I really want to use Parsec for, I think. -- John

On 9/15/05, John Goerzen
Not only that, but IMAP has a way where you can embed, say {305} instead of a string. That means, "after you finish reading this line, read exactly 305 bytes, and consider that to be used here." But if you see "{305}" (the double quotes indicating a string), this is just a string containing the text {305}.
So, to make that approach work, I would really need to do a lot of work outside of Parsec -- the stuff that I really want to use Parsec for, I think.
Well, you do have a state monad to work with. Why not just stuff the number 305 into your state, keep reading until you've read 305 bytes (decrementing the count as you read), and return the 305-byte string as your result for this parser? When you resume, you should be ready to parse the next very token after the 305-byte string. -- Adam

On 2005-09-15, Adam Turoff
On 9/15/05, John Goerzen
wrote: So, to make that approach work, I would really need to do a lot of work outside of Parsec -- the stuff that I really want to use Parsec for, I think.
Well, you do have a state monad to work with. Why not just stuff the number 305 into your state, keep reading until you've read 305 bytes (decrementing the count as you read), and return the 305-byte string as your result for this parser? When you resume, you should be ready to parse the next very token after the 305-byte string.
It's unclear to me exactly how to mix the IO monad with Parsec. It doesn't really seem to be doable. Not to mention that if hGetContents is used, the Handle has to be put into non-buffering mode, which means one syscall per character read. Terribly slow.

John Goerzen wrote:
On 2005-09-15, Adam Turoff
wrote: On 9/15/05, John Goerzen
wrote: So, to make that approach work, I would really need to do a lot of work outside of Parsec -- the stuff that I really want to use Parsec for, I think.
Well, you do have a state monad to work with. Why not just stuff the number 305 into your state, keep reading until you've read 305 bytes (decrementing the count as you read), and return the 305-byte string as your result for this parser? When you resume, you should be ready to parse the next very token after the 305-byte string.
It's unclear to me exactly how to mix the IO monad with Parsec. It doesn't really seem to be doable.
Not to mention that if hGetContents is used, the Handle has to be put into non-buffering mode, which means one syscall per character read. Terribly slow.
Does it? I didn't think so ... Keean.

On Tue, Sep 20, 2005 at 02:29:12PM +0100, Keean Schupke wrote:
It's unclear to me exactly how to mix the IO monad with Parsec. It doesn't really seem to be doable.
Not to mention that if hGetContents is used, the Handle has to be put into non-buffering mode, which means one syscall per character read. Terribly slow.
Does it? I didn't think so ...
strace seems to say yes. If buffering is used, it blocks on attempts to read more than is available. -- John

John Goerzen wrote:
On Tue, Sep 20, 2005 at 02:29:12PM +0100, Keean Schupke wrote:
It's unclear to me exactly how to mix the IO monad with Parsec. It doesn't really seem to be doable.
Not to mention that if hGetContents is used, the Handle has to be put into non-buffering mode, which means one syscall per character read. Terribly slow.
Does it? I didn't think so ...
strace seems to say yes.
Thats odd, the source code seems to suggest that when you read past the end of the buffer it reads the next entire buffer (it has cases for each possible buffer configuration, line, block and none) - and I can think of no reason _why_ it cannot use buffering... I would think that it's a bug if it is the case. Regards, Keean.

On Tue, Sep 20, 2005 at 03:05:25PM +0100, Keean Schupke wrote:
strace seems to say yes.
Thats odd, the source code seems to suggest that when you read past the end of the buffer it reads the next entire buffer (it has cases for each possible buffer configuration, line, block and none) - and I can think of no reason _why_ it cannot use buffering... I would think that it's a bug if it is the case.
Because the next entire buffer might consume more data than the remote has sent. That results in deadlock. -- John

John Goerzen wrote:
On Tue, Sep 20, 2005 at 03:05:25PM +0100, Keean Schupke wrote:
strace seems to say yes.
Thats odd, the source code seems to suggest that when you read past the end of the buffer it reads the next entire buffer (it has cases for each possible buffer configuration, line, block and none) - and I can think of no reason _why_ it cannot use buffering... I would think that it's a bug if it is the case.
Because the next entire buffer might consume more data than the remote has sent. That results in deadlock.
Would it not be usual to have a timeout incase of dropped connection? Regards, Keean. (Btw, did you look at the Parser Monad-Transformer?)

On Tue, Sep 20, 2005 at 03:20:01PM +0100, Keean Schupke wrote:
Because the next entire buffer might consume more data than the remote has sent. That results in deadlock.
Would it not be usual to have a timeout incase of dropped connection?
Yes, but hardly useful if it happens after issuing every command ;-)
(Btw, did you look at the Parser Monad-Transformer?)
Not yet, but thanks for sending it along. -- John

Here's the code from hGetContents (base/GHC/IO.lhs): -- we never want to block during the read, so we call fillReadBuffer with -- is_line==True, which tells it to "just read what there is". lazyReadBuffered h handle_ fd ref buf = do catch (do buf <- fillReadBuffer fd True{-is_line-} (haIsStream handle_) buf lazyReadHaveBuffer h handle_ fd ref buf ) -- all I/O errors are discarded. Additionally, we close the handle. (\e -> do handle_ <- hClose_help handle_ return (handle_, "") ) So, it reads whatever is available, further description is available from the definition of fillReadBuffered: -- For a line buffer, we just get the first chunk of data to arrive, -- and don't wait for the whole buffer to be full (but we *do* wait -- until some data arrives). This isn't really line buffering, but it -- appears to be what GHC has done for a long time, and I suspect it -- is more useful than line buffering in most cases. So for a disc buffer I would expect 1 complete buffer to be returned most of the time, for a network read, I guess one packet (MTUs) worth should be expected... Regards, Keean. Keean Schupke wrote:
John Goerzen wrote:
On Tue, Sep 20, 2005 at 02:29:12PM +0100, Keean Schupke wrote:
It's unclear to me exactly how to mix the IO monad with Parsec. It doesn't really seem to be doable.
Not to mention that if hGetContents is used, the Handle has to be put into non-buffering mode, which means one syscall per character read. Terribly slow.
Does it? I didn't think so ...
strace seems to say yes.
Thats odd, the source code seems to suggest that when you read past the end of the buffer it reads the next entire buffer (it has cases for each possible buffer configuration, line, block and none) - and I can think of no reason _why_ it cannot use buffering... I would think that it's a bug if it is the case.
Regards, Keean.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Tue, Sep 20, 2005 at 03:17:11PM +0100, Keean Schupke wrote:
-- For a line buffer, we just get the first chunk of data to arrive, -- and don't wait for the whole buffer to be full (but we *do* wait -- until some data arrives). This isn't really line buffering, but it -- appears to be what GHC has done for a long time, and I suspect it -- is more useful than line buffering in most cases.
So for a disc buffer I would expect 1 complete buffer to be returned most of the time, for a network read, I guess one packet (MTUs) worth should be expected...
Hmm, and checking my code, it appears that I did use line buffering for my FTP client. However, I am *sure* that I did run into some deadlock issues relating to buffering at some point. Sigh. So this is good and should work nicely with protocols such as SMTP and FTP. The other thing is that the Handle is not the most convenient way to work with a socket. As I mentioned, two Handles for a single socket must be opened, which is inconvenient and annoying, not to mention leads to some confusing semantics. You also have to be *very* careful to never consume more than you need. On the flip side, Parsec is really nice. I wonder how easy it would be to make it parse [Word8] instead of String? Or even a FastPackedString? (And how easy it would be to get that instead of a String from hGetContents)? -- John

On Tuesday 20 September 2005 16:50, John Goerzen wrote:
On the flip side, Parsec is really nice. I wonder how easy it would be to make it parse [Word8] instead of String?
Isn't Parsec parameterized over the token type?
Or even a FastPackedString? (And how easy it would be to get that instead of a String from hGetContents)?
From the FPS haddock: hGetContents :: Handle -> IO FastString Read entire handle contents into a FastString. This may or may not do what you want...it's probably not a lazy read. Ben

On Sep 20, 2005, at 6:32 PM, Benjamin Franksen wrote:
On Tuesday 20 September 2005 16:50, John Goerzen wrote:
On the flip side, Parsec is really nice. I wonder how easy it would be to make it parse [Word8] instead of String?
Isn't Parsec parameterized over the token type?
Or even a FastPackedString? (And how easy it would be to get that instead of a String from hGetContents)?
From the FPS haddock:
hGetContents :: Handle -> IO FastString
Read entire handle contents into a FastString.
This may or may not do what you want...it's probably not a lazy read.
It seems like there might need to be something like: hGetContentsLazily :: Handle -> IO [FastString] which returns file contents in chunks based on our ability to buffer the handle. If we can mmap the handle, we may get a singleton list with a giant FastString; if we are using a Socket or a terminal, each succeeding string might be the next chunk of available data from the handle. I had the impression the internals of getContents from the prelude worked a bit like this (in GHC, anyway). -Jan-Willem Maessen
Ben _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Wednesday 21 September 2005 19:36, Jan-Willem Maessen wrote:
On Sep 20, 2005, at 6:32 PM, Benjamin Franksen wrote:
On Tuesday 20 September 2005 16:50, John Goerzen wrote:
On the flip side, Parsec is really nice. I wonder how easy it would be to make it parse [Word8] instead of String?
Isn't Parsec parameterized over the token type?
Or even a FastPackedString? (And how easy it would be to get that instead of a String from hGetContents)?
From the FPS haddock:
hGetContents :: Handle -> IO FastString
Read entire handle contents into a FastString.
This may or may not do what you want...it's probably not a lazy read.
It seems like there might need to be something like:
hGetContentsLazily :: Handle -> IO [FastString]
which returns file contents in chunks based on our ability to buffer the handle. If we can mmap the handle, we may get a singleton list with a giant FastString; if we are using a Socket or a terminal, each succeeding string might be the next chunk of available data from the handle.
From the FPS haddock: data LazyFile Constructors LazyString String MMappedFastString FastString LazyFastStrings [FastString] Instances Eq LazyFile readFileLazily :: FilePath -> IO LazyFile That comes pretty near. Unfortunately it works on a file name, not a handle. Thus it cannot be used for a socket or such things. Ben

On Wed, Sep 21, 2005 at 12:32:56AM +0200, Benjamin Franksen wrote:
On Tuesday 20 September 2005 16:50, John Goerzen wrote:
On the flip side, Parsec is really nice. I wonder how easy it would be to make it parse [Word8] instead of String?
Isn't Parsec parameterized over the token type?
Or even a FastPackedString? (And how easy it would be to get that instead of a String from hGetContents)?
From the FPS haddock:
hGetContents :: Handle -> IO FastString
Read entire handle contents into a FastString.
This may or may not do what you want...it's probably not a lazy read.
If it can be implemented via 'mmap' then it is effectivly a lazy read. and _very_ efficient to boot. John -- John Meacham - ⑆repetae.net⑆john⑈

On Wednesday 21 September 2005 20:17, John Meacham wrote:
On Wed, Sep 21, 2005 at 12:32:56AM +0200, Benjamin Franksen wrote:
On Tuesday 20 September 2005 16:50, John Goerzen wrote:
On the flip side, Parsec is really nice. I wonder how easy it would be to make it parse [Word8] instead of String?
Isn't Parsec parameterized over the token type?
Or even a FastPackedString? (And how easy it would be to get that instead of a String from hGetContents)?
From the FPS haddock:
hGetContents :: Handle -> IO FastString
Read entire handle contents into a FastString.
This may or may not do what you want...it's probably not a lazy read.
If it can be implemented via 'mmap' then it is effectivly a lazy read. and _very_ efficient to boot.
True, I forgot mmap. Taking a look at the implementation reveals that it indeed uses mmap (if available on the platform). Ben

On Thu, Sep 15, 2005 at 11:09:25AM -0500, John Goerzen wrote:
The recent thread on binary parsing got me to thinking about more general network protocol parsing with parsec. A lot of network protocols these days are text-oriented, so seem a good fit for parsec.
However, the difficulty I come up time and again is: parsec normally expects to parse as much as possible at once.
With networking, you must be careful not to attempt to read more data than the server hands back, or else you'll block.
I've had some success with hGetContents on a socket and feeding it into extremely carefully-crafted parsers, but that is error-prone and ugly.
I don't see why this would be more error-prone than any other approach. As for ugly, it might be somewhat more pleasant if Parsec could take input from a monadic action, but hGetContents works, and if you want more control (eg, reading from a socket fd directly), you can use unsafeInterleaveIO yourself. I wrote a parser for s-expressions that must not read beyond the final ')', and while I agree it is tricky, it's all necessary trickiness. Note I use lexeme parsers as in the Parsec documentation, and use an "L" suffix in their names. -- do not eat trailing whitespace, because we want to process a request from -- a lazy stream (eg socket) as soon as we see the closing paren. sexpr :: Parser a -> Parser (Sexpr a) sexpr p = liftM Atom p <|> cons p cons :: Parser a -> Parser (Sexpr a) cons p = parens tailL where tailL = do dotL sexprL p <|> liftM2 Cons (sexprL p) tailL <|> return Nil sexprL :: Parser a -> Parser (Sexpr a) sexprL p = lexeme (sexpr p) consL :: Parser a -> Parser (Sexpr a) consL p = lexeme (cons p) top p = between whiteSpace eof p lexeme p = do r <- p whiteSpace return r whiteSpace = many space dotL = lexeme (string ".") -- NB: eats whitespace after opening paren, but not closing parens p = between (lexeme (string "(")) (string ")") p Andrew

On Thu, Sep 15, 2005 at 06:11:58PM -0700, Andrew Pimlott wrote:
I don't see why this would be more error-prone than any other approach.
Hmm... I take that back. I don't know anything about the IMAP protocol, but after imagining for a few moments what it might be like, I can see how it could be more difficult than my example. The user state of the parser might help you... Andrew

On 2005-09-16, Andrew Pimlott
On Thu, Sep 15, 2005 at 06:11:58PM -0700, Andrew Pimlott wrote:
I don't see why this would be more error-prone than any other approach.
Hmm... I take that back. I don't know anything about the IMAP protocol, but after imagining for a few moments what it might be like, I can see how it could be more difficult than my example.
The user state of the parser might help you...
Hmm, can you elaborate on that? Basically, I *really* want to get away frmo having to use hGetContents. It is just not at all friendly for an interactive netwrk protocol. If I were just streaming a large file from an FTP server, it would be fine, but even using it to begin with involves using Handles in a nonstandard way (since there must be a separate Handle for writing, since hGetContents sents the Handle to be half-closed) that is apparently not well-supported. -- John

You may like my parser transformer then (based on the efficent backtracking parser paper, I believe by Ralf Heinze - uses endofunctor and continuation passing - Its a long time since I tested it but I think it holds its own against Parsec, without requiring the extra return types). -- parser.hs: Copyright (C)2001,2002 Keean Schupke. -- -- Polymorphic monadic consumer based parser. module Lib.Monad.ParserT(ParserT(..)) where import Control.Monad hiding (guard) import Control.Monad.Error import Lib.Monad.MonadT import Lib.Monad.MonadState import Lib.Monad.MonadParser import Lib.Monad.MonadControl import Lib.Arrow.Runnable ------------------------------------------------------------------------------ -- An continuation passing endomorphic parser type Cps a r = (a -> r) -> r type Endo r = r -> r newtype ParserT r tok m a = PT (Cps a ([tok] -> Endo (m r))) instance Monad m => Functor (ParserT r tok m) where fmap g (PT m) = PT $ \k -> m (\a s f -> k (g a) s f) instance Monad m => Monad (ParserT r tok m) where {-# INLINE return #-} return a = PT $ \k -> k a {-# INLINE (>>=) #-} (PT m) >>= f = PT $ \k -> m (\a -> (\(PT x) -> x) (f a) k) instance Monad m => MonadPlus (ParserT r tok m) where {-# INLINE mzero #-} mzero = PT $ \_ _ f -> f {-# INLINE mplus #-} mplus (PT m) (PT n) = PT $ \k s -> m k s . n k s instance MonadPlus m => MonadT (ParserT r tok) m where {-# INLINE up #-} up m = PT $ \k s f -> (m >>= \a -> k a s mzero) `mplus` f {-# INLINE down #-} down = undefined instance (MonadPlus m,MonadT (ParserT r tok) m,Runnable ([tok] -> m ([tok],r)) ([tok] -> n ([tok],r))) => Runnable (ParserT ([tok],r) tok m r) ([tok] -> n ([tok],r)) where run = run . (\(PT m) t -> m (\a t' f -> return (t',a) `mplus` f) t mzero) instance (MonadPlus m,MonadT (ParserT r tok) m) => Runnable (ParserT ([tok],r) tok m r) ([tok] -> m ([tok],r)) where run = (\(PT m) t -> m (\a t' f -> return (t',a) `mplus` f) t mzero) instance Monad m => MonadState [tok] (ParserT r tok m) where {-# INLINE update #-} update st = PT $ \k s -> k s ((st s) `asTypeOf` s) setState st = PT $ \k _ -> k () st getState = PT $ \k s -> k s s instance Monad m => MonadParser tok (ParserT r tok m) where {-# INLINE item #-} item = PT $ \k s -> case s of [] -> id (a:x) -> k a x instance (MonadPlus (t m),MonadParser tok m,MonadT t m) => MonadParser tok (t m) where item = up item instance Monad m => MonadControl (ParserT r tok m) where {-# INLINE once #-} once (PT m) = PT $ \k s f -> m (\a s' _ -> k a s' f) s f Regards, Keean. John Goerzen wrote:
On 2005-09-16, Andrew Pimlott
wrote: On Thu, Sep 15, 2005 at 06:11:58PM -0700, Andrew Pimlott wrote:
I don't see why this would be more error-prone than any other approach.
Hmm... I take that back. I don't know anything about the IMAP protocol, but after imagining for a few moments what it might be like, I can see how it could be more difficult than my example.
The user state of the parser might help you...
Hmm, can you elaborate on that?
Basically, I *really* want to get away frmo having to use hGetContents. It is just not at all friendly for an interactive netwrk protocol. If I were just streaming a large file from an FTP server, it would be fine, but even using it to begin with involves using Handles in a nonstandard way (since there must be a separate Handle for writing, since hGetContents sents the Handle to be half-closed) that is apparently not well-supported.
-- John
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Here's some useful definitions to go with that... module Lib.Parser.Parser(Parser,when,unless,guard,(|>),opt,many,many1,sepBy, parse,alpha,digit,lower,upper,other,lexical,satisfy,optional,literal,untilP,untilParser,matchP) where ... (see attachment for files) Regards, Keean.

On Tue, Sep 20, 2005 at 12:54:26PM +0000, John Goerzen wrote:
On 2005-09-16, Andrew Pimlott
wrote: On Thu, Sep 15, 2005 at 06:11:58PM -0700, Andrew Pimlott wrote:
I don't see why this would be more error-prone than any other approach.
Hmm... I take that back. I don't know anything about the IMAP protocol, but after imagining for a few moments what it might be like, I can see how it could be more difficult than my example.
The user state of the parser might help you...
Hmm, can you elaborate on that?
I was imagining that the IMAP protocol does not have delimiting characters (like s-expressions), and that instead you are told to read a certain number of bytes, tokens, lines, etc. So you could maintain the number in the state and check it before any parser that might read too far. Don't know if this has anything to do with your actual problem.
Basically, I *really* want to get away frmo having to use hGetContents.
I found writing my own lazy read pretty easy, after having some problems using handles with sockets: recv' s = do r <- IO.try (recv s (1024 * 8)) case r of Left err -> if isEOFError err then return "" else ioError err Right msg -> do msg' <- unsafeInterleaveIO (recv' s) return (msg ++ msg') So even if you don't want to use handles, you can still use parsec without too much trouble. Andrew

On 2005 September 15 Thursday 12:09, John Goerzen wrote:
However, the difficulty I come up time and again is: parsec normally expects to parse as much as possible at once.
With networking, you must be careful not to attempt to read more data than the server hands back, or else you'll block.
I've had some success with hGetContents on a socket and feeding it into extremely carefully-crafted parsers, but that is error-prone and ugly.
Here's the problem. With a protocol such as IMAP, there is no way to know until a server response is being parsed, how many lines (or bytes) of data to read. Ideally, I would be able to slrup in more data as I go, but that doesn't seem to be very practical in Parsec either.
Assuming I've understood the gist of Koen Claessen's "Parallel Parsing Processes", its implementation of the Parsec interface returns all possible parses, in the order of how much input they consume. Also, no more input is consumed than necessary. For the purpose of parsing network input, that's superior to the usual order in which parse alternatives are considered. The Parsec interface supports lookahead, which implies examining beyond what is consumed. That could be error-prone, but I expect lookahead is considerably easier to manage than Parsec's order of considering alternatives.
participants (9)
-
Adam Turoff
-
Andrew Pimlott
-
Benjamin Franksen
-
Jan-Willem Maessen
-
John Goerzen
-
John Meacham
-
Keean Schupke
-
Peter Simons
-
Scott Turner