Brainstorming on how to parse IMAP

Hi folks, I'm interested in writing a library to work with IMAP servers. I'm interested in thoughts people have on parsing libraries and methods. I'm a huge fan of Parsec overall -- it lets me have a single-stage parser, for instance. But it isn't sufficiently lazy for this task, and I probably will need to deal with ByteStrings instead of Strings, since some IMAP messages may be 30MB or more. So to give a very, very brief rundown of RFC3501, there are lots of ways that an IMAP server can encode things. For instance, we could see this: A283 SEARCH "TEXT" "string not in mailbox" which is the same as: A283 SEARCH TEXT "string not in mailbox" and the same as: A283 SEARCH {4} "string not in mailbox" TEXT The braces mean that the given number of octets follows after the CRLF at the end of the given line. We could even see: A283 SEARCH {4} {21} TEXTstring not in mailbox Note that when downloading messages, I would fully expect to see things like * FETCH {10485760} representing a 10MB message. Also, quoted strings have escaping rules. [ please note that the above is paraphrased and isn't really true RFC3501 for simplicity sake ] Now then... some goals. 1) Ideally I could parse stuff lazily. I have tried this with FTP and it is more complex than it seems at first, due to making sure you never, never, never consume too much data. But being able to parse lazily would make it so incredibly easy to issue a command saying "download all new mail", and things get written to disk as they come in, with no buffer at all. 2) Avoiding Strings wherever possible. 3) Avoiding complex buffering schemes where I have to manually buffer data packets. Thoughts and ideas? BTW, if any of you have heard of OfflineIMAP, yes I am considering rewriting OfflineIMAP in Haskell. -- John

On Sat, 02 Aug 2008 21:04:28 -0500
John Goerzen
The braces mean that the given number of octets follows after the CRLF at the end of the given line. We could even see:
A283 SEARCH {4} {21} TEXTstring not in mailbox
I don't think it's quite that bad. The literal count must immediately precede the value -- {4}\r\nTEXT -- the way I read it. I think most servers use this mechanism for mail message data only, but seems to me there's one out there that may occasionally slip a literal into LIST results.
1) Ideally I could parse stuff lazily. I have tried this with FTP and it is more complex than it seems at first, due to making sure you never, never, never consume too much data. But being able to parse lazily would make it so incredibly easy to issue a command saying "download all new mail", and things get written to disk as they come in, with no buffer at all.
I'm not sure what that means, but to start at the beginning, ideally the IMAP parser would be pure, right?
2) Avoiding Strings wherever possible.
It certainly makes sense to me that message data would be bytestring, not only the "body" of the message but header fields as well. Flags, for me, would be string, unless you want to parse the standard flags into a sum type. No big deal, maybe bytestrings are more convenient than I realize.
3) Avoiding complex buffering schemes where I have to manually buffer data packets.
This sounds to me like the application's problem, not the parser's?
I actually wrote the beginnings of an IMAP parser, for my own entertainment.
Substantially incomplete, in terms of support for various IMAP responses,
but it works against a couple of IMAP servers, and it supports GSSAPI
authentication and SSL.
Well, of course the IMAP parsing code itself has no idea about GSSAPI
or SSL, that being the application's job, but I think it's worth looking
at how for example GSSAPI authentication works with IMAP, while designing
the parser.
In the most general view, it may return either (response, remainder) or
(insufficient-data), and in the latter case the application gets more
data from the server and tries again.
--
Donn Cave

John Goerzen wrote:
I'm interested in writing a library to work with IMAP servers.
I'm interested in thoughts people have on parsing libraries and methods. I'm a huge fan of Parsec overall -- it lets me have a single-stage parser, for instance. But it isn't sufficiently lazy for this task, and I probably will need to deal with ByteStrings instead of Strings, since some IMAP messages may be 30MB or more.
You might be interested in Frisby: http://repetae.net/computer/frisby/ Cheers Ben

Ben Franksen wrote:
John Goerzen wrote:
I'm interested in writing a library to work with IMAP servers.
I'm interested in thoughts people have on parsing libraries and methods. I'm a huge fan of Parsec overall -- it lets me have a single-stage parser, for instance. But it isn't sufficiently lazy for this task, and I probably will need to deal with ByteStrings instead of Strings, since some IMAP messages may be 30MB or more.
You might be interested in Frisby: http://repetae.net/computer/frisby/
Are there any examples available for it anywhere?
Cheers Ben
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

John Goerzen wrote:
Ben Franksen wrote:
John Goerzen wrote:
I'm interested in writing a library to work with IMAP servers.
I'm interested in thoughts people have on parsing libraries and methods. I'm a huge fan of Parsec overall -- it lets me have a single-stage parser, for instance. But it isn't sufficiently lazy for this task, and I probably will need to deal with ByteStrings instead of Strings, since some IMAP messages may be 30MB or more.
You might be interested in Frisby: http://repetae.net/computer/frisby/
Are there any examples available for it anywhere?
I am pretty sure John Meacham has some examples but I personally don't know any. I have once used it for something but in the end went back to parsec because it was slightly faster and I had no need for the lazyness. Cheers Ben

+++ John Goerzen [Aug 05 08 10:13 ]:
I'm interested in thoughts people have on parsing libraries and methods. I'm a huge fan of Parsec overall -- it lets me have a single-stage parser, for instance. But it isn't sufficiently lazy for this task, and I probably will need to deal with ByteStrings instead of Strings, since some IMAP messages may be 30MB or more.
You might be interested in Frisby: http://repetae.net/computer/frisby/
Are there any examples available for it anywhere?
I wrote a markdown parser using Frisby: http://github.com/jgm/markdown-peg/tree/master John

1) Ideally I could parse stuff lazily. I have tried this with FTP and it is more complex than it seems at first, due to making sure you never, never, never consume too much data.
PolyParse has lazy variants of its combinators, which would probably be of use here. Software: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/polyparse Paper: http://www-users.cs.york.ac.uk/~malcolm/partialparse.html
2) Avoiding Strings wherever possible.
PolyParse is completely independent of token type. There is a rudimentary module implementing the combinators on top of ByteString. (Not currently exposed in the package interface, BTW.) I'm sure it would not be difficult to build on top of LazyByteString also. Patches gratefully accepted.
3) Avoiding complex buffering schemes where I have to manually buffer data packets.
Provided the parser is lazy enough, the buffering would be done at the level of Lazy ByteString (but I don't know whether LBS is suitable for network-reading or not).
Thoughts and ideas?
A283 SEARCH {4} {21} TEXTstring not in mailbox
Assuming the first line can be read strictly, and the remainder should be lazy, the parser might look something like this: searchResult = do n <- qnumber word "SEARCH" extents <- many1 (braces number) return SearchResult n `apply` (mapAF (\m-> exactly m char) extents) where -- embed map in an applicative functor mapAF :: (a -> Parser b) -> [a] -> Parser [b] mapAF p [] = return [] mapAF p (x:xs) = return (:) `apply` p x `apply` mapAF p xs Regards, Malcolm

On 2008 Aug 4, at 6:10, Malcolm Wallace wrote:
A283 SEARCH {4} {21} TEXTstring not in mailbox
Assuming the first line can be read strictly, and the remainder should be lazy, the parser might look something like this:
FWIW, the actual data looks more like A283 SEARCH {4} TEXT {21} string not in mailbox (the literal declaration is followed immediately by CRLF, the literal, CRLF, then the command line context resumes) also, literals require multiple exchanges with the server unless the LITERAL+ extension is present and being used ({4+} instead of {4}). -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Hi folks, Thanks to those that responded on this -- certainly some libraries to check out here. One problem with that is that if I use specific parsing library foo, then only others that are familiar with specific parsing library foo can hack on it. In general though, I think this speaks to more generic problems: 1) A lot of network protocols require reading data of arbitrary length until a certain delimiter is found. Often that delimiter is \n. Haskell is really weak at this. We can turn a Socket into a Handle and use hGetLine, but this has a security weakness: it has no upper bounds on the amount of data read, and this is vulnerable to resource exhaustion DOS from the remote end. There is, as far as I can tell, no general-purpose "buffer until I see foo" framework in Haskell. (Note that just reading character-by-character is too slow as well). 1a) Even more generally, a "read one packet of data, however much becomes available" notion is pretty weak. For a lazy ByteString, my only two choices are to block until n bytes are available (where n is specified in advance), or to not block at all. There is no "block until some data, however much, becomes available, and return that chunk up to a maxmimum size x." Well, Network.Socket.recv may do this, but it returns a String. Is there even a way to do this with a ByteString? 2) A lot of RFC protocols -- and IMAP in particular -- can involve complex responses from the server with hierarchical data, and the parse of, say, line 1 and of each successive line can indicate whether or not to read more data from the server. Parsing of these lines is a stateful activity. 3) The linkage between Parsec and IO is weak. I cannot write an "IMAPResponse" parser. I would have a write a set of parsers to parse individual components of the IMAP response as part of the IO monad code that reads the IMAP response, since the result of one dictates how much network data I attempt to read. -- John

Hi John, I recently posted new and fancy binary Get monads in http://article.gmane.org/gmane.comp.lang.haskell.libraries/9691 and http://article.gmane.org/gmane.comp.lang.haskell.libraries/9756 which might be of interest since network protocol are usually specified in bytes at the wire level. The latest one takes input which may or may not be complete and returns a stream (a Seq) of results. When it reaches the end of the input it pauses and asks for more. This handling of partial input might be good for network protocols where you can feed the data from the socket to the parser in pieces. (This Get monad eats lazy bytestrings). The latest MyGetW.hs allows the Get code to send a Data.Sequence.Seq of results by using yieldItem (and perhaps flushItems). This is in addition to any final result of the parser. It has the usual binary Get interface, so you can pull bytestrings and words and (fancy) any Storable off the input. I call it "fancy" because the monad is a transformer, and it is a MonadError/MonadPlus/Alternative, and it supports lookAhead and callCC/MonadCont and Reader/Writer/State. Whew. As for IMAP, I use imapfilter (http://imapfilter.hellug.gr/) which uses Lua. Cheers, Chris

ChrisK wrote:
I recently posted new and fancy binary Get monads in http://article.gmane.org/gmane.comp.lang.haskell.libraries/9691 and http://article.gmane.org/gmane.comp.lang.haskell.libraries/9756 which might be of interest since network protocol are usually specified in bytes at the wire level.
The latest one takes input which may or may not be complete and returns a stream (a Seq) of results.
IIRC Seq is not a 'Stream' but a strict sequence? Or do you meant 'a stream (of Seq)'? Cheers Ben

I am glad you asked Ben, Short answer: It can return a Seq of your values. The values in the Seq are lazy, the Seq itself is finite. It can return what it has so far before it finishes parsing (or even before the rest of the input has arrived from the network). Ben Franksen wrote:
ChrisK wrote:
I recently posted new and fancy binary Get monads in http://article.gmane.org/gmane.comp.lang.haskell.libraries/9691 and http://article.gmane.org/gmane.comp.lang.haskell.libraries/9756 which might be of interest since network protocol are usually specified in bytes at the wire level.
The latest one takes input which may or may not be complete and returns a stream (a Seq) of results.
IIRC Seq is not a 'Stream' but a strict sequence? Or do you meant 'a stream (of Seq)'?
I meant it returns many (Seq y), one after the other, while doing parsing in between.
Cheers Ben
Long answer: The complicated parser looks like this. Start with the run function: runCompGet :: (Monad m,Monoid w) => CompGet a y r w user m a -> r -> user -> L.ByteString -> m (CompResult y w user m a) This takes a CompGet and a reader state r and a user state user and the (initial) input L.ByteString (Data.ByteString.Lazy.ByteString). It evaluates to the inner monad 'm' returning a CompResult. The CompResult is a three-fold type:
data CompResult y w user m a = CFailed (Seq y) !Int64 String | CFinished (Seq y) !L.ByteString !Int64 w user a | CPartial (Seq y) (Either ( m (CompResult y w user m a) ) ( Maybe L.ByteString -> m (CompResult y w user m a) ))
All three have (Seq y) which are the Data.Sequence.Seq of things which have been queued by "yieldItem". CFailed also has the Int64 count of bytes parsed successfully and an error message String. Nothing more can be done. CFinished also has the unused tail of the input as a L.ByteString and an Int64 of the bytes consumed. And the output of the writer w, the final user state, and lastly it has the end value returned by the computation which has type 'a'. Nothing more can be done. CPartial is the intermediate result. It also carries Either: Left : the rest of the computation, currently suspended, to continue running. Right: a function from (Maybe ByteString) to the suspended computation. The Left is a result of the "flushItems" command and is merely a way to return the (Seq y) so far before continuing. The Right is a result of running out of input data. This allows the program to feed more input into the parser which will be appended to all the previous input. One does this by passing (Just someByteString) to the function. If the parser again runs out of data it will again return CPartial with a Right value. Alternatively, one can pass Nothing. This tells the parser that there will never ever be more input. The parser will never ask for (though it may flushItems and return a Right valued CPartial). A key thing about the (Seq y) is that yielded items are only returned once. The CPartial may be returned many times and each time it will have an empty list or fresh list of (Seq y). The values in the Seq are lazy, the Seq itself is finite. To collect all the value the caller has to concatenate all the (Seq y)'s that are returned during parsing. As for parsing, the module offers the usual BinaryParser interface (package binary-strict) and has an interface which mostly overlaps Data.Binary.Get (package binary). For example: it has "getByteString" and "getWord64be" and "getStorable". You don't have to use the "yieldItem" command. You can just return the results in the final "a" return value (or the "user" state or the writer "w" value). In this situation you only get an answer when CFinished is returned (and nothing if CFailed is returned). I could not use the writer mechanism for yield-ing because the "listen" and "pass" parts of the MonadWriter class ensure it has the wrong semantics. You might wonder: *) If the parser code uses MonadPlus to give several alternatives *) The first alternative gets more input via CPartial (perhaps several times) *) The first alternative then fails *) The second alternative starts parsing from the same position the first did *) Does the second alternative see the new input passed to CPartial earlier? The answer is yes. Changes to the input stream by appending with CPartial affect the whole computation and are never rolled back. You might wonder: *) If the first alternative calls "yieldItem foo" *) The first alternative fails *) The second alternative calls flushItems *) Does the CPartial (Seq y) contain "foo"? The answer is yes. Items yielded are never rolled back. [ Doing all of this in the presence of throwError/mzero/fail and lookAhead* and callCC was interesting to code. ] Cheers, Chris
participants (7)
-
Ben Franksen
-
Brandon S. Allbery KF8NH
-
ChrisK
-
Donn Cave
-
John Goerzen
-
John MacFarlane
-
Malcolm Wallace