Defining a custom parser using Parsec

Hi everyone,

I'm working on a digital forensics application that will take a file with lines of the following format:

    "MD5|name|inode|mode_as_string|UID|GID|size|atime|mtime|ctime|crtime"

This string represents the metadata associated with a particular file in the filesystem. I created a data type to represent the information that I will need to perform my analysis:

    data Event = Event
      { fn     :: B.ByteString
      , mftNum :: B.ByteString
      , ft     :: B.ByteString
      , fs     :: Integer
      , time   :: Integer
      , at     :: AccessType
      , mt     :: AccessType
      , ct     :: AccessType
      , crt    :: AccessType
      } deriving (Show)

    data AccessType = ATime | MTime | CTime | CrTime
      deriving (Show)

I would like to create a function that takes the ByteString representing the file and returns a list of Events:

    createEvents :: ByteString -> [Event]

(For now I'm creating a list, but depending on the type of analysis I decide to do, I may change this data structure.)

I understand that I can use the Parsec library to do this. I read RWH and noticed they have the endBy and sepBy combinators, but my issue with these is that using these functions performs too many transformations on the data: endBy will return a list of strings, which will then be fed to sepBy, which will return a [[ByteString]] that I will then have to iterate through to create the final [Event].

What I would like to do is define a custom parser that will go from the ByteString to the [Event] without the overhead of those intermediate steps. This function needs to be as fast as possible, as these files can be rather large, and I will be performing many different tests and analyses on the data. I don't want the parsing to be a bottleneck.

I'm under the impression that the Parsec library will allow me to define a custom parser to do this, but I'm having problems understanding the library and its documentation.

A gentle shove in the right direction would be greatly appreciated.

Thanks for your help,
Jimmy

On Sun, Oct 17, 2010 at 22:59, Jimmy Wylie wrote:
Hi everyone,
I'm working on a digital forensics application that will take a file with lines of the following format:
"MD5|name|inode|mode_as_string|UID|GID|size|atime|mtime|ctime|crtime"
This string represents the metadata associated with a particular file in the filesystem.
I created a data type to represent the information that I will need to perform my analysis:
data Event = Event
  { fn     :: B.ByteString
  , mftNum :: B.ByteString
  , ft     :: B.ByteString
  , fs     :: Integer
  , time   :: Integer
  , at     :: AccessType
  , mt     :: AccessType
  , ct     :: AccessType
  , crt    :: AccessType
  } deriving (Show)
data AccessType = ATime | MTime | CTime | CrTime deriving (Show)
I would like to create a function that takes the ByteString representing the file and returns a list of Events: createEvents :: ByteString -> [Event] (For now I'm creating a list, but depending on the type of analysis I decide to do, I may change this data structure.)
I understand that I can use the Parsec library to do this. I read RWH and noticed they have the endBy and sepBy combinators, but my issue with these is that using these functions performs too many transformations on the data: endBy will return a list of strings, which will then be fed to sepBy, which will return a [[ByteString]] that I will then have to iterate through to create the final [Event].
What I would like to do is define a custom parser that will go from the ByteString to the [Event] without the overhead of those intermediate steps. This function needs to be as fast as possible, as these files can be rather large, and I will be performing many different tests and analyses on the data. I don't want the parsing to be a bottleneck.
This sounds an awful lot like premature optimisation, which, as we all know, is the root of all evil :-)

Why do you think that using Parsec will result in fewer transformations? (It will most likely result in fewer transformations *that you see*, but that doesn't mean much.)
I'm under the impression that the Parsec library will allow me to define a custom parser to do this, but I'm having problems understanding the library and its documentation.
A gentle shove in the right direction would be greatly appreciated.
AFAIK Parsec deals with String, not ByteString; have a look at the attoparsec library[1] instead.

There are numerous explanations of using parser combinators out there. Personally I've found the Parsec documentation fairly easy to understand. A while ago I wrote a few posts on it myself, and I think they should translate well to attoparsec (you will probably have to keep the haddock docs at hand, though):

http://therning.org/magnus/archives/289
http://therning.org/magnus/archives/290
http://therning.org/magnus/archives/295
http://therning.org/magnus/archives/296

/M

[1]: http://hackage.haskell.org/package/attoparsec-0.8.1.1

--
Magnus Therning (OpenPGP: 0xAB4DFBA4)
magnus@therning.org          Jabber: magnus@therning.org
http://therning.org/magnus   identi.ca|twitter: magthe
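P.S. Module names have moved around in attoparsec since the release linked above, so take this as a rough sketch of the shape such a parser could have with a current attoparsec. The cut-down three-column format and every name below are invented purely for illustration:

    import Control.Applicative (many, (<$>), (<*>), (<*))
    import Data.Attoparsec.ByteString.Char8
           (Parser, char, decimal, endOfLine, parseOnly, takeWhile1)
    import qualified Data.ByteString.Char8 as B

    -- One '|'-delimited column, taken as a raw ByteString slice.
    column :: Parser B.ByteString
    column = takeWhile1 (\c -> c /= '|' && c /= '\n')

    -- A made-up three-column row (md5|name|size), parsed straight to a
    -- value with no intermediate [[ByteString]]. Each line must end in
    -- a newline for 'many row' to pick it up.
    row :: Parser (B.ByteString, B.ByteString, Integer)
    row = (,,) <$> column  <* char '|'
               <*> column  <* char '|'
               <*> decimal <* endOfLine

    parseAll :: B.ByteString -> Either String [(B.ByteString, B.ByteString, Integer)]
    parseAll = parseOnly (many row)

Because takeWhile1 hands back slices of the input ByteString rather than rebuilding strings, this style avoids most of the copying the original post worried about.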

On 18 October 2010 07:56, Magnus Therning wrote:
AFAIK Parsec deals with String, not ByteString; have a look at the attoparsec library[1] instead.
Parsec 3.0 deals with ByteStrings; Parsec 3.1 and higher improve the performance quite a bit.

From the original post, the syntax of what is to be parsed looks simple (essentially a single data type, no alternatives except the AccessType tag), so it can probably easily get by without combinators like sepBy and endBy. It would need a sample of the input data or a grammar for it to affirm this, though. A working (simple) parser in Parsec should be easy to convert to attoparsec etc. if it doesn't use combinators like sepBy.

    parseEvents :: Parser [Event]
    parseEvents = many event

    event :: Parser Event
    event = do
      fn_     <- parseFn
      mft_num <- parseMftNum
      ft_     <- parseFt
      fs_     <- integer
      time_   <- integer
      at_     <- parseAccessType
      mt_     <- parseAccessType
      ct_     <- parseAccessType
      crt_    <- parseAccessType
      return $ Event { fn     = fn_
                     , mftNum = mft_num
                     , ft     = ft_
                     , fs     = fs_
                     , time   = time_
                     , at     = at_
                     , mt     = mt_
                     , ct     = ct_
                     , crt    = crt_
                     }

More parsers will need to be defined for parseMftNum etc. The applicative style as used in RWH gets rid of the interim variables, which makes it much nicer for parsing than the monadic style; Parsec 3.0 and later can use the applicative style. How the input data separates fields will also need taking care of.
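For instance, a sketch of that applicative style, assuming a '|' field separator, one event per line, and invented helpers (field, integer, access) standing in for the parseFn/parseMftNum/parseAccessType parsers above:

    import Control.Applicative ((<$>), (<*>), (<*), (<$))
    import qualified Data.ByteString.Char8 as B
    import Text.Parsec
    import Text.Parsec.ByteString (Parser)

    -- The types from the original post.
    data AccessType = ATime | MTime | CTime | CrTime deriving (Show)

    data Event = Event
      { fn :: B.ByteString, mftNum :: B.ByteString, ft :: B.ByteString
      , fs :: Integer, time :: Integer
      , at :: AccessType, mt :: AccessType
      , ct :: AccessType, crt :: AccessType
      } deriving (Show)

    bar :: Parser Char
    bar = char '|'

    -- One raw column, packed into a ByteString.
    field :: Parser B.ByteString
    field = B.pack <$> many1 (noneOf "|\n")

    -- A decimal column.
    integer :: Parser Integer
    integer = read <$> many1 digit

    -- Consume a time column and tag it; how the columns really map to
    -- AccessType is a guess here.
    access :: AccessType -> Parser AccessType
    access t = t <$ many1 (noneOf "|\n")

    event :: Parser Event
    event = Event <$> field   <* bar
                  <*> field   <* bar
                  <*> field   <* bar
                  <*> integer <* bar
                  <*> integer <* bar
                  <*> access ATime  <* bar
                  <*> access MTime  <* bar
                  <*> access CTime  <* bar
                  <*> access CrTime <* newline

    parseEvents :: Parser [Event]
    parseEvents = many event

The separator handling is the fiddly part; keeping it next to each column parser (as above) leaves the event definition itself matching the shape of the record.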

This sounds an awful lot like premature optimisation, which, as we all know, is the root of all evil :-)
Why do you think that using Parsec will result in fewer transformations? (It will most likely result in fewer transformations *that you see*, but that doesn't mean much.)
I think you're right. I misunderstood the way the Parsec library worked, and was trying to run before I could walk. However, it is a priority that I make this tool as fast as possible (as close to drive speed as I can). I wanted to take an "incremental" approach to optimization: writing small pieces, optimizing them, and then putting them together. I am also facing faculty who are skeptical that I can make things "fast" in Haskell. (Currently, most DF applications are written in Python and C.)
http://therning.org/magnus/archives/289
http://therning.org/magnus/archives/290
http://therning.org/magnus/archives/295
http://therning.org/magnus/archives/296
/M
Thanks for the references. They're great blog posts, very easy to follow. I didn't realize how simple Parsec was to use, at least in the monadic form. For some reason, I thought it was more complex.

I do have one question, though. Do you always have to define your own Applicative instance for GenParser when trying to use the Applicative form? I noticed that both you and RWH defined your own when explaining this form of Parsec. Is there no standard Parsec.Applicative in the library?

Thanks again for your help,
Jimmy

On Mon, Oct 18, 2010 at 04:58:42PM -0500, Jimmy Wylie wrote:
I do have one question, though. Do you always have to define your own Applicative instance for GenParser when trying to use the Applicative form? I noticed that both you and RWH defined your own when explaining this form of Parsec. Is there no standard Parsec.Applicative in the library?
I believe Parsec exports an Applicative instance in version 3.0 and later, but previous versions of Parsec did not include it. After all, the first few versions of Parsec were written before Applicative was invented. =)

-Brent

Brent Yorgey wrote:
On Mon, Oct 18, 2010 at 04:58:42PM -0500, Jimmy Wylie wrote:
I do have one question, though. Do you always have to define your own Applicative instance for GenParser when trying to use the Applicative form? I noticed that both you and RWH defined your own when explaining this form of Parsec. Is there no standard Parsec.Applicative in the library?
I believe Parsec exports an Applicative instance in version 3.0 and later, but previous versions of Parsec did not include it. After all, the first few versions of Parsec were written before Applicative was invented. =)
That's why I have switched to Parsimony[1]. It's a simplified version of Parsec with a proper Applicative instance.

[1]: http://hackage.haskell.org/package/parsimony

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

On Mon, Oct 18, 2010 at 4:58 PM, Jimmy Wylie wrote:
This sounds an awful lot like premature optimisation, which, as we all know, is the root of all evil :-)
Why do you think that using Parsec will result in fewer transformations? (It will most likely result in fewer transformations *that you see*, but that doesn't mean much.)
I think you're right. I misunderstood the way the Parsec library worked, and was trying to run before I could walk. However, it is a priority that I make this tool as fast as possible (as close to drive speed as I can). I wanted to take an "incremental" approach to optimization: writing small pieces, optimizing them, and then putting them together. I am also facing faculty who are skeptical that I can make things "fast" in Haskell. (Currently, most DF applications are written in Python and C.)
If Parsec turns out not to be fast enough for you, the attoparsec[1] library has a very similar interface to Parsec but is expressly written for parsing data from ByteStrings. I don't know what its backtracking strategy is, however.

The cereal[2] library also supports the Alternative operations in Control.Applicative, and is written for decoding low-level binary streams, so it presumably also supports some form of backtracking. It doesn't ship with a "manyTill" function, which is what I would want to use for your data, but it doesn't look hard to write. I have no idea how well it would perform, though.

Let us know how Parsec works for you.

Antoine

[1] http://hackage.haskell.org/package/attoparsec
[2] http://hackage.haskell.org/packages/archive/cereal/0.3.0.0/doc/html/Data-Ser...
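P.S. For what it's worth, such a manyTill can be written against nothing but the Alternative interface. A rough sketch (illustration only, not cereal's actual API, and the primed name is made up):

    import Control.Applicative (Alternative, (<|>), (<$>), (<*>), (<$))

    -- Collect results of p until end succeeds. The (<|>) fallback is
    -- where the underlying parser's backtracking behaviour matters: a
    -- failed 'end' must not consume input for 'p' to get its turn.
    manyTill' :: Alternative f => f a -> f b -> f [a]
    manyTill' p end = go
      where
        go = ([] <$ end) <|> ((:) <$> p <*> go)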

On 19 October 2010 03:28, Antoine Latter wrote:
The cereal[2] library also supports the Alternative operations in Control.Applicative, and is written for decoding low-level binary streams, so it presumably also supports some form of backtracking. It doesn't ship with a "manyTill" function, which is what I would want to use for your data, but it doesn't look hard to write. I have no idea how well it would perform, though.
Well-thought-out "serialization" formats shouldn't need a backtracking parser, as backtracking introduces a substantial performance penalty. Where binary data formats have alternatives, the choice to take is typically flagged with a tag byte first. I haven't looked much at recent textual serialization formats like JSON; if they need backtracking, it would be my opinion that they're not very well designed. If you want speed when deserializing with a combinator parser, avoiding backtracking choice (i.e. Alternative) is the way to go.
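To illustrate the tag-byte pattern, here is a sketch using cereal's Get monad; the two-variant format is invented for the example:

    import Control.Applicative ((<$>), (<*>))
    import Data.Serialize.Get (Get, getWord8, getWord32be)
    import Data.Word (Word32)

    -- A made-up format: tag byte 0 is followed by one 32-bit word,
    -- tag byte 1 by two of them.
    data Value = Single Word32 | Pair Word32 Word32
      deriving (Show)

    -- Reading the tag first means the decoder always knows which
    -- branch it is in, so no Alternative/backtracking is needed.
    getValue :: Get Value
    getValue = do
      tag <- getWord8
      case tag of
        0 -> Single <$> getWord32be
        1 -> Pair   <$> getWord32be <*> getWord32be
        _ -> fail ("unknown tag: " ++ show tag)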

On 18/10/10 22:58, Jimmy Wylie wrote:
This sounds an awful lot like premature optimisation, which, as we all know, is the root of all evil :-)
Why do you think that using Parsec will result in fewer transformations? (It will most likely result in fewer transformations *that you see*, but that doesn't mean much.)
I think you're right. I misunderstood the way the Parsec library worked, and was trying to run before I could walk. However, it is a priority that I make this tool as fast as possible (as close to drive speed as I can). I wanted to take an "incremental" approach to optimization: writing small pieces, optimizing them, and then putting them together. I am also facing faculty who are skeptical that I can make things "fast" in Haskell. (Currently, most DF applications are written in Python and C.)
Well, I wouldn't say Python is *fast*, but as they say, it's often *fast enough*. I don't see any reason for Haskell to do worse than that. Profile stuff: both the tools in C and Python that others have written, and your own code. My experience is that many people, if not most, are guided not by actual numbers but by some vague idea that low-level control somehow always results in faster code.
http://therning.org/magnus/archives/289
http://therning.org/magnus/archives/290
http://therning.org/magnus/archives/295
http://therning.org/magnus/archives/296
/M
Thanks for the references. They're great blog posts, very easy to follow. I didn't realize how simple Parsec was to use, at least in the monadic form. For some reason, I thought it was more complex.
Yeah, it's surprising isn't it? :-)
I do have one question, though. Do you always have to define your own Applicative instance for GenParser when trying to use the Applicative form? I noticed that both you and RWH defined your own when explaining this form of Parsec. Is there no standard Parsec.Applicative in the library?
Those blog posts were written using Parsec 2. As others have said, Parsec 3 is better equipped, apparently both with support for ByteString (which I didn't know) and an implementation of Applicative (which I did know).

/M

--
Magnus Therning (OpenPGP: 0xAB4DFBA4)
magnus@therning.org          Jabber: magnus@therning.org
http://therning.org/magnus   identi.ca|twitter: magthe
participants (6)
- Antoine Latter
- Brent Yorgey
- Heinrich Apfelmus
- Jimmy Wylie
- Magnus Therning
- Stephen Tetley