
On Sun, Oct 17, 2010 at 22:59, Jimmy Wylie wrote:
Hi everyone,
I'm working on a digital forensics application that will take a file with lines of the following format:
"MD5|name|inode|mode_as_string|UID|GID|size|atime|mtime|ctime|crtime"
This string represents the metadata associated with a particular file in the filesystem.
I created a data type to represent the information that I will need to perform my analysis:
    data Event = Event
      { fn     :: B.ByteString
      , mftNum :: B.ByteString
      , ft     :: B.ByteString
      , fs     :: Integer
      , time   :: Integer
      , at     :: AccessType
      , mt     :: AccessType
      , ct     :: AccessType
      , crt    :: AccessType
      } deriving (Show)

    data AccessType = ATime | MTime | CTime | CrTime deriving (Show)
I would like to create a function that takes the ByteString representing the file and returns a list of Events:

    createEvents :: ByteString -> [Event]

(For now I'm creating a list, but depending on the type of analysis I decide to do, I may change this data structure.)
I understand that I can use the Parsec library to do this. I read RWH and noticed they have the endBy and sepBy combinators, but my issue with these functions is that they perform too many transformations on the data. endBy will return a list of strings, which will then be used by sepBy, which will then return a [[ByteString]] that I will have to iterate through to create the final [Event].
What I would like to do is define a custom parser that will go from the ByteString to the [Event] without the overhead of those intermediate steps. This function needs to be as fast as possible, as these files can be rather large, and I will be performing many different tests and analyses on the data. I don't want the parsing to be a bottleneck.
This sounds an awful lot like premature optimisation, which, as we all know, is the root of all evil :-) Why do you think that using Parsec will result in fewer transformations? (It will most likely result in fewer transformations *that you see*, but that doesn't mean much.)
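One way to find out whether parsing really is the bottleneck is to write the plain version first and profile it. The following is only a minimal sketch using nothing but the bytestring package. The Entry record and its field names are hypothetical (they are not the Event type from the question), and it assumes that UID, GID, size and the four timestamps are plain integers and that no file name contains a '|'.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Maybe (mapMaybe)

-- Hypothetical record mirroring the eleven pipe-separated fields.
data Entry = Entry
  { eMd5    :: B.ByteString
  , eName   :: B.ByteString
  , eInode  :: B.ByteString
  , eMode   :: B.ByteString
  , eUid    :: Integer
  , eGid    :: Integer
  , eSize   :: Integer
  , eATime  :: Integer
  , eMTime  :: Integer
  , eCTime  :: Integer
  , eCrTime :: Integer
  } deriving (Show, Eq)

-- Read a whole field as an Integer; reject trailing garbage.
readInt :: B.ByteString -> Maybe Integer
readInt s = case B.readInteger s of
  Just (n, rest) | B.null rest -> Just n
  _                            -> Nothing

-- Parse one line; Nothing if the field count or a number is wrong.
parseLine :: B.ByteString -> Maybe Entry
parseLine l = case B.split '|' l of
  [md5, name, inode, mode, uid, gid, size, at, mt, ct, crt] ->
    Entry md5 name inode mode
      <$> readInt uid <*> readInt gid <*> readInt size
      <*> readInt at  <*> readInt mt  <*> readInt ct <*> readInt crt
  _ -> Nothing

-- Split the input into lines and keep only the lines that parse.
parseEntries :: B.ByteString -> [Entry]
parseEntries = mapMaybe parseLine . B.lines
```

Note that this version silently drops malformed lines; returning an Either with the offending line number would be a reasonable alternative for forensic work.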
I'm under the impression that the Parsec library will allow me to define a custom parser to do this, but I'm having problems understanding the library, and the documentation for it.
A gentle shove in the right direction would be greatly appreciated.
AFAIK Parsec deals with String, not ByteString; have a look at the attoparsec library [1] instead.

There are numerous explanations of using parser combinators out there. Personally, I've found the Parsec documentation fairly easy to understand. A while ago I wrote a few posts on it myself, and I think they should translate well to attoparsec (you will probably have to keep the Haddock docs at hand, though):

http://therning.org/magnus/archives/289
http://therning.org/magnus/archives/290
http://therning.org/magnus/archives/295
http://therning.org/magnus/archives/296

/M

[1]: http://hackage.haskell.org/package/attoparsec-0.8.1.1

-- 
Magnus Therning (OpenPGP: 0xAB4DFBA4)
magnus@therning.org Jabber: magnus@therning.org
http://therning.org/magnus identi.ca|twitter: magthe
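To illustrate the attoparsec suggestion, here is a rough sketch of a parser that builds a record directly from the ByteString. It is not taken from the posts linked above. The Entry record and all parser names are hypothetical, the numeric fields are assumed to be non-negative integers, and the module name assumes a recent attoparsec release (older releases, including the 0.8 series linked here, may organise their modules differently).

```haskell
import Control.Applicative ((<|>))
import qualified Data.ByteString.Char8 as B
import Data.Attoparsec.ByteString.Char8
  (Parser, char, decimal, endOfInput, endOfLine, many', parseOnly, takeTill)

-- Hypothetical record mirroring the eleven pipe-separated fields.
data Entry = Entry
  { eMd5    :: B.ByteString
  , eName   :: B.ByteString
  , eInode  :: B.ByteString
  , eMode   :: B.ByteString
  , eUid    :: Integer
  , eGid    :: Integer
  , eSize   :: Integer
  , eATime  :: Integer
  , eMTime  :: Integer
  , eCTime  :: Integer
  , eCrTime :: Integer
  } deriving (Show, Eq)

-- A text field: everything up to the next '|', which is then consumed.
field :: Parser B.ByteString
field = takeTill (== '|') <* char '|'

-- A numeric field followed by its '|' separator.
number :: Parser Integer
number = decimal <* char '|'

-- One line of input, without its line terminator.
entry :: Parser Entry
entry = Entry <$> field  <*> field  <*> field  <*> field
              <*> number <*> number <*> number
              <*> number <*> number <*> number
              <*> decimal            -- last field has no trailing '|'

-- Zero or more entries, each ended by a newline or by the end of input.
entries :: Parser [Entry]
entries = many' (entry <* (endOfLine <|> endOfInput)) <* endOfInput

parseEntries :: B.ByteString -> Either String [Entry]
parseEntries = parseOnly entries
```

Unlike a version that filters out bad lines, this one fails with a Left as soon as any line does not match the expected format. Whether it is actually faster than splitting on '|' by hand is something only profiling on real data can tell.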