Re: [Haskell-beginners] Defining custom parser using Parsec

18 Oct 2010

On Sun, Oct 17, 2010 at 22:59, Jimmy Wylie wrote:
...
 Hi everyone,
I'm working on a digital forensics application that will take a file with
lines of the following format:
"MD5|name|inode|mode_as_string|UID|GID|size|atime|mtime|ctime|crtime"
This string represents the metadata associated with a particular file in the
filesystem.
I created a data type to represent the information that I will need to
perform my analysis:
data Event = Event {
    fn     :: B.ByteString,
    mftNum :: B.ByteString,
    ft     :: B.ByteString,
    fs     :: Integer,
    time   :: Integer,
    at     :: AccessType,
    mt     :: AccessType,
    ct     :: AccessType,
    crt    :: AccessType
    } deriving (Show)
data AccessType = ATime | MTime | CTime | CrTime
                 deriving (Show)
I would like to create a function that takes the ByteString representing the
file and returns a list of Events:
createEvents :: B.ByteString -> [Event]
(For now I'm creating a list, but depending on the type of analysis I decide
to do, I may change this data structure)
I understand that I can use the Parsec Library to do this.  I read RWH, and
noticed they have the endBy and sepBy combinators, but my issue with these
is that using these functions performs too many transformations on the data.
endBy will return a list of strings, which then will be used by sepBy which
will then return a [[ByteString]] which I will then have to iterate through
to create the final [Event].
What I would like to do is define a custom parser, that will go from the
ByteString to the [Event] without the overhead of those intermediate steps.
This function needs to be as fast as possible, as these files can be rather
large, and I will be performing many different tests and analyses on the
data.  I don't want the parsing to be a bottleneck.

This sounds an awful lot like a premature optimisation, which as we all
know, is the root of all evil :-)

Why do you think that using Parsec will result in fewer
transformations?  (It will most likely result in fewer transformations
*that you see*, but that doesn't mean much.)
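
To make the comparison concrete, here is a minimal sketch of the
"intermediate list" approach using nothing but the bytestring package.
SimpleEntry, toEntry and parseEntries are hypothetical names made up for
this illustration, and only two of the eleven fields are kept.  It
assumes the size field is a plain decimal number and silently drops any
line that does not match; it is a baseline to profile against, not a
recommendation.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Maybe (mapMaybe)

-- Hypothetical, cut-down record: just the name and the size.
data SimpleEntry = SimpleEntry
  { seName :: B.ByteString
  , seSize :: Integer
  } deriving (Show, Eq)

-- Turn the eleven '|'-separated fields of one line into a record.
-- Returns Nothing when the field count is wrong or the size is not
-- entirely numeric.
toEntry :: [B.ByteString] -> Maybe SimpleEntry
toEntry [_md5, name, _inode, _mode, _uid, _gid, size, _atime, _mtime, _ctime, _crtime] = do
  (n, rest) <- B.readInteger size
  if B.null rest then Just (SimpleEntry name n) else Nothing
toEntry _ = Nothing

-- Split into lines, split each line on '|', keep the lines that convert.
parseEntries :: B.ByteString -> [SimpleEntry]
parseEntries = mapMaybe (toEntry . B.split '|') . B.lines
```

Because lists are lazy, the [[ByteString]] need not exist in memory all
at once, so it is worth measuring this before assuming it is too slow.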
...
I'm under the impression that the Parsec library will allow me to define a
custom parser to do this, but I'm having problems understanding the library,
and the documentation for it.
A gentle shove in the right direction would be greatly appreciated.

AFAIK Parsec deals with String, not ByteString; have a look at the
attoparsec library[1] instead.
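
As a rough sketch of what that could look like: the parser below reads
one record per line straight from a ByteString.  Entry, field, entry,
entries and createEntries are hypothetical names made up for this
illustration, and only three of the eleven fields are kept.  The module
name and parseOnly are taken from current attoparsec releases; older
versions such as the 0.8 series linked below may name things
differently.  It also assumes that every line ends in a newline and that
the size field is a plain decimal number.

```haskell
import Control.Applicative (many)
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as B

-- Hypothetical, cut-down record: MD5, name and size only.
data Entry = Entry
  { entryMd5  :: B.ByteString
  , entryName :: B.ByteString
  , entrySize :: Integer
  } deriving (Show, Eq)

-- One field: everything up to the next '|' or end of line (may be empty).
field :: A.Parser B.ByteString
field = A.takeWhile (\c -> c /= '|' && c /= '\n' && c /= '\r')

-- One line: MD5|name|inode|mode|UID|GID|size|atime|mtime|ctime|crtime
entry :: A.Parser Entry
entry = do
  md5  <- field <* A.char '|'
  name <- field <* A.char '|'
  _    <- A.count 4 (field <* A.char '|')   -- skip inode, mode, UID, GID
  size <- A.decimal <* A.char '|'
  _    <- A.count 3 (field <* A.char '|')   -- skip atime, mtime, ctime
  _    <- field                             -- skip crtime
  return (Entry md5 name size)

-- Zero or more newline-terminated records.
entries :: A.Parser [Entry]
entries = many (entry <* A.endOfLine)

-- Parse a whole file; Left carries attoparsec's error message.
createEntries :: B.ByteString -> Either String [Entry]
createEntries = A.parseOnly (entries <* A.endOfInput)
```

Each record is built directly from the input, and the result is an
Either, so a malformed line shows up as a Left instead of being dropped
silently.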

There are numerous explanations of using parser combinators out there.
Personally, I've found the Parsec documentation fairly easy to
understand.  A while ago I wrote a few posts on it myself, and I think
they should translate well to attoparsec (you will probably have to
keep the Haddock docs at hand, though):

http://therning.org/magnus/archives/289
http://therning.org/magnus/archives/290
http://therning.org/magnus/archives/295
http://therning.org/magnus/archives/296

/M

[1]: http://hackage.haskell.org/package/attoparsec-0.8.1.1

-- 
Magnus Therning                        (OpenPGP: 0xAB4DFBA4)
magnus＠therning．org          Jabber: magnus＠therning．org
http://therning.org/magnus         identi.ca|twitter: magthe