parsing machine-generated natural text

20 May 2006

For a toy project I want to parse the output of a program.  The
program runs on someone else's machine and mails me the results, so I
only have access to the output it generates.

Unfortunately, the output is intended to be human-readable, and this
makes parsing it a bit of a pain.  Here are some sample lines from its
output:

France: Army Marseilles SUPPORT Army Paris -> Burgundy.
Russia: Fleet St Petersburg (south coast) -> Gulf of Bothnia.
England:     4 Supply centers,  3 Units:  Builds   1 unit.
The next phase of 'dip' will be Movement for Fall of 1901.

I've been using Parsec and it's felt rather complicated.  For example,
a "location" is a series of words and possibly parentheses, except if
the word is SUPPORT.  And that "Supply centers" line ends up being
code filled with stuff like "char ':'; skipMany space".
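
One common way to cut down on the repeated "skipMany space" calls is a
small "lexeme" helper that eats trailing spaces after every token.  The
following is only a sketch under assumptions: the names lexeme, symbol,
number, Summary and summaryLine are made up for illustration, and it
assumes the country name is a single word and the line has exactly the
shape of the sample above.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Run a parser, then swallow any trailing spaces (the "lexeme" idiom).
lexeme :: Parser a -> Parser a
lexeme p = p <* skipMany (char ' ')

-- Match a literal piece of text, then skip the spaces after it.
symbol :: String -> Parser String
symbol = lexeme . try . string

-- A non-negative integer, followed by optional spaces.
number :: Parser Int
number = lexeme (read <$> many1 digit)

-- Hypothetical record for one summary line (not from the original post).
data Summary = Summary
  { country :: String
  , centers :: Int
  , units   :: Int
  } deriving (Show, Eq)

-- Parses the start of a line such as
--   "England:     4 Supply centers,  3 Units:  Builds   1 unit."
-- Everything after "Units" is left unconsumed.
summaryLine :: Parser Summary
summaryLine = do
  c <- many1 letter <* symbol ":"
  n <- number <* symbol "Supply centers,"
  u <- number <* symbol "Units"
  return (Summary c n u)
```

With these helpers, the whitespace handling lives in one place, and
running "parse summaryLine" on the sample England line should give a
Summary with 4 centers and 3 units.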

I actually have a separate parser that's Javascript with a bunch of
regular expressions and it's far shorter than my Haskell one, which
makes sense as munging this sort of text feels to me more like a
regexp job than a careful parsing job.
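
The same "munging" style can be approximated in plain Haskell without a
regex library, by splitting a line into words and cutting it at known
tokens.  This is a rough sketch, not a translation of the Javascript
parser: the Move type and parseMove function are hypothetical, it uses
words and break in place of regular expressions, and it only handles
simple move orders (lines containing SUPPORT are rejected).

```haskell
-- Hypothetical representation of a simple move order.
data Move = Move
  { power    :: String   -- e.g. "Russia"
  , unitKind :: String   -- e.g. "Army" or "Fleet"
  , from     :: String   -- may contain spaces and parentheses
  , to       :: String
  } deriving (Show, Eq)

-- Split the line into words, then cut the remainder at the "->" token.
-- Returns Nothing for anything that is not a plain move order.
parseMove :: String -> Maybe Move
parseMove line =
  case words line of
    (p : k : rest)
      | Just name <- stripLast ':' p
      , (src@(_:_), "->" : dst@(_:_)) <- break (== "->") rest
      , "SUPPORT" `notElem` src
      -> Just (Move name k (unwords src) (dropDot (unwords dst)))
    _ -> Nothing
  where
    -- Remove a required trailing character, e.g. the ':' after the power.
    stripLast c s
      | not (null s) && last s == c = Just (init s)
      | otherwise                   = Nothing
    -- Remove the optional full stop that ends each order.
    dropDot s
      | not (null s) && last s == '.' = init s
      | otherwise                     = s
```

For the Russia sample line this yields a move from "St Petersburg
(south coast)" to "Gulf of Bothnia"; the France SUPPORT line gives
Nothing.  Note that unwords collapses runs of spaces inside a location
into single spaces.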

I'm considering writing a preprocessing stage in Ruby or Perl that
munges those output lines into something a bit more
"machine-readable", but before I did that I thought I'd ask here if
anyone had any pointers, hints, or better ideas.

Evan Martin
