parsing machine-generated natural text

For a toy project I want to parse the output of a program. The program runs on someone else's machine and mails me the results, so I only have access to the output it generates, Unfortunately, the output is intended to be human-readable, and this makes parsing it a bit of a pain. Here are some sample lines from its output: France: Army Marseilles SUPPORT Army Paris -> Burgundy. Russia: Fleet St Petersburg (south coast) -> Gulf of Bothnia. England: 4 Supply centers, 3 Units: Builds 1 unit. The next phase of 'dip' will be Movement for Fall of 1901. I've been using Parsec and it's felt rather complicated. For example, a "location" is a series of words and possibly parenthesis, except if the word is SUPPORT. And that "Supply centers" line ends up being code filled with stuff lie "char ':'; skipMany space". I actually have a separate parser that's Javascript with a bunch of regular expressions and it's far shorter than my Haskell one, which makes sense as munging this sort of text feels to me more like a regexp job than a careful parsing job. I'm considering writing a preprocessing stage in Ruby or Perl that munges those output lines into something a bit more "machine-readable", but before I did that I thought I'd ask here if anyone had any pointers, hints, or better ideas.

Hello Evan, Saturday, May 20, 2006, 5:35:15 AM, you wrote:
France: Army Marseilles SUPPORT Army Paris -> Burgundy. Russia: Fleet St Petersburg (south coast) -> Gulf of Bothnia. England: 4 Supply centers, 3 Units: Builds 1 unit. The next phase of 'dip' will be Movement for Fall of 1901.
I've been using Parsec and it's felt rather complicated. For example,
i have an experience of parsing such human-readable, imprecise texts and should say that regexps was developed just to do such jibs. ghc and hugs already contains regex library in module Text.Regex.Posix (it's available on all systems, including Windows). this lib has rather dumb interface, i recommend you to install JRegex lib by Johc Meacham that supports familiar =~ operators. there is also Text.Regex.Lazy module: Text.Regex.Lazy (0.33). Chris Kuklewicz [6]announced the release of [7]Text.Regex.Lazy. This is an alternative to Text.Regex along with some enhancements. GHC's Text.Regex marshals the data back and forth to C arrays, to call libc. This is far too slow (and strict). This module understands regular expression Strings via a Parsec parser and creates an internal data structure (Text.Regex.Lazy.Pattern). This is then transformed into a Parsec parser to process the input String, or into a DFA table for matching against the input String or FastPackedString. The input string is consumed lazily, so it may be an arbitrarily long or infinite source. 6. http://article.gmane.org/gmane.comp.lang.haskell.libraries/4464 7. http://sourceforge.net/projects/lazy-regex -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Evan Martin wrote:
Unfortunately, the output is intended to be human-readable, and this makes parsing it a bit of a pain. Here are some sample lines from its output:
France: Army Marseilles SUPPORT Army Paris -> Burgundy. Russia: Fleet St Petersburg (south coast) -> Gulf of Bothnia. England: 4 Supply centers, 3 Units: Builds 1 unit. The next phase of 'dip' will be Movement for Fall of 1901.
What's the difficulty? "SUPPORT" and "CONVOY" are simply keywords, as are "Army" and "Fleet", only other words are identifiers for locations. Parsec supports this out of the box; have a look at the Language and Token modules . Note that CONVOY orders can get complex, so a true parser is probably the right tool.
And that "Supply centers" line ends up being code filled with stuff lie "char ':'; skipMany space".
do power colon integer reserved "Supply centers," integer reserved "Units:" ((reserved "Builds" >> return id) <|> (reserved "Disbands" >> return negate)) `ap` integer reserved "units." <|> reserved "unit." Come on, it isn't nearly as bad as you make it sound. Use the combinators, they are far more powerful than ugly never-quite-correct regexes. Oh, and drop me a line when your Diplomacy bot is finished. Udo. -- Jeder echte Wettbewerb ist ruinös. Darum beruht jede funktionierende Wirtschaft auf Schiebung.

On 5/21/06, Udo Stenzel
do power colon integer reserved "Supply centers," integer reserved "Units:" ((reserved "Builds" >> return id) <|> (reserved "Disbands" >> return negate)) `ap` integer reserved "units." <|> reserved "unit."
Come on, it isn't nearly as bad as you make it sound. Use the combinators, they are far more powerful than ugly never-quite-correct regexes.
Thanks! I had looked at using the lexeme parser before but it didn't seem like you can make newlines significant. Here's the beginning of the file, where it's not obvious to me how to distinguish elements in the "::" section from the rest of the file. :: Judge: USDP Game: dip Variant: standard :: Deadline: F1901M Mon 20 Feb 2006 20:00 PST :: URL: http://www.diplom.org/dpjudge?game=dip Movement results for Fall of 1901. (dip.F1901M) I guess I could make "Movement" a reserved word?
Oh, and drop me a line when your Diplomacy bot is finished.
:) It's actually just for rendering nicer maps of the game state. http://neugierig.org/software/hsdip/mapview.html (It's draggable, too.) I was trying to do it with Firefox's SVG+XUL but it's terribly slow, XUL isn't quite there yet, and doing a large app with JavaScript is painful. http://neugierig.org/software/darcs/xuldip/dip.xul (no install necessary; only works in Firefox)

Evan Martin wrote:
Here's the beginning of the file, where it's not obvious to me how to distinguish elements in the "::" section from the rest of the file. :: Judge: USDP Game: dip Variant: standard :: Deadline: F1901M Mon 20 Feb 2006 20:00 PST :: URL: http://www.diplom.org/dpjudge?game=dip
You could make "::" the start-of-one-line-comment sequence or you could just parse these three lines. You can also throw them away like this: string " ::" >> many (satisfy (/='\n')) >> newline
Movement results for Fall of 1901. (dip.F1901M) I guess I could make "Movement" a reserved word?
Or simply treat it as such. 'reserved "Movement"' expands to 'lexeme (string "Movement")', so unless "Movement" appears as keyword where some other identifier could also occur, you don't need to treat it specially.
It's actually just for rendering nicer maps of the game state. http://neugierig.org/software/hsdip/mapview.html (It's draggable, too.)
That's nice, too. There's not much Diplomacy software out there that runs under Linux, and most that does is painfully slow. Udo. -- "Hey, Perl-Scripte sind wie Klobürsten. Da ist man nicht stolz drauf! Und man gibt sie auch nicht weiter. Schon gar nicht in benutzter Form." -- F. v. Leitner

On 5/20/06, Udo Stenzel
do power colon integer reserved "Supply centers," integer reserved "Units:" ((reserved "Builds" >> return id) <|> (reserved "Disbands" >> return negate)) `ap` integer reserved "units." <|> reserved "unit."
I always struggle with when I need to use 'try' with parsec. I tend to over use it because I've had unexpected results when I leave it out. In your other email you say that 'reserved "Movement"' expands to 'lexeme (string "Movement")'. So I would think that, reserved "units." <|> reserved "unit." would need to be wrapped in a 'try'. Something like, try (reserved "units.") <|> reserved "unit." My understanding is that if 'unit.' appears in the input the first parser will parse up to the '.' and then fail and consume the input up to that point, leaving the alternative with only the period as input so it will also fail. So I'm wondering if someone could explain to me what is wrong with my understanding of parsec or point me to a resource that explains this (probably common) misunderstanding. Specifically, I would like to develop a better understanding of when 'try' is needed, and when it is not needed. Thanks, Jason

Jason Dagit wrote:
reserved "units." <|> reserved "unit."
I always struggle with when I need to use 'try' with parsec.
My understanding is that if 'unit.' appears in the input the first parser will parse up to the '.' and then fail and consume the input up to that point, leaving the alternative with only the period as input so it will also fail.
So I'm wondering if someone could explain to me what is wrong with my understanding of parsec
Nothing at all. 'reserved' actually contains the necessary 'try', a 'notFollowedBy' in order not to swallow parts of longer identifiers and a useful error message. I tend to forget 'try', too, or add it in odd places. Actually, if ReadP hat better error reporting, I'd recommend that over Parsec. Udo. -- Time is an illusion. Lunchtime doubly so. -- Douglas Adams

On May 19, 2006, at 6:35 PM, Evan Martin wrote:
For a toy project I want to parse the output of a program. The program runs on someone else's machine and mails me the results, so I only have access to the output it generates,
Unfortunately, the output is intended to be human-readable, and this makes parsing it a bit of a pain. Here are some sample lines from its output:
France: Army Marseilles SUPPORT Army Paris -> Burgundy. Russia: Fleet St Petersburg (south coast) -> Gulf of Bothnia. England: 4 Supply centers, 3 Units: Builds 1 unit. The next phase of 'dip' will be Movement for Fall of 1901.
I've been using Parsec and it's felt rather complicated. For example, a "location" is a series of words and possibly parenthesis, except if the word is SUPPORT. And that "Supply centers" line ends up being code filled with stuff lie "char ':'; skipMany space".
I actually have a separate parser that's Javascript with a bunch of regular expressions and it's far shorter than my Haskell one, which makes sense as munging this sort of text feels to me more like a regexp job than a careful parsing job.
I'm considering writing a preprocessing stage in Ruby or Perl that munges those output lines into something a bit more "machine-readable", but before I did that I thought I'd ask here if anyone had any pointers, hints, or better ideas.
Hi Evan, if the text you want to parse is actually similar to natural language (some posters have suggested that it is much simpler), you may want to have a look at grammar formalisms designed for natural languages. Grammatical Framework (GF) [1] is such a formalism, where the grammars are functional programs. The GF implementation is written in Haskell, and it has an interactive mode and a Haskell API. [Disclaimer: I participate in the development of GF] /Björn [1] http://www.cs.chalmers.se/~aarne/GF/
participants (5)
-
Bjorn Bringert
-
Bulat Ziganshin
-
Evan Martin
-
Jason Dagit
-
Udo Stenzel