
Hi all,
This e-mail may be a bit off topic. My question is more about methods and algorithms than Haskell. I'm looking for links to methods or tools for parsing unstructured data.
I'm currently working on data cleaning of a Customer Addresses database. Addresses are stored as 3 lines of text without any structure, and people used lots of imagination to create the data (20 years of data using no rules at all). Postal code, street, city, state, region, country and other details such as suite, building, dock, doors, PO box, etc. are all stored in free form in those 3 lines.
I already wrote a Haskell program to do the job. It correctly parses about 2/3 of the addresses and parses much of the rest, but with unrecognized parts left. The current program works by trying to recognize words used to tag parts, like STE, SUITE, BLDG, street words (STR, AVE, CIRCLE, etc.) and countries from a list (including typos). It uses regular expressions to recognize variations of those words, lookup tables for countries and street words, regular expression rules for postal codes, etc. The most difficult task is splitting the address into parts. There is no clearly defined separator for the fields. It can be a dot, space, comma, dash, slash, or anything you can imagine using as a separator, and this separator can of course also be found inside an address part.
In the current application when part of an address is recognized it will not be parsed again by the other rules. A system trying all rules and tagging them with probabilities would probably give better results.
Any link to tools or methods that could help me in that task would be greatly appreciated. I already searched for fuzzy, probabilistic or statistical parsing but without much success.
Thanks,
Olivier.
PS: just in case someone's interested, I attached the code and partial data to this e-mail.
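To make that description concrete, here is a minimal Haskell sketch (base library only) of the general approach: split on whatever separators show up, then tag each token from small lookup tables plus a crude shape rule for postal codes. The tag names, word lists and the 5-digit postal-code check are invented for illustration; this is not the attached program, which relies on regular expressions and much larger tables.

-- A minimal sketch of separator splitting and table-driven tagging.
-- The token categories and word lists below are made up for illustration.
import Data.Char (isDigit, toUpper)
import Data.List (unfoldr)

data Tag = Country | StreetWord | UnitWord | PostalCode | Unknown
  deriving (Show, Eq)

-- Any of the "separators" people actually used.
separators :: String
separators = " .,-/"

-- Split a line on the separator characters, dropping empty pieces.
tokens :: String -> [String]
tokens = unfoldr next . dropWhile (`elem` separators)
  where
    next [] = Nothing
    next s  = let (tok, rest) = break (`elem` separators) s
              in Just (tok, dropWhile (`elem` separators) rest)

countries, streetWords, unitWords :: [String]
countries   = ["USA", "CANADA", "MEXICO", "GERMANY"]   -- the real tables also hold typos
streetWords = ["ST", "STR", "AVE", "CIRCLE", "BLVD"]
unitWords   = ["STE", "SUITE", "BLDG", "APT", "PO"]

-- Tag a token from the lookup tables; anything unmatched stays Unknown.
tagToken :: String -> (String, Tag)
tagToken t
  | u `elem` countries            = (t, Country)
  | u `elem` streetWords          = (t, StreetWord)
  | u `elem` unitWords            = (t, UnitWord)
  | all isDigit t, length t == 5  = (t, PostalCode)  -- US-style ZIP only, as an example
  | otherwise                     = (t, Unknown)
  where u = map toUpper t

main :: IO ()
main = mapM_ print (map tagToken (tokens "123 MAIN STR. STE 4B / SPRINGFIELD,USA 62704"))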

On Wed, 2007-11-28 at 12:58 -0500, Olivier Boudry wrote:
Any link to tools or methods that could help me in that task would be greatly appreciated. I already searched for fuzzy, probabilistic or statistical parsing but without much success.
Have you looked at the Java Rule Engine (I believe JSR 94) and in particular Jess? http://herzberg.ca.sandia.gov/
I have no experience with it myself, though, just heard of it.
Regards,
Hans van Thiel

On 11/28/07, Hans van Thiel wrote:
Have you looked at the Java Rule Engine (I believe JSR 94) and in particular Jess? http://herzberg.ca.sandia.gov/
I have no experience with it myself, though, just heard of it.
Regards,
Hans van Thiel
Hi Hans,
Never heard of that tool but it looks interesting. I will definitely give it a try.
Thanks for the link,
Olivier.

Olivier Boudry wrote:
In the current application when part of an address is recognized it will not be parsed again by the other rules. A system trying all rules and tagging them with probabilities would probably give better results.
Any link to tools or methods that could help me in that task would be greatly appreciated. I already searched for fuzzy, probabilistic or statistical parsing but without much success.
You may have better luck checking out methods used in parsing natural language. In order to use statistical parsing techniques such as Probabilistic Context Free Grammars ([1], [2]), the standard approach is to extract rule probabilities from an annotated corpus, that is, a collection of strings with associated parse trees. Maybe you could use your 2/3 of addresses that you know are correctly parsed as your training material.
A PCFG parser can output all (or n-best) parses ordered according to probabilities, so that would seem to fit your requirements.
[1] http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
[2] http://www.cs.colorado.edu/~martin/slp2.html#Chapter14
Best,
Grzegorz
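To give a concrete flavor of the training step described above, here is a toy Haskell sketch that extracts rules from a couple of invented, hand-annotated address trees and turns the counts into relative-frequency probabilities. The categories and example trees are made up, and the actual parsing step (e.g. CKY over the estimated rules) is not shown.

-- A toy sketch of estimating rule probabilities from an annotated corpus.
-- Categories, trees and the flat treatment of terminals are all simplifications.
import qualified Data.Map as Map
import Data.Map (Map)

data Tree = Node String [Tree]   -- non-terminal with children
          | Leaf String          -- terminal (a token from the address)

-- A rule is a non-terminal rewriting to a sequence of symbols.
type Rule = (String, [String])

symbol :: Tree -> String
symbol (Node cat _) = cat
symbol (Leaf tok)   = tok

-- Collect every rule used in a tree, including lexical rules like ("ZIP", ["62704"]).
rules :: Tree -> [Rule]
rules (Leaf _)            = []
rules (Node cat children) =
  (cat, map symbol children) : concatMap rules children

-- Relative-frequency estimate: P(rule) = count(rule) / count(left-hand side).
ruleProbs :: [Tree] -> Map Rule Double
ruleProbs corpus = Map.mapWithKey prob ruleCounts
  where
    allRules   = concatMap rules corpus
    ruleCounts = Map.fromListWith (+) [ (r, 1 :: Double) | r <- allRules ]
    lhsCounts  = Map.fromListWith (+) [ (lhs, 1 :: Double) | (lhs, _) <- allRules ]
    prob (lhs, _) n = n / Map.findWithDefault 1 lhs lhsCounts

-- Two hand-annotated addresses (completely made up).
corpusExample :: [Tree]
corpusExample =
  [ Node "ADDR" [ Node "STREET" [Leaf "123", Leaf "MAIN", Leaf "ST"]
                , Node "CITY"   [Leaf "SPRINGFIELD"]
                , Node "ZIP"    [Leaf "62704"] ]
  , Node "ADDR" [ Node "STREET" [Leaf "9", Leaf "ELM", Leaf "AVE"]
                , Node "SUITE"  [Leaf "STE", Leaf "4B"]
                , Node "CITY"   [Leaf "PORTLAND"]
                , Node "ZIP"    [Leaf "97201"] ]
  ]

main :: IO ()
main = mapM_ print (Map.toList (ruleProbs corpusExample))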

On 11/28/07, Grzegorz Chrupala wrote:
You may have better luck checking out methods used in parsing natural language. In order to use statistical parsing techniques such as Probabilistic Context Free Grammars ([1], [2]), the standard approach is to extract rule probabilities from an annotated corpus, that is, a collection of strings with associated parse trees. Maybe you could use your 2/3 of addresses that you know are correctly parsed as your training material.
A PCFG parser can output all (or n-best) parses ordered according to probabilities, so that would seem to fit your requirements.
Hi Grzegorz,
Wow, Natural Language Processing looks quite complex! But it also seems to be closely related to my problem. If someone finds an "NLP for dummies" article or book, I'm interested. ;-)
Thanks for your help,
Olivier.

Olivier Boudry wrote:
Wow, Natural Language Processing looks quite complex! But it also seems to be closely related to my problem. If someone finds an "NLP for dummies" article or book, I'm interested. ;-)
Especially in the fuzzy cases like this one, NLP often turns to machine learning models. One could try to train a hidden Markov model or support vector machines to label parts of the string as "name", "street", "number", "city", etc. These techniques work very well for part-of-speech tagging in natural language, and this seems similar. However, you need a manually annotated set of examples to train the models.
If you really have a big load of data and it seems like a good solution, you could use an off-the-shelf part-of-speech tagger like SVMTool (http://www.lsi.upc.edu/~nlp/SVMTool/) to do it.
Reinier
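As a rough illustration of the hidden Markov model idea, here is a small Haskell sketch of Viterbi decoding over a handful of address labels. The transition and emission probabilities are invented; in a real system both would be estimated from the manually annotated examples mentioned above, and unknown words would need proper smoothing, so this is a sketch of the technique rather than a usable tagger.

-- A toy HMM tagger: Viterbi decoding with hand-picked probabilities.
import Data.List (maximumBy)
import Data.Ord (comparing)
import qualified Data.Map as Map

type Label = String
type Token = String

labels :: [Label]
labels = ["NUMBER", "STREET", "CITY", "ZIP"]

-- P(label_i | label_{i-1}); "START" is a dummy initial state.
-- Unlisted transitions get a small default probability.
trans :: Label -> Label -> Double
trans prev next = Map.findWithDefault 0.01 (prev, next) table
  where table = Map.fromList
          [ (("START","NUMBER"), 0.7), (("START","STREET"), 0.2)
          , (("NUMBER","STREET"), 0.9)
          , (("STREET","STREET"), 0.5), (("STREET","CITY"), 0.4)
          , (("CITY","ZIP"), 0.8) ]

-- P(token | label): a crude emission model based on the token's shape.
emit :: Label -> Token -> Double
emit "NUMBER" t | all (`elem` "0123456789") t, length t <= 4 = 0.8
emit "ZIP"    t | all (`elem` "0123456789") t, length t == 5 = 0.9
emit "STREET" t | t `elem` ["ST", "AVE", "MAIN", "ELM"]      = 0.6
emit "CITY"   t | t `elem` ["SPRINGFIELD", "PORTLAND"]       = 0.7
emit _        _                                              = 0.05

-- Viterbi decoding: keep, for each label, the best path ending in that label.
viterbi :: [Token] -> [Label]
viterbi toks =
  drop 1 . reverse . snd . maximumBy (comparing fst) $ foldl step start toks
  where
    start = [ (1.0, ["START"]) ]
    step paths tok =
      [ maximumBy (comparing fst)
          [ (p * trans prev l * emit l tok, l : path)
          | (p, path@(prev:_)) <- paths ]
      | l <- labels ]

main :: IO ()
main = print (viterbi ["123", "MAIN", "ST", "SPRINGFIELD", "62704"])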

On Nov 29, 2007 5:31 AM, Reinier Lamers wrote:
If you really have a big load of data and it seems like a good solution, you could use an off-the-shelf part-of-speech tagger like SVMTool (http://www.lsi.upc.edu/~nlp/SVMTool/) to do it.
Reinier
Hi Reinier,
Thanks for the link to SVMTool. I don't have the background to understand most of the NLP articles I found and get stuck on the first bits of NLP jargon. For me, using an existing tool will be easier than building a new one.
I'm currently looking at the tool's documentation and it looks quite promising. It seems to be very generic and highly reusable.
Cheers,
Olivier.
participants (4):
- Grzegorz Chrupala
- Hans van Thiel
- Olivier Boudry
- Reinier Lamers