
Hi all,
This e-mail may be a bit off topic. My question is more about methods and algorithms than Haskell. I'm looking for links to methods or tools for parsing unstructured data.
I'm currently working on data cleaning of a Customer Addresses database. Addresses are stored as 3 lines of text without any structure, and people used lots of imagination to create the data (20 years of data using no rules at all). Postal code, street, city, state, region, country and other details such as suite, building, dock, doors, PO box, etc. are all stored in free form in those 3 lines.
I already wrote a Haskell program to do the job. It correctly parses about 2/3 of the addresses and parses much of the rest, but with unrecognized parts left. The current program works by trying to recognize words used to tag parts, like STE, SUITE, BLDG, street words (STR, AVE, CIRCLE, etc.) and countries from a list (including typos). It uses regular expressions to recognize variations of those words, lookup tables for countries and street words, regular expression rules for postal codes, etc. The most difficult task is splitting the address into parts. There is no clearly defined separator for the fields. It can be a dot, space, comma, dash, slash, or anything you can imagine using as a separator, and this separator can of course also be found inside an address part.
In the current application when part of an address is recognized it will not be parsed again by the other rules. A system trying all rules and tagging them with probabilities would probably give better results.
Any link to tools or methods that could help me in that task would be greatly appreciated. I already searched for fuzzy, probabilistic or statistical parsing but without much success.
Thanks,
Olivier.
PS: just in case someone's interested, I attached the code and partial data to this e-mail.
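To make that description concrete, here is a minimal Haskell sketch (base library only) of the general approach: split on whatever separators show up, then tag each token from small lookup tables plus a crude shape rule for postal codes. The tag names, word lists and the 5-digit postal-code check are invented for illustration; this is not the attached program, which relies on regular expressions and much larger tables.

-- A minimal sketch of separator splitting and table-driven tagging.
-- The token categories and word lists below are made up for illustration.
import Data.Char (isDigit, toUpper)
import Data.List (unfoldr)

data Tag = Country | StreetWord | UnitWord | PostalCode | Unknown
  deriving (Show, Eq)

-- Any of the "separators" people actually used.
separators :: String
separators = " .,-/"

-- Split a line on the separator characters, dropping empty pieces.
tokens :: String -> [String]
tokens = unfoldr next . dropWhile (`elem` separators)
  where
    next [] = Nothing
    next s  = let (tok, rest) = break (`elem` separators) s
              in Just (tok, dropWhile (`elem` separators) rest)

countries, streetWords, unitWords :: [String]
countries   = ["USA", "CANADA", "MEXICO", "GERMANY"]   -- the real tables also hold typos
streetWords = ["ST", "STR", "AVE", "CIRCLE", "BLVD"]
unitWords   = ["STE", "SUITE", "BLDG", "APT", "PO"]

-- Tag a token from the lookup tables; anything unmatched stays Unknown.
tagToken :: String -> (String, Tag)
tagToken t
  | u `elem` countries            = (t, Country)
  | u `elem` streetWords          = (t, StreetWord)
  | u `elem` unitWords            = (t, UnitWord)
  | all isDigit t, length t == 5  = (t, PostalCode)  -- US-style ZIP only, as an example
  | otherwise                     = (t, Unknown)
  where u = map toUpper t

main :: IO ()
main = mapM_ print (map tagToken (tokens "123 MAIN STR. STE 4B / SPRINGFIELD,USA 62704"))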

On Wed, 2007-11-28 at 12:58 -0500, Olivier Boudry wrote:
Any link to tools or methods that could help me in that task would be greatly appreciated. I already searched for fuzzy, probabilistic or statistical parsing but without much success.
Have you looked at the Java Rule Engine (I believe JSR 94) and in particular Jess? http://herzberg.ca.sandia.gov/
I have no experience with it myself, though, just heard of it.
Regards,
Hans van Thiel

On 11/28/07, Hans van Thiel wrote:
Have you looked at the Java Rule Engine (I believe JSR 94) and in particular Jess? http://herzberg.ca.sandia.gov/
I have no experience with it myself, though, just heard of it.
Regards,
Hans van Thiel
Hi Hans,
Never heard of that tool but it looks interesting. I will definitely give it a try.
Thanks for the link,
Olivier.

Olivier Boudry wrote:
In the current application when part of an address is recognized it will not be parsed again by the other rules. A system trying all rules and tagging them with probabilities would probably give better results.
Any link to tools or methods that could help me in that task would be greatly appreciated. I already searched for fuzzy, probabilistic or statistical parsing but without much success.
You may have better luck checking out methods used in parsing natural language. In order to use statistical parsing techniques such as Probabilistic Context Free Grammars ([1], [2]), the standard approach is to extract rule probabilities from an annotated corpus, that is, a collection of strings with associated parse trees. Maybe you could use your 2/3 of addresses that you know are correctly parsed as your training material.
A PCFG parser can output all (or n-best) parses ordered according to probabilities, so that would seem to fit your requirements.
[1] http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
[2] http://www.cs.colorado.edu/~martin/slp2.html#Chapter14
Best,
Grzegorz
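To give a concrete flavor of the training step described above, here is a toy Haskell sketch that extracts rules from a couple of invented, hand-annotated address trees and turns the counts into relative-frequency probabilities. The categories and example trees are made up, and the actual parsing step (e.g. CKY over the estimated rules) is not shown.

-- A toy sketch of estimating rule probabilities from an annotated corpus.
-- Categories, trees and the flat treatment of terminals are all simplifications.
import qualified Data.Map as Map
import Data.Map (Map)

data Tree = Node String [Tree]   -- non-terminal with children
          | Leaf String          -- terminal (a token from the address)

-- A rule is a non-terminal rewriting to a sequence of symbols.
type Rule = (String, [String])

symbol :: Tree -> String
symbol (Node cat _) = cat
symbol (Leaf tok)   = tok

-- Collect every rule used in a tree, including lexical rules like ("ZIP", ["62704"]).
rules :: Tree -> [Rule]
rules (Leaf _)            = []
rules (Node cat children) =
  (cat, map symbol children) : concatMap rules children

-- Relative-frequency estimate: P(rule) = count(rule) / count(left-hand side).
ruleProbs :: [Tree] -> Map Rule Double
ruleProbs corpus = Map.mapWithKey prob ruleCounts
  where
    allRules   = concatMap rules corpus
    ruleCounts = Map.fromListWith (+) [ (r, 1 :: Double) | r <- allRules ]
    lhsCounts  = Map.fromListWith (+) [ (lhs, 1 :: Double) | (lhs, _) <- allRules ]
    prob (lhs, _) n = n / Map.findWithDefault 1 lhs lhsCounts

-- Two hand-annotated addresses (completely made up).
corpusExample :: [Tree]
corpusExample =
  [ Node "ADDR" [ Node "STREET" [Leaf "123", Leaf "MAIN", Leaf "ST"]
                , Node "CITY"   [Leaf "SPRINGFIELD"]
                , Node "ZIP"    [Leaf "62704"] ]
  , Node "ADDR" [ Node "STREET" [Leaf "9", Leaf "ELM", Leaf "AVE"]
                , Node "SUITE"  [Leaf "STE", Leaf "4B"]
                , Node "CITY"   [Leaf "PORTLAND"]
                , Node "ZIP"    [Leaf "97201"] ]
  ]

main :: IO ()
main = mapM_ print (Map.toList (ruleProbs corpusExample))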

On 11/28/07, Grzegorz Chrupala wrote:
You may have better luck checking out methods used in parsing natural language. In order to use statistical parsing techniques such as Probabilistic Context Free Grammars ([1], [2]), the standard approach is to extract rule probabilities from an annotated corpus, that is, a collection of strings with associated parse trees. Maybe you could use your 2/3 of addresses that you know are correctly parsed as your training material.
A PCFG parser can output all (or n-best) parses ordered according to probabilities, so that would seem to fit your requirements.
Hi Grzegorz,
Wow, Natural Language Processing looks quite complex! But it also seems to be closely related to my problem. If someone finds an "NLP for dummies" article or book, I'm interested. ;-)
Thanks for your help,
Olivier.

Olivier Boudry wrote:
Wow, Natural Language Processing looks quite complex! But it also seems to be closely related to my problem. If someone finds an "NLP for dummies" article or book, I'm interested. ;-)
Especially in the fuzzy cases like this one, NLP often turns to machine learning models. One could try to train a hidden Markov model or support vector machines to label parts of the string as "name", "street", "number", "city", etc. These techniques work very well for part-of-speech tagging in natural language, and this seems similar. However, you need a manually annotated set of examples to train the models.
If you really have a big load of data and it seems like a good solution, you could use an off-the-shelf part-of-speech tagger like SVMTool (http://www.lsi.upc.edu/~nlp/SVMTool/) to do it.
Reinier
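As a rough illustration of the hidden Markov model idea, here is a small Haskell sketch of Viterbi decoding over a handful of address labels. The transition and emission probabilities are invented; in a real system both would be estimated from the manually annotated examples mentioned above, and unknown words would need proper smoothing, so this is a sketch of the technique rather than a usable tagger.

-- A toy HMM tagger: Viterbi decoding with hand-picked probabilities.
import Data.List (maximumBy)
import Data.Ord (comparing)
import qualified Data.Map as Map

type Label = String
type Token = String

labels :: [Label]
labels = ["NUMBER", "STREET", "CITY", "ZIP"]

-- P(label_i | label_{i-1}); "START" is a dummy initial state.
-- Unlisted transitions get a small default probability.
trans :: Label -> Label -> Double
trans prev next = Map.findWithDefault 0.01 (prev, next) table
  where table = Map.fromList
          [ (("START","NUMBER"), 0.7), (("START","STREET"), 0.2)
          , (("NUMBER","STREET"), 0.9)
          , (("STREET","STREET"), 0.5), (("STREET","CITY"), 0.4)
          , (("CITY","ZIP"), 0.8) ]

-- P(token | label): a crude emission model based on the token's shape.
emit :: Label -> Token -> Double
emit "NUMBER" t | all (`elem` "0123456789") t, length t <= 4 = 0.8
emit "ZIP"    t | all (`elem` "0123456789") t, length t == 5 = 0.9
emit "STREET" t | t `elem` ["ST", "AVE", "MAIN", "ELM"]      = 0.6
emit "CITY"   t | t `elem` ["SPRINGFIELD", "PORTLAND"]       = 0.7
emit _        _                                              = 0.05

-- Viterbi decoding: keep, for each label, the best path ending in that label.
viterbi :: [Token] -> [Label]
viterbi toks =
  drop 1 . reverse . snd . maximumBy (comparing fst) $ foldl step start toks
  where
    start = [ (1.0, ["START"]) ]
    step paths tok =
      [ maximumBy (comparing fst)
          [ (p * trans prev l * emit l tok, l : path)
          | (p, path@(prev:_)) <- paths ]
      | l <- labels ]

main :: IO ()
main = print (viterbi ["123", "MAIN", "ST", "SPRINGFIELD", "62704"])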

On Nov 29, 2007 5:31 AM, Reinier Lamers wrote:
If you really have a big load of data and it seems like a good solution, you could use an off-the-shelf part-of-speech tagger like SVMTool (http://www.lsi.upc.edu/~nlp/SVMTool/) to do it.
Reinier
Hi Reinier,
Thanks for the link to SVMTool. I don't have the background to understand most of the NLP articles I found and get stuck on the first bits of NLP jargon. For me, using an existing tool will be easier than building a new one.
I'm currently looking at the tool's documentation and it looks quite promising. It seems to be very generic and highly reusable.
Cheers,
Olivier.
participants (4):
- Grzegorz Chrupala
- Hans van Thiel
- Olivier Boudry
- Reinier Lamers