Google-like "do you mean?" feature

I'm thinking of writing a parser to load files that my customers have created. I'm a software requirements engineer; the data consists of the customers' thoughts in response to the latest release of the requirements doc. In fact, the files will probably be copies of the requirements doc itself, into which customers have entered their notes and made changes.

The original requirements doc will have a format that can be parsed; probably something simple like lines marked with codes like

//customer={customer name goes here}
//requirement={requirement text goes here}

When I parse the documents that come back from the customers, they are likely to contain some errors. Field names may be mangled or misspelled. Customer names may be entered in unrecognizable variants (e.g. someone named "Michael" is indicated as "Mike") and so forth.

I was thinking that it might be useful to have a Google-like "do you mean this?" feature. If the field name is //customer=, then the parser might recognize a huge list of variants like //ustomer=, //customor=, etc... that is, recognize them well enough to continue parsing and give a decent error message in context.

Any ideas how to go about this? I don't think I would create a parser language that includes every variant, but instead the field names would be tokens that could be passed to another routine. The variants could be generated ahead of time. I would limit the number of variants to something manageable, like 10,000 for each field name.

Thanks,
Mike
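As a starting point, a line-oriented parser for the format described above might look something like this minimal sketch; the Field type and all names are illustrative, not part of any actual requirements-doc format:

    import Data.List (stripPrefix)

    data Field = Customer String | Requirement String | Unknown String
      deriving Show

    -- Parse one marked line; anything with an unrecognised marker is
    -- kept as a candidate for the "do you mean?" check discussed below.
    parseLine :: String -> Field
    parseLine l
      | Just v <- stripPrefix "//customer="    l = Customer v
      | Just v <- stripPrefix "//requirement=" l = Requirement v
      | otherwise                                = Unknown l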

2009/4/16 Michael Mossey
I was thinking that it might be useful to have a Google-like "do you mean this?" feature. If the field name is //customer=, then the parser might recognize a huge list of variants like //ustomer=, //customor=, etc... that is, recognize them well enough to continue parsing and give a decent error message in context.
Any ideas how to go about this?
To measure how similar two strings are, you can use a metric like Levenshtein distance, Damerau-Levenshtein distance, or Jaro-Winkler distance:

http://en.wikipedia.org/wiki/Levenshtein_distance
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
http://en.wikipedia.org/wiki/Jaro-Winkler_distance

The first two basically count the number of mistakes that a user would have to make to get from the correct string to the one you read from the file. There's an 'edit-distance' package in Hackage that implements the first two:

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/edit-distance

When you find an unrecognised field name in the file, you could calculate the edit distance to each correct field name, and if there's one within a certain threshold, assume that's what the user meant (if there's more than one close match, maybe it's better to report an error than risk choosing the wrong one).

I imagine this brute-force approach would be fast enough, but if not you could look at the techniques used by spell checkers to suggest corrections. Maybe even use a spell checking library, if such a thing exists (either pure Haskell or a binding to a library like aspell, although I couldn't see either from a quick search in Hackage).

Andy
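A minimal sketch of the threshold-plus-ambiguity check described above, using the edit-distance package; the field names, threshold, and function name are illustrative, not from the thread:

    import Text.EditDistance (defaultEditCosts, levenshteinDistance)

    fieldNames :: [String]
    fieldNames = ["//customer=", "//requirement="]

    -- Suggest the intended field name for an unrecognised token:
    -- Just the unique candidate within the threshold, or Nothing if
    -- no candidate is close enough or the match is ambiguous.
    suggest :: Int -> String -> Maybe String
    suggest threshold token =
      case close of
        [name] -> Just name
        _      -> Nothing  -- none, or several: better to report an error
      where
        close = [ name
                | name <- fieldNames
                , levenshteinDistance defaultEditCosts token name <= threshold ]

For example, suggest 2 "//ustomer=" gives Just "//customer=", while a token within the threshold of two different field names gives Nothing.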

To measure how similar two strings are, you can use a metric like Levenshtein distance, Damerau-Levenshtein distance, or Jaro-Winkler distance
I tried some of those just the other day, and in my app this worked much better: http://www.catalysoft.com/articles/StrikeAMatch.html
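That article's metric compares strings by the adjacent letter pairs they share. A rough sketch of the idea, simplified to whole strings (the article works word by word) and with illustrative names:

    import Data.Char (toUpper)
    import Data.List (delete)

    -- Upper-cased adjacent letter pairs of a string.
    letterPairs :: String -> [String]
    letterPairs s = zipWith (\a b -> [a, b]) u (drop 1 u)
      where u = map toUpper s

    -- Size of the multiset intersection: each pair matches at most once.
    shared :: [String] -> [String] -> Int
    shared []     _  = 0
    shared (p:ps) qs
      | p `elem` qs = 1 + shared ps (delete p qs)
      | otherwise   = shared ps qs

    -- Score in [0,1]: the article's 2 * |shared pairs| / |total pairs|.
    similarity :: String -> String -> Double
    similarity s1 s2
      | null p1 && null p2 = 1.0
      | otherwise          = 2 * fromIntegral (shared p1 p2)
                               / fromIntegral (length p1 + length p2)
      where
        p1 = letterPairs s1
        p2 = letterPairs s2

Unlike raw edit distance, the score is normalised by length, which can behave better when the strings being compared differ a lot in size.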

On Wed, 15 Apr 2009 23:31:50 -0700, Michael Mossey wrote:
I was thinking that it might be useful to have a Google-like "do you mean this?" feature. If the field name is //customer=, then the parser might recognize a huge list of variants like //ustomer=, //customor=, etc...
You could reduce the probability of such errors by providing a standard template that could be copy-pasted in wherever necessary.

-- Robin

Robin Green wrote:
On Wed, 15 Apr 2009 23:31:50 -0700, Michael Mossey wrote:
I was thinking that it might be useful to have a Google-like "do you mean this?" feature. If the field name is //customer=, then the parser might recognize a huge list of variants like //ustomer=, //customor=, etc...
You could reduce the probability of such errors by providing a standard template that could be copy-pasted in wherever necessary.
Yes, that will be there. My example is not so good because it focuses on the keywords only. I'm more concerned about errors in the data they enter... for example, names of people and references to document names.

Thanks,
Mike

2009/4/16 Michael Mossey
I don't think I would create a parser language that includes every variant, but instead the field names would be tokens that could be passed to another routine.
Right.
The variants could be generated ahead of time. I would limit the number of variants to something manageable, like 10,000 for each field name.
Generating variants ahead of time isn't necessary. Instead, you could just look at the edit distance between the token you get and each possible valid field name, using something like:

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/edit-distance

The field name with the least edit distance to the token is the one you should suggest.

Cheers,
Max
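Concretely, that nearest-match idea might look like this (a sketch; the function name is illustrative, and the candidate list is assumed non-empty):

    import Text.EditDistance (defaultEditCosts, levenshteinDistance)
    import Data.List (minimumBy)
    import Data.Ord (comparing)

    -- The candidate with the smallest edit distance to the token read
    -- from the file. Assumes a non-empty candidate list.
    nearest :: [String] -> String -> String
    nearest candidates token =
      minimumBy (comparing (levenshteinDistance defaultEditCosts token))
                candidates

So nearest ["//customer=", "//requirement="] "//customor=" picks "//customer="; combined with a distance threshold, as Andy suggests above, it only offers the suggestion when the match is plausible.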
participants (5)
- Andy Smith
- Max Bolingbroke
- Michael Mossey
- Robin Green
- Simon Michael