
2009/4/16 Michael Mossey
I was thinking that it might be useful to have a Google-like "do you mean this?" feature. If the field name is //customer=, then the parser might recognize a huge list of variants like //ustomer=, //customor=, etc... that is, recognize them well enough to continue parsing and give a decent error message in context.
Any ideas how to go about this?
To measure how similar two strings are, you can use a metric like Levenshtein distance, Damerau-Levenshtein distance, or Jaro-Winkler distance:

http://en.wikipedia.org/wiki/Levenshtein_distance
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
http://en.wikipedia.org/wiki/Jaro-Winkler_distance

The first two basically count the number of mistakes that a user would have to make to get from the correct string to the one you read from the file. There's an 'edit-distance' package in Hackage that implements the first two:

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/edit-distance

When you find an unrecognised field name in the file, you could calculate the edit distance to each correct field name, and if there's one within a certain threshold, assume that's what the user meant. If there's more than one close match, maybe it's better to report an error than risk choosing the wrong one.

I imagine this brute-force approach would be fast enough, but if not you could look at the techniques used by spell checkers to suggest corrections. Maybe even use a spell checking library, if such a thing exists (either pure Haskell or a binding to a library like aspell, although I couldn't see either from a quick search in Hackage).

Andy
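
P.S. To make the brute-force idea concrete, here is a minimal sketch. It uses a hand-rolled Levenshtein distance so that it only needs base; the 'edit-distance' package could be swapped in instead. The names levenshtein, Suggestion and suggest, and the idea of passing the threshold as a parameter, are my own assumptions for illustration, not something taken from an existing parser.

```haskell
import Data.List (foldl')

-- Plain Levenshtein distance (insertions, deletions, substitutions),
-- computed one row of the dynamic-programming table at a time.
levenshtein :: String -> String -> Int
levenshtein s t = last (foldl' nextRow [0 .. length t] s)
  where
    -- prev holds the distances from the prefix of s seen so far
    -- to every prefix of t; build the row for one more character c.
    nextRow [] _ = []
    nextRow prev@(p : ps) c = scanl step (p + 1) (zip3 t prev ps)
      where
        step left (tc, diag, up) =
          minimum [up + 1, left + 1, diag + fromEnum (tc /= c)]

-- Hypothetical result type for looking up a field name.
data Suggestion
  = Exact String        -- the name is a known field
  | DidYouMean String   -- exactly one known field is close enough
  | Ambiguous [String]  -- several close matches: report an error
  | Unknown             -- nothing within the threshold
  deriving (Eq, Show)

-- Brute force: compare the name read from the file against every
-- known field name and keep those within the given threshold.
suggest :: Int -> [String] -> String -> Suggestion
suggest threshold known name
  | name `elem` known = Exact name
  | otherwise =
      case [k | k <- known, levenshtein k name <= threshold] of
        []  -> Unknown
        [k] -> DidYouMean k
        ks  -> Ambiguous ks
```

With a threshold of 2 (an arbitrary choice), suggest 2 ["customer", "address", "amount"] "customor" gives DidYouMean "customer", so the parser could carry on as if the field were customer and mention the correction in its error message. A reasonable threshold probably depends on how long and how similar the real field names are.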