
Am Dienstag 26 Januar 2010 14:11:06 schrieb Eduard Sergeev:
Matthew Phillips-5 wrote:
I also found it to to be very slow.
My variant: http://a-ejeli-tak.livejournal.com/1326.html Spellchecker in Haskell String version runs in 2.5 sec, ByteString in 1.2 sec (just for one word e.g. just to build the tree). 8 sec to check input of 400 words (copied from Norvig's example).
Slower here, time for building the set is approximately equal to that of of the frequency map [either String or ByteString], lookup a little slower if one edit is needed, much faster if two are needed (of course). But the lazy Levenshtein distance is much faster again, for the 'tests2' data from Norvig's http://norvig.com/spell.py (400 words), $ xargs -a tdata.txt time ./nLDBSWSpelling > /dev/null 4.50user 0.03system 0:04.53elapsed 100%CPU $ time ./esergSpellBS big.txt tdata.txt > /dev/null 28.23user 0.09system 0:28.32elapsed 100%CPU surprisingly (?), your plain String version is faster for that than your ByteString version: $ time ./esergSpellS big.txt tdata.txt > /dev/null 25.07user 0.10system 0:25.18elapsed 99%CPU
I think laziness helps here to avoid unnecessary checks (once the first match is found).
But that's the point, these checks aren't unnecessary (unless the word under inspection is known). You want to propose the most likely correct word. If your input is "arthetic", should you return "aesthetic", just because it's the first of the (at least four) correct words with edit distance 2[*] which is produced by your arbitrary ordering of edit steps or it's the lexicographically smallest? I think you shouldn't.
Haven't tried it on a larger data sets neither tried to optimize it. Cheated on dictionary parsing though...
[*] The others I know are "bathetic", "pathetic" and "arthritic" - without context, I'd go for "arthritic" because I think spelling errors are more common i the middle of a word than at the beginning or at the end, but plain frequency analysis of the corpus suggests "pathetic".