Haskell version of Norvig's Python Spelling Corrector

Recently I read an interesting article by Peter Norvig[1] on how to write a spelling corrector in 21 lines of Python. I wanted to try to implement it in Haskell. My implementation is terribly slow and I was hoping for some tips on how to improve it and make it idiomatic. I'd love to see other Haskell implementations as well if anyone has a few moments to spare.

Admittedly, it took me several hours to get my version working, but I'm a Haskell newbie. Unfortunately, I think it runs as slow as it took me to write it! There is definitely something wrong with it, a memory leak, because I can't correct more than a few words without a great deal of memory being consumed.

Thanks,
Pete

[1] http://norvig.com/spell-correct.html

Hi Pete,
Recently I read an interesting article by Peter Norvig[1] on how to write a spelling corrector in 21 lines of Python. I wanted to try to implement it in Haskell. My implementation is terribly slow and I was hoping for some tips on how to improve it and make it idiomatic.
I had a quick look at this. One way to speed up your program is by replacing the sets of altered words with lists. Filtering out duplicates is a noble cause, but in this case it takes more time than keeping them around and doing some extra lookups in your frequency map. (I tried this; it gave a speedup factor of ~3.)
I'd love to see other Haskell implementations as well if anyone has a few moments to spare. Admittedly, it took me several hours to get my version working, but I'm a Haskell newbie. Unfortunately, I think it runs as slow as it took me to write it! There is definitely something wrong with it, a memory leak, because I can't correct more than a few words without a great deal of memory being consumed.
Be careful when you apply the |train| function to more than one word; in this form it may recompute the frequency map from scratch on each invocation. (It is better to let |train| take a frequency map instead of a list of words.) Also be sure to compile your program with full optimisation ('-O2' for ghc).
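A minimal sketch of that refactoring might look like this; the file name and the placeholder body of 'correct' are my own assumptions, and the point is only that the map is built once and then passed in:

    import Data.List (foldl')
    import qualified Data.Map as M

    type WordFreq = M.Map String Int

    -- build the frequency map exactly once
    train :: [String] -> WordFreq
    train = foldl' bump M.empty
      where bump m w = M.insertWith (+) w 1 m

    -- the corrector takes the prebuilt map instead of retraining;
    -- the fallback here is a placeholder, not real correction logic
    correct :: WordFreq -> String -> String
    correct freqs w
      | w `M.member` freqs = w
      | otherwise          = w   -- rank edit candidates against freqs here

    main :: IO ()
    main = do
      corpus <- readFile "big.txt"        -- training text (an assumption)
      let freqs = train (words corpus)
      interact $ unlines . map (correct freqs) . lines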

Pete Kazmier writes:

I'd love to see other Haskell implementations as well if anyone has a few moments to spare. Admittedly, it took me several hours to get my version working, but I'm a Haskell newbie. Unfortunately, I think it runs as slow as it took me to write it! There is definitely something wrong with it, a memory leak, because I can't correct more than a few words without a great deal of memory being consumed.
As monochrom pointed out on #haskell, I am using 'interact' incorrectly. For some reason I thought 'interact' applied its argument to each line of the input. I've replaced it as follows:

    interact $ unlines . map (show . (id &&& correct)) . lines

The program is still terribly slow due to my use of lists. Is there a better way to write 'edits1'?

Thanks,
Pete
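For reference, a list-based 'edits1' in the shape of Norvig's Python might look something like this (my own sketch, not code from the thread):

    -- all strings one edit away from a word: deletions, transpositions,
    -- replacements, and insertions, as in Norvig's article
    edits1 :: String -> [String]
    edits1 word = deletes ++ transposes ++ replaces ++ inserts
      where
        splits     = [ splitAt i word | i <- [0 .. length word] ]
        alphabet   = ['a' .. 'z']
        deletes    = [ l ++ tail r | (l, r) <- splits, not (null r) ]
        transposes = [ l ++ r !! 1 : head r : drop 2 r
                     | (l, r) <- splits, length r > 1 ]
        replaces   = [ l ++ c : tail r
                     | (l, r) <- splits, not (null r), c <- alphabet ]
        inserts    = [ l ++ c : r | (l, r) <- splits, c <- alphabet ]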

I try using WordSet = [String] (plus corresponding change in code) and get great speedup, actually way more than 3x. There was also a memory growth phenomenon using Set String, and replacement by [String] stops that too, now it's constant space (constant = 20M). It is possible to attribute part of the speedup to excellent rewrite rules in GHC regarding lists; however, I cannot explain the memory growth when using Set.

Regarding the local WordFreq map under "train", I am shocked that ghc -O is smart enough to notice it and perform proper sharing, and only one copy is ever created. Nonetheless, I still decide to factor "train" into two, one builds the WordFreq and the other queries it, which eases blame analysis when necessary.

On the interact line, I use "tokens" to break up the input, since it's already written (for the trainer), may as well reuse it.

When reading holmes.txt, be aware that it is in UTF-8, while GHC still assumes ISO-8859-1. This will affect results.

I have not checked the correctness of edits1.

I am monochrom. My modification is attached.

Albert Y. C. Lai wrote:
I try using WordSet = [String] (plus corresponding change in code) and get great speedup, actually way more than 3x. There was also a memory growth phenomenon using Set String, and replacement by [String] stops that too, now it's constant space (constant = 20M). It is possible to attribute part of the speedup to excellent rewrite rules in GHC regarding lists; however, I cannot explain the memory growth when using Set.
Now I see. The higher memory usage of the Set version is not growth or leak, just brutal. For each input word, a ton of derived words are generated, and an optimal one is chosen (by a criterion that's essentially a mapping from words to scores). There are two ways to do this.

You could put the derived words into a lazy list and then traverse it, and therefore they're generated, examined, and thrown away on demand. At most two words are ever going to be stored (the best one so far and the next candidate), maybe plus a couple of cons cells if the compiler does not optimize.

Or you could put all derived words into a set first (to avoid duplicates, but note that you still have to generate a ton; you only save by storing just half a ton), then traverse it. Storing a ton, or even just half a ton, of these words is going to clog up memory.

The perceived growth is only because some input words cause fewer derived words and some other input words encountered later cause more derived words. So, plotting memory over time, once in a while we have more derived words than before, so the heap grows for that; they are still thrown away correctly in due course. If you input the same input word over and over again, space usage is still constant.
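The contrast Albert describes can be sketched like this; 'score' stands in for the frequency-based criterion, and the code is my own illustration:

    import Data.List (maximumBy)
    import Data.Ord (comparing)
    import qualified Data.Set as S

    -- lazy list: candidates are generated, compared, and dropped one at
    -- a time, so only the best-so-far needs to stay live
    bestOfList :: (String -> Int) -> [String] -> String
    bestOfList score = maximumBy (comparing score)

    -- set: every candidate is materialized in the heap before any
    -- comparison happens, even though duplicates are removed
    bestOfSet :: (String -> Int) -> [String] -> String
    bestOfSet score = maximumBy (comparing score) . S.toList . S.fromList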

On Sun, 2007-04-22 at 00:25 -0400, Pete Kazmier wrote:
Pete Kazmier writes:

I'd love to see other Haskell implementations as well if anyone has a few moments to spare. Admittedly, it took me several hours to get my version working, but I'm a Haskell newbie. Unfortunately, I think it runs as slow as it took me to write it!
Hm - nobody suggested using ByteStrings yet? String is notoriously wasteful, and replacing it with ByteString normally gives a quite significant speedup.

Worse - and this is true for ByteStrings, too - String comparisons are O(n), which means lookups in Sets and Maps are expensive. A trie (i.e., a search tree where each internal node corresponds to a word prefix, and has one branch per letter in the alphabet) will give you lookup that scales with word size (and possibly alphabet size).

Instead of generating the (huge) list of misspelled words, you could calculate edit distance to each correctly spelled word? With a bound on allowable mistakes, this is a linear time operation using the standard dynamic programming algorithm.

-k
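A minimal trie of the kind Ketil describes might look like this (my own sketch; a production version would want a more compact node representation):

    import qualified Data.Map as M

    -- each node stores how often the word ending here occurred, plus
    -- one child per next letter
    data Trie = Trie { count :: !Int, children :: M.Map Char Trie }

    emptyTrie :: Trie
    emptyTrie = Trie 0 M.empty

    insertWord :: String -> Trie -> Trie
    insertWord []     (Trie n cs) = Trie (n + 1) cs
    insertWord (c:cs) (Trie n m)  = Trie n (M.insert c (insertWord cs child) m)
      where child = M.findWithDefault emptyTrie c m

    -- lookup cost depends on word length (and log alphabet size), not on
    -- how many words are stored
    lookupCount :: String -> Trie -> Int
    lookupCount []     (Trie n _) = n
    lookupCount (c:cs) (Trie _ m) = maybe 0 (lookupCount cs) (M.lookup c m)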

Ketil Malde writes:

On Sun, 2007-04-22 at 00:25 -0400, Pete Kazmier wrote:

I'd love to see other Haskell implementations as well if anyone has a few moments to spare. Admittedly, it took me several hours to get my version working, but I'm a Haskell newbie. Unfortunately, I think it runs as slow as it took me to write it!
Hm - nobody suggested using ByteStrings yet? String is notoriously wasteful, and replacing it with ByteString normally gives a quite significant speedup.
I actually have a ByteString version, but it runs much slower. This part of the code is where all of the time is spent in the ByteString version:

    type WordFreq = M.Map B.ByteString Int

    train :: [B.ByteString] -> WordFreq
    train words = frequencyMap
      where
        frequencyMap     = foldr incWordCount M.empty words
        incWordCount w m = M.insertWith (+) w 1 m
Worse - and this is true for ByteStrings, too - String comparisons are O(n), which means lookups in Sets and Maps are expensive. A trie (i.e., a search tree where each internal node corresponds to a word prefix, and has one branch per letter in the alphabet) will give you lookup that scales with word size (and possibly alphabet size).
Right. My first version was just a direct translation of Norvig's code with an emphasis on trying to keep the complexity and size of code to a minimum.
Instead of generating the (huge) list of misspelled words, you could calculate edit distance to each correctly spelled word? With a bound on allowable mistakes, this is a linear time operation using the standard dynamic programming algorithm.
Could you provide additional information on this "standard dynamic programming algorithm?" I'm not familiar with dynamic programming.

Thanks!
Pete

On Sun, 2007-04-22 at 11:51 -0400, Pete Kazmier wrote:
    type WordFreq = M.Map B.ByteString Int

    train :: [B.ByteString] -> WordFreq
    train words = frequencyMap
      where
        frequencyMap     = foldr incWordCount M.empty words
        incWordCount w m = M.insertWith (+) w 1 m
Isn't this going to build unevaluated thunks for the counts? This is what frequency counting looks like in some of my old code:

    -- IM is Data.IntMap
    index ss = foldl' myupdate IM.empty ss
      where myupdate m k = let x = 1 + IM.findWithDefault 0 k m
                           in x `seq` IM.insert k x m

(edited a bit, but hopefully correct)

I.e. use foldl', and look up previous value, succ and seq it, and put it back.

Generally, I only seem to want strict Maps. It seems so easy to build a lazy one if you need it (data Lazy a = Lazy a), so I wonder if strict should be the default?
Instead of generating the (huge) list of misspelled words, you could calculate edit distance to each correctly spelled word? With a bound on allowable mistakes, this is a linear time operation using the standard dynamic programming algorithm.
Could you provide additional information on this "standard dynamic programming algorithm?" I'm not familiar with dynamic programming.
Sorry! Dynamic programming just means memoizing a recursive function to reduce its run-time complexity (although I don't think I've seen that definition anywhere?). A simple example is finding the n'th Fibonacci number by calculating (and referring to) the whole list.

For edit distance, you can do something like:

    data Edit a = Repl a a | Delete a | Insert a

    align (x:xs) (y:ys) = best [ Repl x y : align xs ys
                               , Insert y : align (x:xs) ys
                               , Delete x : align xs (y:ys) ]

Where 'best' is some fitness function, selecting the best alignment.

Applying this directly gives an exponential algorithm, but since there are only n*m (n = length (x:xs), m = length (y:ys)) possible ways to pick the remainders in the recursive calls throughout the execution, you might as well store them all in an n*m matrix - which only takes O(nm) time.

This is probably the worst explanation of DP so far this century, but if you Google for it, you'll find tons of information. Wikipedia has an entry as well.

Anyway: since the number of edits in this case is bounded (to four, since a transposition can be modelled by an insertion and a deletion), you know the optimal solution will never stray further than two off the diagonal of the DP matrix, so you only need to calculate a diagonal band (linear space/time), not the full matrix (quadratic).

-k
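The Fibonacci example mentioned here is usually written as follows; this is standard Haskell folklore rather than code from the thread:

    -- the lazy list memoizes every previously computed value, so
    -- evaluating fib n costs n additions instead of exponentially
    -- many recursive calls
    fibs :: [Integer]
    fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

    fib :: Int -> Integer
    fib n = fibs !! n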

On Sun, Apr 22, 2007 at 10:10:44PM +0200, Ketil Malde wrote:
On Sun, 2007-04-22 at 11:51 -0400, Pete Kazmier wrote:
    type WordFreq = M.Map B.ByteString Int

    train :: [B.ByteString] -> WordFreq
    train words = frequencyMap
      where
        frequencyMap     = foldr incWordCount M.empty words
        incWordCount w m = M.insertWith (+) w 1 m
Isn't this going to build unevaluated thunks for the counts? This is what frequency counting looks like in some of my old code:
    -- IM is Data.IntMap
    index ss = foldl' myupdate IM.empty ss
      where myupdate m k = let x = 1 + IM.findWithDefault 0 k m
                           in x `seq` IM.insert k x m
(edited a bit, but hopefully correct)
I.e. use foldl', and look up previous value, succ and seq it, and put it back.
Generally, I only seem to want strict Maps. It seems so easy to build a lazy one if you need it (data Lazy a = Lazy a), so I wonder if strict should be the default?
Instead of generating the (huge) list of misspelled words, you could calculate edit distance to each correctly spelled word? With a bound on allowable mistakes, this is a linear time operation using the standard dynamic programming algorithm.
Could you provide additional information on this "standard dynamic programming algorithm?" I'm not familiar with dynamic programming.
Sorry! Dynamic programming just means memoizing a recursive function to reduce its run-time complexity (although I don't think I've seen that definition anywhere?). A simple example is finding the n'th Fibonacci number by calculating (and referring to) the whole list.
For edit distance, you can do something like:
    data Edit a = Repl a a | Delete a | Insert a

    align (x:xs) (y:ys) = best [ Repl x y : align xs ys
                               , Insert y : align (x:xs) ys
                               , Delete x : align xs (y:ys) ]
Where 'best' is some fitness function, selecting the best alignment.
Applying this directly gives an exponential algorithm, but since there are only n*m (n = length (x:xs), m = length (y:ys)) possible ways to pick the remainders in the recursive calls throughout the execution, you might as well store them all in an n*m matrix - which only takes O(nm) time.
This is probably the worst explanation of DP so far this century, but if you Google for it, you'll find tons of information. Wikipedia has an entry as well.
Anyway: since the number of edits in this case is bounded (to four, since a transposition can be modelled by an insertion and a deletion), you know the optimal solution will never stray further than two off the diagonal of the DP matrix, so you only need to calculate a diagonal band (linear space/time), not the full matrix (quadratic).
You can almost always do DP with less space, and this is no exception: linear space for full Levenshtein, constant for the diagonal band.

http://www.haskell.org/haskellwiki/Dynamic_programming_example#Optimization

Stefan

On Sun, 2007-04-22 at 13:15 -0700, Stefan O'Rear wrote:
You can almost always do DP with less space, and this is no exception: linear space for full Levenshtein, constant for the diagonal band.
Right. Especially as we only want the final score and not the exact edits, we only need to keep one column of the matrix. If you want the actual edits, you must backtrack from the optimal score -- or, since this is Haskell, keep the edits around as lazy lists, which, I think, amounts to the same thing.

Even if we do want the exact edits, we can get linear space with a divide and conquer approach: find the position in the middle column with the best score (summing the alignment scores of the prefix of xs vs ys and the suffix of xs vs ys), and then do a separate alignment in the two quadrants (and recurse, of course). This means an extra log n time factor, though.

This is all very theoretical for a problem where we align words of about ten characters, of course. :-)

-k
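For concreteness, the one-column version might look like this in Haskell; this is a sketch of the standard algorithm, not code posted in the thread:

    import Data.List (foldl')

    -- Levenshtein distance, keeping only one column of the DP matrix;
    -- the column is indexed by prefixes of ys
    editDistance :: String -> String -> Int
    editDistance xs ys = last (foldl' nextCol firstCol xs)
      where
        firstCol = [0 .. length ys]   -- distances from "" to each prefix of ys
        nextCol col x = scanl step (head col + 1) (zip3 ys col (tail col))
          where step left (y, diag, up) =
                  minimum [ left + 1                         -- insertion
                          , up   + 1                         -- deletion
                          , diag + if x == y then 0 else 1 ] -- substitution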

On 4/22/07, Pete Kazmier wrote:
Worse - and this is true for ByteStrings, too - String comparisons are O(n), which means lookups in Sets and Maps are expensive. A trie (i.e., a search tree where each internal node corresponds to a word prefix, and has one branch per letter in the alphabet) will give you lookup that scales with word size (and possibly alphabet size).
Right. My first version was just a direct translation of Norvig's code with an emphasis on trying to keep the complexity and size of code to a minimum.
AFAIK, Python's dictionaries are implemented using hashtables, so lookups with them are much faster than with M.Map String Int. So should a direct translation use a Data.HashTable (despite its impure interface)? But probably a trie would be better than both (just gambling ;-).

Cheers,
-- Felipe.
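For comparison, frequency counting with the Data.HashTable that shipped with GHC at the time might look roughly like this. This is a sketch from memory of that old API (new, lookup, update, hashString), so treat the details as assumptions:

    import qualified Data.HashTable as H

    -- count word frequencies in IO, mutating the table in place
    train :: [String] -> IO (H.HashTable String Int)
    train ws = do
      t <- H.new (==) H.hashString
      let bump w = do mb <- H.lookup t w
                      H.update t w (maybe 1 (+1) mb)  -- insert or overwrite
      mapM_ bump ws
      return t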

Ketil Malde wrote:
Hm - nobody suggested using ByteStrings yet?
I wrote an independent port of Norvig's spellchecker, because I figured it would make the basis of an interesting tutorial. For such a small program, it's been quite a challenge!

I started out using lazy ByteStrings, Data.Map, and Data.Set. As Albert observed, using Data.Set is poison for heap size and performance. The result of switching from sets to lists was a > 90% reduction in memory usage, and a big (but unmeasured) speedup.

After this switch, I found that spellchecking one word still took 4x as long in Haskell as Norvig's Python program. Since I was checking only one word in each case, essentially all of the runtime was taken up by building the word frequency map.

In my profile results, I find that simply converting words to lower case accounts for a whopping 40% of time and allocation (see the attachment for my definition of the train function).

    COST CENTRE  MODULE  %time  %alloc
    lower        Spell    40.5    41.2
    train        Spell    26.3    14.3
    mkWords      Spell    21.9    24.1

I was interested in building a profile-enabled version of fps to see what might be going on inside the library, but was stymied by not knowing how to get GHC 6.6's base to coexist peacefully with fps (hiding base isn't very practical).

My experiments are available here:

    darcs get http://darcs.serpentine.com/spell

Norvig's training data is available from http://norvig.com/big.txt

Bryan O'Sullivan writes:

After this switch, I found that spellchecking one word still took 4x as long in Haskell as Norvig's Python program. Since I was checking only one word in each case, essentially all of the runtime was taken up by building the word frequency map.
    train = foldl' updateMap M.empty . map lower . mkWords
      where
        updateMap model word = M.insertWith' (+) word 1 model
        mkWords = filter (not . B.null) . X.splitWith isNotAlpha
        lower !s = X.map toLower s
        isNotAlpha !c = c < 0x41 || (c > 0x5a && c < 0x61) || c > 0x7a
        toLower !c | c >= 0x41 && c <= 0x5a = c + 0x20
                   | otherwise = c
After reading Bryan's post, I switched my right fold to a strict left fold and almost tripled my original speed. Could someone provide some guidelines on when to use each? I thought foldr should be used when building some large data structure such as the map I build.

Bryan, out of curiosity, is a non bytestring version of your code faster?

Thanks,
Pete

Pete Kazmier wrote:
Bryan O'Sullivan writes:

After this switch, I found that spellchecking one word still took 4x as long in Haskell as Norvig's Python program. Since I was checking only one word in each case, essentially all of the runtime was taken up by building the word frequency map.
    train = foldl' updateMap M.empty . map lower . mkWords
      where
        updateMap model word = M.insertWith' (+) word 1 model
        mkWords = filter (not . B.null) . X.splitWith isNotAlpha
        lower !s = X.map toLower s
        isNotAlpha !c = c < 0x41 || (c > 0x5a && c < 0x61) || c > 0x7a
        toLower !c | c >= 0x41 && c <= 0x5a = c + 0x20
                   | otherwise = c
After reading Bryan's post, I switched my right fold to a strict left fold and almost tripled my original speed. Could someone provide some guidelines on when to use each? I thought foldr should be used when building some large data structure such as the map I build.
Bryan, out of curiosity, is a non bytestring version of your code faster?

http://www.haskell.org/hawiki/StackOverflow

Derek Elkins writes:

After reading Bryan's post, I switched my right fold to a strict left fold and almost tripled my original speed. Could someone provide some guidelines on when to use each? I thought foldr should be used when building some large data structure such as the map I build. Bryan, out of curiosity, is a non bytestring version of your code faster?

http://www.haskell.org/hawiki/StackOverflow
I read the article and understand it, but I am having a hard time applying that specifically to my use of foldr. Here is how I was using foldr:

    type WordFreq = M.Map B.ByteString Int

    train :: [B.ByteString] -> WordFreq
    train words = frequencyMap
      where
        frequencyMap     = foldr incWordCount M.empty words
        incWordCount w m = M.insertWith (+) w 1 m

So is 'incWordCount' strict in its second argument? I'm still not sure exactly what that means. According to the wiki page, if it is strict in the second argument, I should have used foldl' instead of foldr.

Thanks,
Pete

Pete Kazmier wrote:
Derek Elkins writes:

After reading Bryan's post, I switched my right fold to a strict left fold and almost tripled my original speed. Could someone provide some guidelines on when to use each? I thought foldr should be used when building some large data structure such as the map I build. Bryan, out of curiosity, is a non bytestring version of your code faster?

http://www.haskell.org/hawiki/StackOverflow
I read the article and understand it, but I am having a hard time applying that specifically to my use of foldr. Here is how I was using foldr:
    type WordFreq = M.Map B.ByteString Int

    train :: [B.ByteString] -> WordFreq
    train words = frequencyMap
      where
        frequencyMap     = foldr incWordCount M.empty words
        incWordCount w m = M.insertWith (+) w 1 m
So is 'incWordCount' strict in its second argument? I'm still not sure exactly what that means. According to the wiki page, if it is strict in the second argument, I should have used foldl' instead of foldr.
insertWith will definitely need to examine 'm' to perform its action; therefore it is strict.

Pete Kazmier wrote:
    train :: [B.ByteString] -> WordFreq
    train words = frequencyMap
      where
        frequencyMap     = foldr incWordCount M.empty words
        incWordCount w m = M.insertWith (+) w 1 m
So is 'incWordCount' strict in its second argument? I'm still not sure exactly what that means.
Yes. incWordCount is strict in its second argument, since

    incWordCount x undefined == undefined

Of course you cannot see that from the definition of incWordCount alone; it depends on the behavior of M.insertWith.
According to the wiki page, if it is strict in the second argument, I should have used foldl' instead of foldr.
Remember that the difference between foldr and foldl is not one between left and right; both have to recurse from the left. But foldr is normal recursion, while foldl is accumulator recursion. You obviously wanted an accumulator, and it should usually be strictly evaluated. There is another bug of this sort in your code. Consider
incWordCount w m = M.insertWith (+) w 1 m
There is no reason to evaluate the sum inside the map; instead, an unevaluated thunk is put in there. Unfortunately, you need to take the long way of using M.lookup and M.insert to build a strict replacement for M.insertWith. (A strict variant of Data.Map would be useful here; unfortunately, there is none.)

-Udo
--
Quarrels would never last long if only one side were in the wrong. -- de la Rochefoucauld
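The "long way" Udo describes might look like this; a minimal sketch, with the seq on the new count being the essential part:

    import qualified Data.ByteString.Char8 as B
    import qualified Data.Map as M

    type WordFreq = M.Map B.ByteString Int

    -- force the new count before it goes into the map, so no chain of
    -- (+) thunks builds up behind each key
    incWordCount :: B.ByteString -> WordFreq -> WordFreq
    incWordCount w m = let n = 1 + M.findWithDefault 0 w m
                       in n `seq` M.insert w n m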

Udo Stenzel wrote:
There is another bug of this sort in your code. Consider
incWordCount w m = M.insertWith (+) w 1 m
There is no reason to evaluate the sum inside the map; instead, an unevaluated thunk is put in there.
Would not Data.Map.insertWith' do the trick?

Bryan O'Sullivan wrote:
Udo Stenzel wrote:
There is another bug of this sort in your code. Consider
incWordCount w m = M.insertWith (+) w 1 m
There is no reason to evaluate the sum inside the map; instead, an unevaluated thunk is put in there.
Would not Data.Map.insertWith' do the trick?
Oops, you're right; this is a fairly recent addition.

-Udo
--
Walk softly and carry a BFG-9000.

Pete Kazmier wrote:
Bryan, out of curiosity, is a non bytestring version of your code faster?
No, but the difference is smaller than I expected: the lazy ByteString version is about 1.8x faster than using String.

On Sun, 2007-04-22 at 12:55 -0400, Pete Kazmier wrote:
After reading Bryan's post, I switched my right fold to a strict left fold and almost tripled my original speed. Could someone provide some guidelines on when to use each? I thought foldr should be used when building some large data structure such as the map I build.
You are building a strict data structure, not a lazy one. You cannot use that map until every element has been inserted into it; there are no partial intermediate results you can use, as there are when, say, you build a lazy list. Hence it's better to use a strict left fold.

Rule of thumb: use a lazy fold to build lazy structures, and a strict fold to build strict ones.

Duncan
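Duncan's rule of thumb in miniature (my own example, not from the thread):

    import Data.List (foldl')

    -- lazy structure, lazy fold: a consumer can start using the head
    -- of the result before the rest of the fold has run
    doubleAll :: [Int] -> [Int]
    doubleAll = foldr (\x acc -> 2 * x : acc) []

    -- strict structure, strict fold: nothing is usable until the fold
    -- finishes, so force each intermediate result as you go
    sumAll :: [Int] -> Int
    sumAll = foldl' (+) 0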

Bryan O'Sullivan wrote:
In my profile results, I find that simply converting words to lower case accounts for a whopping 40% of time and allocation (see the attachment for my definition of the train function).
    COST CENTRE  MODULE  %time  %alloc
    lower        Spell    40.5    41.2
    train        Spell    26.3    14.3
    mkWords      Spell    21.9    24.1
A little more instrumentation says this (using the darcs head of fps built with -auto-all):

    COST CENTRE      MODULE                    %time  %alloc
    loopU            NewData.ByteString.Fusion  25.4    28.8
    splitWith        NewData.ByteString         15.4    17.2
    train            Spell                      10.2     6.1
    isNotAlpha       Spell                       9.4    12.2
    compareBytes     NewData.ByteString          8.8     9.6
    compareBytes     NewData.ByteString.Lazy     7.4     0.4
    inlinePerformIO  NewData.ByteString.Base     6.6     0.0

(At Stefan's suggestion, I renamed the modules in fps to NewData.*, in order to get around the name clashes when I try to concurrently use fps and base.)
participants (10)

- Albert Y. C. Lai
- Arie Peterson
- Bryan O'Sullivan
- Derek Elkins
- Duncan Coutts
- Felipe Almeida Lessa
- Ketil Malde
- Pete Kazmier
- Stefan O'Rear
- Udo Stenzel