Optimizing spelling correction program

Hi all,

I want to learn how to optimize Haskell code using Norvig's spelling correction program [1] as an example. Peter Norvig has written the program in Python; quite a literate translation to Haskell (thanks to Grzegorz Chrupala) [2] was also available.

Both versions are about 23 LOCs, and the Haskell version is four times slower, 2m25s versus 36s. All the numbers I give come from running the program on a list of 806 misspellings, with the Haskell version compiled with the -O2 and -fforce-recomp flags.

I started with a trial-and-error approach. Just preventing a repetitive call to keysSet brought the running time down to 1m48s. I also restructured the edits1 function and dropped all uses of sets, which hadn't been pulling their own weight. This brought the running time down to 55s. Not quite there yet, but close.

Reading the Profiling chapter in the RWH book enabled me to do some more informed optimizing. At this point the GC time was about 2% of the overall running time; Grzegorz had done a good job of using strict versions of functions where necessary. The timing per expression, however, showed something useful: 70% of the time was being spent determining membership in the known-words dictionary (Data.Map). There are about 30 000 items in the dictionary.

Right... Python uses hashtables, while here I have a tree with log n access time. I did not want to use Data.HashTable, as it would pervade my program with IO. The alternative is an ideal hash map that never gets changed. This program creates a dictionary at start which is then only read from: an ideal application for Data.PerfectHash by Mark Wotton, available on Hackage [3].

PerfectHash is a dictionary keyed by ByteStrings, so I had to migrate the program to use these instead of Strings. Sadly, that alone increased the running time to 59s. Adding PerfectHash then brought the running time down to 39s. I have done some code restructuring with a view to publishing it, which brought the running time up to 45s. The code is available on patch-tag [4] for anyone to scrutinize and hopefully point out mistakes (it uses the big.txt and misspellings files which are linked to from Norvig's page [1]).

At this point [5] I don't know how to proceed. The program is still slower than Python. The -ddump-simpl dump does not give me any clues; there is too much of it and I don't understand most of it. The GC time is about 10% of the total, and according to the profiler almost half the time is spent in the member function and almost all the rest in edits1. Shouldn't it run much faster than the interpreted Python?

Frankly, I am not impressed with the PerfectHash performance; it looks as if it is only about two times faster than Data.Map. Is this because of the number of elements?

1. http://www.norvig.com/spell-correct.html
2. http://pitekus.blogspot.com/2007/04/norvigs-spelling-corrector-in-haskell.ht...
3. http://hackage.haskell.org/package/PerfectHash
4. http://patch-tag.com/r/spellcorrect/
5. http://patch-tag.com/r/spellcorrect/snapshot/hash/20090621192727-a99e5-fed48...
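(The hot path described above has roughly the shape below - a frequency map built from the words of big.txt and a membership filter over the candidate edits. This is only a minimal sketch; the names train/known mirror Norvig's Python and are not necessarily those of the actual Haskell program.)

import qualified Data.Map as M
import Data.List (foldl')

type Dict = M.Map String Int

-- Frequency map over the training corpus (Norvig's NWORDS).
train :: [String] -> Dict
train = foldl' (\d w -> M.insertWith (+) w 1 d) M.empty

-- The membership test the profiler blamed for ~70% of the time:
-- one O(log n) comparison-based lookup per candidate edit.
known :: Dict -> [String] -> [String]
known dict = filter (`M.member` dict)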

kamil:
Hi all,
I want to learn how to optimize Haskell code on the Norvig's spelling correction program [1]. Peter Norvig has written the program in Python, quite a literate translation to Haskell (thanks to Grzegorz Chrupala) [2] was also available.
Both versions are about 23 LOCs, and the Haskell version is four times slower, 2m25s versus 36s. All the numbers I give come from running the program on a list of 806 misspellings, with the Haskell version compiled with the -O2 and -fforce-recomp flags.
I started with a trial-and-error approach. Just preventing a repetitive call to keysSet brought the running time down to 1m48s. I also restructured the edits1 function and dropped all uses of sets, which hadn't been pulling their own weight. This brought the running time down to 55s. Not quite there yet, but close.
Reading the Profiling chapter in the RWH book enabled me to do some more informed optimizing. At this point the GC time was about 2% of the overall running time; Grzegorz had done a good job of using strict versions of functions where necessary. The timing per expression, however, showed something useful: 70% of the time was being spent determining membership in the known-words dictionary (Data.Map). There are about 30 000 items in the dictionary.
What are the keys in your Map? Strict ByteStrings? And you're using ghc 6.10.x? -- Don

What are the keys in your Map? Strict ByteStrings?
And you're using ghc 6.10.x?
For the record, I use 6.10.1 and strict ByteString everywhere now. I used to have some lazy IO with vanilla strings, but switching to Data.ByteString.Char8.readFile didn't change the time at all. The big.txt is about 6megs. Profiling output says there are about 100 000 000 entries for the lookup function. http://patch-tag.com/r/spellcorrect/snapshot/current/content/pretty
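(The entries count and the per-cost-centre percentages come from GHC's time profiler; with 6.10 something along these lines produces the .prof file - the driver name is the one from the repository, and the flags are the standard profiling flags, nothing specific to this program:)

ghc --make -O2 -prof -auto-all eval-bytestring.hs
./eval-bytestring +RTS -p

The run then leaves an eval-bytestring.prof file with the %time, %alloc and entries columns quoted above.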

Hello Kamil, Monday, June 22, 2009, 12:01:40 AM, you wrote:
Right... Python uses hashtables while here I have a tree with log n
you can try this pure hashtable approach:

import Prelude hiding (lookup)
import qualified Data.HashTable
import Data.Array
import qualified Data.List as List

data HT a b = HT (a->Int) (Array Int [(a,b)])

-- size is the size of array (we implement closed hash)
-- hash is the hash function (a->Int)
-- list is assoclist of items to put in hash
create size hash list = HT hashfunc
                           (accumArray (flip (:)) [] (0, arrsize-1)
                                       (map (\(a,b) -> (hashfunc a,b))
                                            list) )
  where
    arrsize    = head$ filter (>size)$ iterate (\x->3*x+1) 1
    hashfunc a = hash a `mod` arrsize

lookup a (HT hash arr) = List.lookup a (arr!hash a)

main = do let assoclist = [("one", 1), ("two", 2), ("three", 3)]
              hash = create 10 (fromEnum . Data.HashTable.hashString) assoclist
          print (lookup "one" hash)
          print (lookup "zero" hash)

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Jun 22, 6:46 am, Bulat Ziganshin
Hello Kamil,
Monday, June 22, 2009, 12:01:40 AM, you wrote:
Right... Python uses hashtables while here I have a tree with log n
you can try this pure hashtable approach:
import Prelude hiding (lookup)
import qualified Data.HashTable
import Data.Array
import qualified Data.List as List

data HT a b = HT (a->Int) (Array Int [(a,b)])

-- size is the size of array (we implement closed hash)
-- hash is the hash function (a->Int)
-- list is assoclist of items to put in hash
create size hash list = HT hashfunc
                           (accumArray (flip (:)) [] (0, arrsize-1)
                                       (map (\(a,b) -> (hashfunc a,b))
                                            list) )
  where
    arrsize    = head$ filter (>size)$ iterate (\x->3*x+1) 1
    hashfunc a = hash a `mod` arrsize

lookup a (HT hash arr) = List.lookup a (arr!hash a)

main = do let assoclist = [("one", 1), ("two", 2), ("three", 3)]
              hash = create 10 (fromEnum . Data.HashTable.hashString) assoclist
          print (lookup "one" hash)
          print (lookup "zero" hash)
It does not compile:

No instance for (Num (String, b))
  arising from the literal `3' at foo.hs:23:61
Possible fix: add an instance declaration for (Num (String, b))
In the expression: 3
In the expression: ("three", 3)
In the expression: [("one", 1), ("two", 2), ("three", 3)]

Am Montag 22 Juni 2009 21:31:50 schrieb Kamil Dworakowski:
On Jun 22, 6:46 am, Bulat Ziganshin
wrote: Hello Kamil,
Monday, June 22, 2009, 12:01:40 AM, you wrote:
Right... Python uses hashtables while here I have a tree with log n
you can try this pure hashtable approach:
import Prelude hiding (lookup)
import qualified Data.HashTable
import Data.Array
import qualified Data.List as List

data HT a b = HT (a->Int) (Array Int [(a,b)])

-- size is the size of array (we implement closed hash)
-- hash is the hash function (a->Int)
-- list is assoclist of items to put in hash
create size hash list = HT hashfunc
                           (accumArray (flip (:)) [] (0, arrsize-1)
                                       (map (\(a,b) -> (hashfunc a,b))

Typo: should be

map (\(a,b) -> (hashfunc a, (a,b)))

                                            list) )
  where
    arrsize    = head$ filter (>size)$ iterate (\x->3*x+1) 1
    hashfunc a = hash a `mod` arrsize

lookup a (HT hash arr) = List.lookup a (arr!hash a)

main = do let assoclist = [("one", 1), ("two", 2), ("three", 3)]
              hash = create 10 (fromEnum . Data.HashTable.hashString) assoclist
          print (lookup "one" hash)
          print (lookup "zero" hash)
It does not compile:
No instance for (Num (String, b))
  arising from the literal `3' at foo.hs:23:61
Possible fix: add an instance declaration for (Num (String, b))
In the expression: 3
In the expression: ("three", 3)
In the expression: [("one", 1), ("two", 2), ("three", 3)]

On Jun 22, 9:06 pm, Daniel Fischer
Am Montag 22 Juni 2009 21:31:50 schrieb Kamil Dworakowski:
On Jun 22, 6:46 am, Bulat Ziganshin
wrote: Hello Kamil,
Monday, June 22, 2009, 12:01:40 AM, you wrote:
Right... Python uses hashtables while here I have a tree with log n
you can try this pure hashtable approach:
import Prelude hiding (lookup)
import qualified Data.HashTable
import Data.Array
import qualified Data.List as List

data HT a b = HT (a->Int) (Array Int [(a,b)])

-- size is the size of array (we implement closed hash)
-- hash is the hash function (a->Int)
-- list is assoclist of items to put in hash
create size hash list = HT hashfunc
                           (accumArray (flip (:)) [] (0, arrsize-1)
                                       (map (\(a,b) -> (hashfunc a,b))
Typo: should be
map (\(a,b) -> (hashfunc a, (a,b)))
Wait! Have you typed that definition into the msg off the top of your head? :) I went back to using Strings instead of ByteStrings and with that hashtable the program finishes in 31.5s! w00t!
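For reference, here is Bulat's sketch with Daniel's correction applied, as one self-contained file. It assumes the Data.HashTable.hashString that shipped with GHC 6.10-era base; the type signatures are added only for readability.

import Prelude hiding (lookup)
import qualified Data.HashTable
import Data.Array
import qualified Data.List as List

-- A pure, closed hash table: an array of buckets, each an association list.
data HT a b = HT (a -> Int) (Array Int [(a,b)])

-- size: expected number of items; hash: the hash function; list: assocs to store
create :: Int -> (a -> Int) -> [(a, b)] -> HT a b
create size hash list = HT hashfunc table
  where
    arrsize    = head $ filter (> size) $ iterate (\x -> 3*x + 1) 1
    hashfunc a = hash a `mod` arrsize
    table      = accumArray (flip (:)) [] (0, arrsize - 1)
                            [ (hashfunc a, (a, b)) | (a, b) <- list ]  -- Daniel's fix

lookup :: Eq a => a -> HT a b -> Maybe b
lookup a (HT hash arr) = List.lookup a (arr ! hash a)

main :: IO ()
main = do
  let assoclist = [("one", 1), ("two", 2), ("three", 3 :: Int)]
      table     = create 10 (fromEnum . Data.HashTable.hashString) assoclist
  print (lookup "one"  table)   -- Just 1
  print (lookup "zero" table)   -- Nothing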

Hello Kamil, Tuesday, June 23, 2009, 12:54:49 AM, you wrote:
I went back to using Strings instead of ByteStrings and with that hashtable the program finishes in 31.5s! w00t!
and GC times are? also, try ByteString+HT, it should be pretty easy to write hashByteString -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Jun 22, 10:12 pm, Bulat Ziganshin
Hello Kamil,
Tuesday, June 23, 2009, 12:54:49 AM, you wrote:
I went back to using Strings instead of ByteStrings and with that hashtable the program finishes in 31.5s! w00t!
and GC times are? also, try ByteString+HT, it should be pretty easy to write hashByteString
GC time is 10%. I'll try ByteString+HT tonight. It might indeed be faster, because BloomFilter+String is slower than BloomFilter+ByteString.
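A possible hashByteString along the lines Bulat suggests could be just a strict fold over the bytes - a djb2-style mix, shown here purely as an illustration rather than as what the program ended up using:

import qualified Data.ByteString.Char8 as B
import Data.Char (ord)

-- djb2-style hash over a strict ByteString: h := h*33 + byte
hashByteString :: B.ByteString -> Int
hashByteString = B.foldl' (\h c -> h * 33 + ord c) 5381

Plugged into the earlier create, that would be something like create 30000 hashByteString assocs.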

Am Montag 22 Juni 2009 22:54:49 schrieb Kamil Dworakowski:
Wait! Have you typed that definition into the msg off the top of your head? :)
No, took a bit of looking.
I went back to using Strings instead of ByteStrings and with that hashtable the program finishes in 31.5s! w00t!
Nice :D

Kamil Dworakowski
Right... Python uses hashtables while here I have a tree with log n access time. I did not want to use the Data.HashTable, it would pervade my program with IO. The alternative is an ideal hashmap that never gets changed. This program creates a dictionary at start which then is only being used to read from: an ideal application for the Data.PerfectHash by Mark Wotton available on Hackage [3].
If you are considering alternative data structures, you might want to look at tries or Bloom filters, both have O(n) lookup, both have Haskell implementations. The latter is probably faster but probabilistic (i.e. it will occasionally fail to detect a misspelling - which you can of course check against a "real" dictionary). -k -- If I haven't seen further, it is by standing in the footprints of giants

On Mon, Jun 22, 2009 at 10:10 AM, Ketil Malde
Kamil Dworakowski
writes: Right... Python uses hashtables while here I have a tree with log n access time. I did not want to use the Data.HashTable, it would pervade my program with IO. The alternative is an ideal hashmap that never gets changed. This program creates a dictionary at start which then is only being used to read from: an ideal application for the Data.PerfectHash by Mark Wotton available on Hackage [3].
If you are considering alternative data structures, you might want to look at tries or Bloom filters, both have O(n) lookup, both have Haskell implementations. The latter is probably faster but probabilistic (i.e. it will occasionally fail to detect a misspelling - which you can of course check against a "real" dictionary).
Typo? Bloom filters have O(1) lookup and tries O(m) lookup where m is the number of characters in the string. Cheers, Johan

Johan Tibell
Typo? Bloom filters have O(1) lookup and tries O(m) lookup where m is the number of characters in the string.
Typically you need to examine the (whole) search string in order to compute the hash function, so I think it is fair to consider them both O(m). (Sorry about the alphabet confusion, I should of course have made it clear that I referred to search pattern size, not the data size.) -k -- If I haven't seen further, it is by standing in the footprints of giants

On Mon, Jun 22, 2009 at 12:05 PM, Ketil Malde
Johan Tibell
writes: Typo? Bloom filters have O(1) lookup and tries O(m) lookup where m is the number of characters in the string.
Typically you need to examine the (whole) search string in order to compute the hash function, so I think it is fair to consider them both O(m).
Very true.

On Jun 22, 9:10 am, Ketil Malde
Kamil Dworakowski
writes: Right... Python uses hashtables while here I have a tree with log n access time. I did not want to use the Data.HashTable, it would pervade my program with IO. The alternative is an ideal hashmap that never gets changed. This program creates a dictionary at start which then is only being used to read from: an ideal application for the Data.PerfectHash by Mark Wotton available on Hackage [3].
If you are considering alternative data structures, you might want to look at tries or Bloom filters, both have O(n) lookup, both have Haskell implementations. The latter is probably faster but probabilistic (i.e. it will occasionally fail to detect a misspelling - which you can of course check against a "real" dictionary).
Using Bryan O'Sullivan's fantastic BloomFilter I got it down below Python's run time! Now it is 35.56s; 28% of the time is spent on GC, which I think means there is still some room for improvement.

Are you saying that PerfectHash runs with a penalty from calling into a C library?
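For reference, the known-word test with a Bloom filter can be as small as the sketch below. It assumes the 1.x API of Bryan O'Sullivan's bloomfilter package (Data.BloomFilter.Easy.easyList and elemB, with a Hashable instance for strict ByteString); later versions renamed some of these, and the 1% false-positive rate is an arbitrary illustrative choice.

import qualified Data.BloomFilter.Easy as BF
import qualified Data.ByteString.Char8 as B

-- Build a Bloom filter over the dictionary words and return a membership test.
mkKnown :: [B.ByteString] -> (B.ByteString -> Bool)
mkKnown ws = \w -> w `BF.elemB` bloom
  where bloom = BF.easyList 0.01 ws

A false positive here only means an unknown word gets treated as known, which the cross-check against a real dictionary that Ketil mentions would catch.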

kamil:
On Jun 22, 9:10 am, Ketil Malde
wrote: Kamil Dworakowski
writes: Right... Python uses hashtables while here I have a tree with log n access time. I did not want to use the Data.HashTable, it would pervade my program with IO. The alternative is an ideal hashmap that never gets changed. This program creates a dictionary at start which then is only being used to read from: an ideal application for the Data.PerfectHash by Mark Wotton available on Hackage [3].
If you are considering alternative data structures, you might want to look at tries or Bloom filters, both have O(n) lookup, both have Haskell implementations. The latter is probably faster but probabilistic (i.e. it will occasionally fail to detect a misspelling - which you can of course check against a "real" dictionary).
Using Bryan O'Sullivan's fantastic BloomFilter I got it down below Python's run time! Now it is 35.56s, 28% of the time is spent on GC, which I think means there is still some room for improvement.
One easy way to fix the GC time is to increase the default heap size. ./a.out +RTS -A200M for example.

Hello Don, Tuesday, June 23, 2009, 1:22:46 AM, you wrote:
One easy way to fix the GC time is to increase the default heap size.
./a.out +RTS -A200M
To be exact, -A isn't a heap size - it's the allocation area size, which determines how often minor (young-generation) collections are performed. By default a collection is performed every 512 KB, a value chosen to fit CPUs' L2 caches.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Using Bryan O'Sullivan's fantastic BloomFilter I got it down below Python's run time! Now it is 35.56s, 28% of the time is spent on GC, which I think means there is still some room for improvement.
One easy way to fix the GC time is to increase the default heap size.
./a.out +RTS -A200M
It does bring GC down to only 1.4% of the run time, but it increases the overall run time by 14s.

Hello Kamil, Tuesday, June 23, 2009, 11:17:43 AM, you wrote:
One easy way to fix the GC time is to increase the default heap size.
./a.out +RTS -A200M
It does bring GC down to only 1.4% of the run time, but it increases the overall run time by 14s.
Not surprising - you lose L2 cache locality. Try using an -A size equal to your L2 cache size.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Jun 23, 9:35 am, Bulat Ziganshin
Hello Kamil,
Tuesday, June 23, 2009, 11:17:43 AM, you wrote:
One easy way to fix the GC time is to increase the default heap size.
./a.out +RTS -A200M
It does bring GC down to only 1.4% of the run time, but it increases the overall run time by 14s.
Not surprising - you lose L2 cache locality. Try using an -A size equal to your L2 cache size.
I have a dual core with 3 MB of L2 cache, but setting this value to anything more than 1M slows down the program.
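For anyone repeating these -A experiments, the GC share can be read off the runtime's own summary rather than the profiler; something along these lines, with the flag value illustrative and eval-bytestring standing in for whatever driver is being timed:

./eval-bytestring +RTS -A2M -sstderr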

Hey, you're using String I/O!
nWORDS <- fmap (train . map B.pack . words) (readFile "big.txt")
This should be
nWORDS <- fmap (train . B.words) (B.readFile "big.txt")
By the way, which exact file do you use as a misspellings file? The corpus linked to at Norvig's page has many.
And do you have a driver program that I could run and obtain your timings?
2009/6/22 Kamil Dworakowski
Hi all,
I want to learn how to optimize Haskell code on the Norvig's spelling correction program [1]. Peter Norvig has written the program in Python, quite a literate translation to Haskell (thanks to Grzegorz Chrupala) [2] was also available.
Both versions are about 23 LOCs, and the Haskell version is four times slower, 2m25s versus 36s. All the numbers I give come from running the program on a list of 806 misspellings, with the Haskell version compiled with the -O2 and -fforce-recomp flags.
I started with a trial-and-error approach. Just preventing a repetitive call to keysSet brought the running time down to 1m48s. I also restructured the edits1 function and dropped all uses of sets, which hadn't been pulling their own weight. This brought the running time down to 55s. Not quite there yet, but close.
Reading the Profiling chapter in the RWH book enabled me to do some more informed optimizing. At this point the GC time was about 2% of the overall running time; Grzegorz had done a good job of using strict versions of functions where necessary. The timing per expression, however, showed something useful: 70% of the time was being spent determining membership in the known-words dictionary (Data.Map). There are about 30 000 items in the dictionary.
Right... Python uses hashtables while here I have a tree with log n access time. I did not want to use the Data.HashTable, it would pervade my program with IO. The alternative is an ideal hashmap that never gets changed. This program creates a dictionary at start which then is only being used to read from: an ideal application for the Data.PerfectHash by Mark Wotton available on Hackage [3].
The PerfectHash is a dictionary keyed by ByteStrings, so I had to migrate the program to use these instead of Strings. Sadly it increased the running time to 59s. Adding PerfectHash brought the running time down to 39s. I have done some code restructuring with the view to publishing it, which brought the running time up to 45s. The code is available on patch-tag [4] for anyone to scrutinize and hopefully point out mistakes (it uses big.txt and misspellings file which are linked to from Norvig's page [1]).
At this point [5] I don't know how to proceed. The program is still slower than Python. The -ddump-simpl dump does not give me any clues, there is too much of it and I don't understand most of it. The GC time is about 10% of the total, and according to the profiler almost half the time is spent in the member function, and almost all the rest in edits1. Shouldn't it run much faster than the interpreted Python?
Frankly, I am not impressed with the PerfectHash performance; it looks as if it is only about two times faster than Data.Map. Is this because of the number of elements?
1. http://www.norvig.com/spell-correct.html
2. http://pitekus.blogspot.com/2007/04/norvigs-spelling-corrector-in-haskell.ht...
3. http://hackage.haskell.org/package/PerfectHash
4. http://patch-tag.com/r/spellcorrect/
5. http://patch-tag.com/r/spellcorrect/snapshot/hash/20090621192727-a99e5-fed48...
-- Eugene Kirpichov Web IR developer, market.yandex.ru

On Jun 22, 10:03 am, Eugene Kirpichov
Hey, you're using String I/O!
nWORDS <- fmap (train . map B.pack . words) (readFile "big.txt")
This should be
WORDS <- fmap (train . B.words) (B.readFile "big.txt")
By the way, which exact file do you use as a misspellings file? The corpus linked to at Norvig's page has many. And do you have a driver program that I could run and obtain your timings?
Yep, Don pointed that out and I have changed the program accordingly. It didn't make any difference though. The time spent on building the dictionary is a small portion of the overall run time.

Please see the repository contents for the current version of the program: http://patch-tag.com/r/spellcorrect/snapshot/current/content/pretty

The eval-bytestring.hs there is the program I used for timing. Inside of it you will find the name of the misspellings file needed.

Thanks all for the suggestions. I'll try them when I get home tonight.

-- Kamil Dworakowski

Kamil Dworakowski wrote:
On Jun 22, 10:03 am, Eugene Kirpichov
wrote: Hey, you're using String I/O!
nWORDS <- fmap (train . map B.pack . words) (readFile "big.txt")
This should be
nWORDS <- fmap (train . B.words) (B.readFile "big.txt")
By the way, which exact file do you use as a misspellings file? The corpus linked to at Norvig's page has many. And do you have a driver program that I could run and obtain your timings?
Yep, Don pointed that out and I have changed the program accordingly. It didn't make any difference though. The time spent on building the dictionary is a small portion of the overall run time.
Please see the repository contents for the current version of the program: http://patch-tag.com/r/spellcorrect/snapshot/current/content/pretty
The eval-bytestring.hs there is the program I used for timing. Inside of it you will find the name of the misspellings file needed.
Thanks all for the suggestions. I'll try them when I get home tonight.
Another suggestion, is that you should try to make sure that the lists constructed by getCommonSpellingMistakesWithCorrections get optimized away. As written I'm not sure there will be sufficient inlining to ensure that. If you want to be explicit about removing them, something like the following should help:
module Main where

import Prelude hiding (words)
import SpellingCorrection
import qualified Data.ByteString.Char8 as B
import Data.Char
import Data.IORef
import Control.Monad (forM_)

main = do
    corrector <- getCorrector
    misspell_corpus <- B.readFile "FAWTHROP1DAT.643"
    n     <- newIORef (0::Int)
    wrong <- newIORef (0::Int)
    forM_ (B.lines misspell_corpus) $ \line -> do
        modifyIORef' n (1+)
        let [ms,c] = map (B.map toLower) . B.words $ line
        if corrector ms /= c
            then modifyIORef' wrong (1+)
            else return ()
    accuracy <- do
        n' <- readIORef n
        w' <- readIORef wrong
        return $! 100 * fromIntegral w' / fromIntegral n'
    putStrLn $ "accuracy " ++ show (100 - accuracy) ++ "%"

modifyIORef' :: IORef a -> (a -> a) -> IO ()
modifyIORef' r f = readIORef r >>= (writeIORef r $!) . f
{-# INLINE modifyIORef' #-}
-- Live well, ~wren

Kamil Dworakowski wrote:
The timing per expression however showed something useful, 70% of the time was being spent around determining membership in the known words dictionary (Data.Map). There are about 30 000 items in the dictionary.
Right... Python uses hashtables while here I have a tree with log n access time. I did not want to use the Data.HashTable, it would pervade my program with IO. The alternative is an ideal hashmap that never gets changed. This program creates a dictionary at start which then is only being used to read from: an ideal application for the Data.PerfectHash by Mark Wotton available on Hackage [3].
The PerfectHash is a dictionary keyed by ByteStrings, so I had to migrate the program to use these instead of Strings. Sadly it increased the running time to 59s. Adding PerfectHash brought the running time down to 39s. I have done some code restructuring with the view to publishing it, which brought the running time up to 45s. The code is available on patch-tag [4] for anyone to scrutinize and hopefully point out mistakes (it uses big.txt and misspellings file which are linked to from Norvig's page [1]).
You should also consider using Data.Trie from the bytestring-trie package. It also uses ByteStrings as the keys, but it uses a big-endian patricia tree instead of a hash. This removes Data.Map's duplication of effort for comparing common prefixes of strings. Constructing a Data.Trie can be somewhat slower than constructing a Data.Map ByteString, but the lookup is much faster. If your misspellings file changes infrequently and you're not afraid of utilities, you can pre-compile the Data.Trie and read it in with Data.Binary. If it changes really infrequently and you're not afraid of recompilation, you can use Template Haskell to store the pre-compiled form directly in the executable.
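A rough sketch of the Data.Trie approach wren describes, assuming bytestring-trie's Data.Trie.fromList and Data.Trie.member; the Data.Binary and Template Haskell pre-compilation variants are left out:

import qualified Data.Trie as T
import qualified Data.ByteString.Char8 as B

-- Dictionary as a big-endian patricia trie keyed by ByteString.
buildDict :: [B.ByteString] -> T.Trie Int
buildDict ws = T.fromList [ (w, 1) | w <- ws ]  -- a real build would accumulate counts

knownWord :: T.Trie Int -> B.ByteString -> Bool
knownWord dict w = w `T.member` dict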
Frankly, I am not impressed with the PerfectHash performance; It looks as if it is only two times faster than Data.Map. Is this because of the number of elements?
Perfection does have its cost. I haven't played with Data.PerfectHash, but if the hashing function is slow then it will kill your performance. In general hashing will do too much work because it looks at the whole ByteString, which is usually unnecessary (though this extra work can be less than the cost of putting the work off, depending on the dataset).

... Though honestly, most spellcheckers are more interested in optimizing for space rather than optimizing for time. For this, you'll want to look into something like Golomb codes (that is, Huffman codes for exponentially distributed infinite alphabets) as was used in the Unix "spell" utility back in the day. Once you've defined a prefix-free code for all possible words, you can encode each word and pack the result into a ByteString (treated as an array of bits) and hand it off to Data.Trie, which will do the right thing.

Depending on the size of your data, you may want to squash the Data.Trie down into an array and use an algorithm rather than the data structure to do your search (just as binary search algorithms on arrays actualize data structures like Data.Map (or data structures like Data.Map reify binary search algorithms, if you prefer)).

-- Live well, ~wren
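To make the Golomb-coding idea a little more concrete, here is a toy Rice coder (a Golomb code whose parameter is a power of two, m = 2^k): the quotient goes out in unary, the remainder in k bits. This only sketches the code itself, not how spell(1) actually packed its dictionary:

-- Rice code: encode a non-negative Int as quotient-in-unary followed by
-- the remainder in k bits (big-endian).
riceEncode :: Int -> Int -> [Bool]
riceEncode k n = replicate q True ++ [False] ++ remainderBits
  where
    (q, r)        = n `divMod` (2 ^ k)
    remainderBits = [ odd (r `div` 2 ^ i) | i <- [k - 1, k - 2 .. 0] ]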
participants (8): Bulat Ziganshin, Daniel Fischer, Don Stewart, Eugene Kirpichov, Johan Tibell, Kamil Dworakowski, Ketil Malde, wren ng thornton