
Hey, you're using String I/O!
nWORDS <- fmap (train . map B.pack . words) (readFile "big.txt")
This should be
nWORDS <- fmap (train . B.words) (B.readFile "big.txt")
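To be concrete, a minimal self-contained version of that setup looks like this (train here is my guess at a plain frequency count over Data.Map; the real one presumably also lowercases and strips punctuation):

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M

-- Word-frequency map built from the training corpus.
train :: [B.ByteString] -> M.Map B.ByteString Int
train = M.fromListWith (+) . map (\w -> (w, 1))

main :: IO ()
main = do
  nWORDS <- fmap (train . B.words) (B.readFile "big.txt")
  print (M.size nWORDS)

B.words works directly on the ByteString, so no intermediate String list is ever built.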
By the way, which exact file do you use as a misspellings file? The
corpus linked to at Norvig's page has many.
And do you have a driver program that I could run and obtain your timings?
2009/6/22 Kamil Dworakowski
Hi all,
I want to learn how to optimize Haskell code, using Norvig's spelling correction program [1] as an exercise. Peter Norvig wrote the program in Python; a fairly literal translation to Haskell (thanks to Grzegorz Chrupala) [2] is also available.
Both versions are about 23 LOC, and the Haskell version is four times slower: 2m25s versus 36s. All the numbers I give come from running the program on a list of 806 misspellings, with the Haskell version compiled with the -O2 and -fforce-recomp flags.
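For concreteness, the build command was of this form (spell.hs is a placeholder for whatever the main module file is called):

ghc --make -O2 -fforce-recomp spell.hs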
I started with a trial-and-error approach. Just hoisting a repeated call to keysSet brought the running time down to 1m48s. I also restructured the edits1 function and dropped all uses of sets, which weren't pulling their weight; that brought the running time down to 55s. Not quite there yet, but close.
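A list-only edits1 in the spirit of Norvig's Python looks roughly like this (sketched over String for readability; the actual code works on ByteStrings, and this is a sketch rather than the exact code in the repository):

-- All strings one deletion, transposition, replacement,
-- or insertion away from the input word.
edits1 :: String -> [String]
edits1 word = deletes ++ transposes ++ replaces ++ inserts
  where
    splits     = [ splitAt i word | i <- [0 .. length word] ]
    deletes    = [ a ++ bs     | (a, _:bs)   <- splits ]
    transposes = [ a ++ c:b:bs | (a, b:c:bs) <- splits ]
    replaces   = [ a ++ c:bs   | (a, _:bs)   <- splits, c <- alphabet ]
    inserts    = [ a ++ c:b    | (a, b)      <- splits, c <- alphabet ]
    alphabet   = ['a' .. 'z']

Plain list comprehensions are enough here; the results get filtered against the dictionary anyway, so deduplicating through a set buys nothing.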
Reading the Profiling chapter in the Real World Haskell book let me optimize in a more informed way. At this point GC time was about 2% of the overall running time; Grzegorz had done a good job of using strict versions of functions where necessary. The per-expression timings, however, showed something useful: 70% of the time was spent determining membership in the known-words dictionary (a Data.Map). There are about 30,000 items in the dictionary.
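For anyone who wants to reproduce the profile: build with cost-centre profiling and run with the -p RTS option (flags as of GHC 6.x; spell.hs is again a placeholder):

ghc --make -O2 -prof -auto-all spell.hs
./spell +RTS -p -RTS

The per-cost-centre time and allocation report then lands in spell.prof.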
Right... Python uses hash tables, while here I have a tree with O(log n) access time. I did not want to use Data.HashTable, as it would pervade my program with IO. The alternative is a perfect hash map that never changes. This program builds a dictionary at start-up which is afterwards only read from: an ideal application for Data.PerfectHash by Mark Wotton, available on Hackage [3].
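The nice property is that after start-up the program only ever asks one question of the dictionary: is this word known? So the whole lookup strategy can hide behind a single function. Here is the shape of that idea, with the Data.Map-backed instance spelled out (the Dict type and fromMap name are mine, for illustration; a PerfectHash-backed dictionary would just be another closure of the same type):

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M

-- After start-up the dictionary is read-only, so membership
-- testing is the only operation the rest of the program needs.
type Dict = B.ByteString -> Bool

-- Data.Map-backed dictionary.
fromMap :: M.Map B.ByteString Int -> Dict
fromMap m = \w -> M.member w m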
A PerfectHash is a dictionary keyed by ByteStrings, so I had to migrate the program to use those instead of Strings. Sadly, that alone increased the running time to 59s. Adding PerfectHash then brought it down to 39s. I have since done some code restructuring with a view to publishing it, which brought the running time back up to 45s. The code is available on patch-tag [4] for anyone to scrutinize and, hopefully, point out mistakes (it uses big.txt and the misspellings file, both linked from Norvig's page [1]).
At this point [5] I don't know how to proceed. The program is still slower than the Python version. The -ddump-simpl output gives me no clues: there is too much of it, and I don't understand most of it. GC time is about 10% of the total; according to the profiler, almost half the time is spent in the member function and almost all the rest in edits1. Shouldn't compiled Haskell run much faster than interpreted Python?
Frankly, I am not impressed with the PerfectHash performance; it appears to be only about twice as fast as Data.Map. Is this because of the number of elements?
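If someone wants to sanity-check that factor outside the full program, a crude measurement of raw membership tests using only base libraries could look like this (timeLookups is a hypothetical helper; swap the PerfectHash lookup in for M.member to get the second data point):

import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)
import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M

-- Time a batch of membership tests against the dictionary,
-- reporting CPU time in milliseconds (getCPUTime is in picoseconds).
timeLookups :: M.Map B.ByteString Int -> [B.ByteString] -> IO ()
timeLookups dict ws = do
  start <- getCPUTime
  _ <- evaluate (length (filter (`M.member` dict) ws))
  end <- getCPUTime
  putStrLn (show ((end - start) `div` (10 ^ 9)) ++ " ms")

The evaluate/length combination forces every lookup to actually run before the second timestamp is taken.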
1. http://www.norvig.com/spell-correct.html
2. http://pitekus.blogspot.com/2007/04/norvigs-spelling-corrector-in-haskell.ht...
3. http://hackage.haskell.org/package/PerfectHash
4. http://patch-tag.com/r/spellcorrect/
5. http://patch-tag.com/r/spellcorrect/snapshot/hash/20090621192727-a99e5-fed48...
--
Eugene Kirpichov
Web IR developer, market.yandex.ru