
Ketil Malde wrote:
Hm - nobody suggested using ByteStrings yet?
I wrote an independent port of Norvig's spellchecker, because I figured it would make the basis of an interesting tutorial. For such a small program, it's been quite a challenge!

I started out using lazy ByteStrings, Data.Map, and Data.Set. As Albert observed, using Data.Set is poison for heap size and performance. Switching from sets to lists cut memory usage by more than 90%, and gave a big (but unmeasured) speedup.

After this switch, I found that spellchecking one word still took 4x as long in Haskell as in Norvig's Python program. Since I was checking only one word in each case, essentially all of the runtime was taken up by building the word frequency map. In my profile results, simply converting words to lower case accounts for a whopping 40% of time and allocation (see the attachment for my definition of the train function):

COST CENTRE  MODULE  %time  %alloc
lower        Spell    40.5    41.2
train        Spell    26.3    14.3
mkWords      Spell    21.9    24.1

I was interested in building a profile-enabled version of fps to see what might be going on inside the library, but was stymied by not knowing how to get GHC 6.6's base to coexist peacefully with fps (hiding base isn't very practical).

My experiments are available here:

  darcs get http://darcs.serpentine.com/spell

Norvig's training data is available from http://norvig.com/big.txt
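For readers without the attachment, here is a minimal sketch of what such a training pipeline might look like with lazy ByteStrings and Data.Map. The names lower, mkWords, and train match the cost centres in the profile above, but the function bodies are my own reconstruction under those assumptions, not the actual code from the attachment:

```haskell
-- Sketch of a ByteString-based training step for Norvig's spellchecker.
-- The definitions below are illustrative guesses matching the cost-centre
-- names in the profile; the real attachment may differ.
import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.Map as M
import Data.Char (isAlpha, toLower)
import Data.List (foldl')

-- Lower-case the whole corpus; per the profile, a B.map like this is
-- where roughly 40% of the time and allocation goes.
lower :: B.ByteString -> B.ByteString
lower = B.map toLower

-- Split the corpus into maximal alphabetic runs, dropping empty chunks.
mkWords :: B.ByteString -> [B.ByteString]
mkWords = filter (not . B.null) . B.splitWith (not . isAlpha)

-- Build the word-frequency map with a strict left fold.
train :: B.ByteString -> M.Map B.ByteString Int
train = foldl' (\m w -> M.insertWith (+) w 1 m) M.empty . mkWords . lower

main :: IO ()
main = print (train (B.pack "The quick the THE quick"))
```

One design note: because lower is applied to the entire corpus before splitting, the whole input is rewritten character by character, which is consistent with a lowercasing cost centre dominating the profile; lowercasing each word after splitting (or folding toLower into the comparison) would be the obvious thing to try next.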