
Hey, you're using String I/O!
nWORDS <- fmap (train . map B.pack . words) (readFile "big.txt")
This should be
nWORDS <- fmap (train . B.words) (B.readFile "big.txt")
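To be concrete, a minimal self-contained version of that setup looks like this (train here is my guess at a plain frequency count over Data.Map; the real one presumably also lowercases and strips punctuation):

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M

-- Word-frequency map built from the training corpus.
train :: [B.ByteString] -> M.Map B.ByteString Int
train = M.fromListWith (+) . map (\w -> (w, 1))

main :: IO ()
main = do
  nWORDS <- fmap (train . B.words) (B.readFile "big.txt")
  print (M.size nWORDS)

B.words works directly on the ByteString, so no intermediate String list is ever built.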
By the way, which exact file do you use as a misspellings file? The
corpus linked to at Norvig's page has many.
And do you have a driver program that I could run and obtain your timings?
2009/6/22 Kamil Dworakowski
Hi all,
I want to learn how to optimize Haskell code, using Norvig's spelling correction program [1] as an exercise. Peter Norvig wrote the program in Python; a fairly literal translation to Haskell (thanks to Grzegorz Chrupala) [2] is also available.
Both versions are about 23 LOC, and the Haskell version is four times slower: 2m25s versus 36s. All the numbers I give come from running the program on a list of 806 misspellings, with the Haskell version compiled with the -O2 and -fforce-recomp flags.
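For concreteness, the build command was of this form (spell.hs is a placeholder for whatever the main module file is called):

ghc --make -O2 -fforce-recomp spell.hs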
I started with a trial-and-error approach. Just hoisting a repeated call to keysSet brought the running time down to 1m48s. I also restructured the edits1 function and dropped all uses of sets, which weren't pulling their weight; that brought the running time down to 55s. Not quite there yet, but close.
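A list-only edits1 in the spirit of Norvig's Python looks roughly like this (sketched over String for readability; the actual code works on ByteStrings, and this is a sketch rather than the exact code in the repository):

-- All strings one deletion, transposition, replacement,
-- or insertion away from the input word.
edits1 :: String -> [String]
edits1 word = deletes ++ transposes ++ replaces ++ inserts
  where
    splits     = [ splitAt i word | i <- [0 .. length word] ]
    deletes    = [ a ++ bs     | (a, _:bs)   <- splits ]
    transposes = [ a ++ c:b:bs | (a, b:c:bs) <- splits ]
    replaces   = [ a ++ c:bs   | (a, _:bs)   <- splits, c <- alphabet ]
    inserts    = [ a ++ c:b    | (a, b)      <- splits, c <- alphabet ]
    alphabet   = ['a' .. 'z']

Plain list comprehensions are enough here; the results get filtered against the dictionary anyway, so deduplicating through a set buys nothing.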
Reading the Profiling chapter in the Real World Haskell book let me optimize in a more informed way. At this point GC time was about 2% of the overall running time; Grzegorz had done a good job of using strict versions of functions where necessary. The per-expression timings, however, showed something useful: 70% of the time was spent determining membership in the known-words dictionary (a Data.Map). There are about 30,000 items in the dictionary.
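For anyone who wants to reproduce the profile: build with cost-centre profiling and run with the -p RTS option (flags as of GHC 6.x; spell.hs is again a placeholder):

ghc --make -O2 -prof -auto-all spell.hs
./spell +RTS -p -RTS

The per-cost-centre time and allocation report then lands in spell.prof.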
Right... Python uses hash tables, while here I have a tree with O(log n) access time. I did not want to use Data.HashTable, as it would pervade my program with IO. The alternative is a perfect hash map that never changes. This program builds a dictionary at start-up which is afterwards only read from: an ideal application for Data.PerfectHash by Mark Wotton, available on Hackage [3].
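The nice property is that after start-up the program only ever asks one question of the dictionary: is this word known? So the whole lookup strategy can hide behind a single function. Here is the shape of that idea, with the Data.Map-backed instance spelled out (the Dict type and fromMap name are mine, for illustration; a PerfectHash-backed dictionary would just be another closure of the same type):

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M

-- After start-up the dictionary is read-only, so membership
-- testing is the only operation the rest of the program needs.
type Dict = B.ByteString -> Bool

-- Data.Map-backed dictionary.
fromMap :: M.Map B.ByteString Int -> Dict
fromMap m = \w -> M.member w m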
A PerfectHash is a dictionary keyed by ByteStrings, so I had to migrate the program to use those instead of Strings. Sadly, that alone increased the running time to 59s. Adding PerfectHash then brought it down to 39s. I have since done some code restructuring with a view to publishing it, which brought the running time back up to 45s. The code is available on patch-tag [4] for anyone to scrutinize and, hopefully, point out mistakes (it uses big.txt and the misspellings file, both linked from Norvig's page [1]).
At this point [5] I don't know how to proceed. The program is still slower than the Python version. The -ddump-simpl output gives me no clues: there is too much of it, and I don't understand most of it. GC time is about 10% of the total; according to the profiler, almost half the time is spent in the member function and almost all the rest in edits1. Shouldn't compiled Haskell run much faster than interpreted Python?
Frankly, I am not impressed with the PerfectHash performance; it appears to be only about twice as fast as Data.Map. Is this because of the number of elements?
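If someone wants to sanity-check that factor outside the full program, a crude measurement of raw membership tests using only base libraries could look like this (timeLookups is a hypothetical helper; swap the PerfectHash lookup in for M.member to get the second data point):

import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)
import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M

-- Time a batch of membership tests against the dictionary,
-- reporting CPU time in milliseconds (getCPUTime is in picoseconds).
timeLookups :: M.Map B.ByteString Int -> [B.ByteString] -> IO ()
timeLookups dict ws = do
  start <- getCPUTime
  _ <- evaluate (length (filter (`M.member` dict) ws))
  end <- getCPUTime
  putStrLn (show ((end - start) `div` (10 ^ 9)) ++ " ms")

The evaluate/length combination forces every lookup to actually run before the second timestamp is taken.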
1. http://www.norvig.com/spell-correct.html
2. http://pitekus.blogspot.com/2007/04/norvigs-spelling-corrector-in-haskell.ht...
3. http://hackage.haskell.org/package/PerfectHash
4. http://patch-tag.com/r/spellcorrect/
5. http://patch-tag.com/r/spellcorrect/snapshot/hash/20090621192727-a99e5-fed48...
--
Eugene Kirpichov
Web IR developer, market.yandex.ru