
Kamil Dworakowski wrote:
The per-expression timing, however, showed something useful: 70% of the time was being spent determining membership in the known-words dictionary (a Data.Map). There are about 30,000 items in the dictionary.
Right... Python uses hash tables, while here I have a tree with O(log n) access time. I did not want to use Data.HashTable, as it would pervade my program with IO. The alternative is a perfect hash that never gets changed. This program creates a dictionary at start-up which is then only read from: an ideal application for Data.PerfectHash by Mark Wotton, available on Hackage [3].
PerfectHash is a dictionary keyed by ByteStrings, so I had to migrate the program to use those instead of Strings. Sadly, that alone increased the running time to 59s. Adding PerfectHash then brought the running time down to 39s. I have since done some code restructuring with a view to publishing it, which brought the running time back up to 45s. The code is available on patch-tag [4] for anyone to scrutinize and, hopefully, point out mistakes (it uses big.txt and the misspellings file, both linked from Norvig's page [1]).
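For context, here is a minimal sketch of the membership test that the profile blamed; the names and the word-to-frequency shape are assumptions for illustration, not taken from Kamil's code. With String keys, every Map.member call performs O(log n) full string comparisons.

    import qualified Data.Map as Map

    -- Assumed shape of the known-words dictionary: word -> frequency.
    type Dictionary = Map.Map String Int

    -- The hot path from the profile: test each candidate correction for
    -- membership. Each Map.member call costs O(log n) String comparisons,
    -- and every comparison walks the two strings character by character.
    known :: Dictionary -> [String] -> [String]
    known dict = filter (`Map.member` dict)

    main :: IO ()
    main = do
      let dict = Map.fromList [("the", 80030), ("quick", 95), ("brown", 92)]
      print (known dict ["teh", "quick", "brwon", "the"])  -- ["quick","the"]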
You should also consider using Data.Trie from the bytestring-trie package. It also uses ByteStrings as the keys, but it uses a big-endian patricia tree instead of a hash. This removes Data.Map's duplication of effort for comparing common prefixes of strings. Constructing a Data.Trie can be somewhat slower than constructing a Data.Map ByteString, but the lookup is much faster. If your misspellings file changes infrequently and you're not afraid of utilities, you can pre-compile the Data.Trie and read it in with Data.Binary. If it changes really infrequently and you're not afraid of recompilation, you can use Template Haskell to store the pre-compiled form directly in the executable.
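A rough sketch of that suggestion, assuming a word-to-frequency trie. It uses bytestring-trie's fromList/member/lookup; since Trie may not ship a Binary instance, the pre-compiled form is stored as the association list and the trie is rebuilt on load.

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.ByteString.Char8 as B
    import qualified Data.Trie             as T
    import           Data.Binary           (decodeFile, encodeFile)

    -- Build the known-words trie once; lookups then share work on common
    -- prefixes instead of re-comparing them as Data.Map does.
    mkDict :: [(B.ByteString, Int)] -> T.Trie Int
    mkDict = T.fromList

    -- "Pre-compile" the dictionary: serialize the association list with
    -- Data.Binary (ByteString, Int, pairs and lists all have instances)
    -- and rebuild the trie when loading.
    saveDict :: FilePath -> T.Trie Int -> IO ()
    saveDict path = encodeFile path . T.toList

    loadDict :: FilePath -> IO (T.Trie Int)
    loadDict path = T.fromList <$> decodeFile path

    main :: IO ()
    main = do
      let dict = mkDict [("the", 80030), ("quick", 95), ("brown", 92)]
      print (T.member "quick" dict)   -- True
      print (T.lookup "brwon" dict)   -- Nothing
      saveDict "dict.bin" dict
      dict' <- loadDict "dict.bin"
      print (T.size dict')            -- 3

The Template Haskell variant mentioned above would splice the encoded list into the executable at compile time instead of reading dict.bin at run time.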
Frankly, I am not impressed with the PerfectHash performance; it looks as if it is only about twice as fast as Data.Map. Is this because of the number of elements?
Perfection does have its cost. I haven't played with Data.PerfectHash, but if the hashing function is slow then it will kill your performance. In general hashing will do too much work because it looks at the whole ByteString, which is usually unnecessary (though this extra work can be less than the cost of putting the work off, depending on the dataset).

... Though honestly, most spellcheckers are more interested in optimizing for space rather than for time. For this, you'll want to look into something like Golomb codes (that is, Huffman codes for exponentially distributed infinite alphabets) as were used in the Unix "spell" utility back in the day. Once you've defined a prefix-free code for all possible words, you can encode each word, pack the result into a ByteString (treated as an array of bits), and hand it off to Data.Trie, which will do the right thing.

Depending on the size of your data, you may want to squash the Data.Trie down into an array and use an algorithm rather than the data structure to do your search (just as binary search algorithms on arrays actualize data structures like Data.Map, or data structures like Data.Map reify binary search algorithms, if you prefer).

--
Live well,
~wren
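To make the Golomb-coding idea concrete, here is a small illustrative sketch, not code from the thread: it assumes the words have already been hashed to integers upstream, sorts and deduplicates the hashes, and Golomb-codes the gaps between successive values. The parameter m (16 here, arbitrarily) would be tuned to roughly the mean gap, and the bits are kept as a [Bool] rather than packed into a ByteString to keep the sketch short.

    import Data.List (group, sort)

    -- Golomb code for a non-negative integer n with parameter m: the
    -- quotient in unary (q ones then a zero) followed by the remainder
    -- in truncated binary.
    golomb :: Int -> Int -> [Bool]
    golomb m n = replicate q True ++ [False] ++ truncBin r
      where
        (q, r) = n `divMod` m
        b      = ceiling (logBase 2 (fromIntegral m) :: Double) :: Int
        cutoff = 2 ^ b - m
        truncBin x
          | x < cutoff = toBits (b - 1) x
          | otherwise  = toBits b (x + cutoff)
        toBits k x = [odd (x `div` 2 ^ i) | i <- [k - 1, k - 2 .. 0]]

    -- Sort and deduplicate the word hashes, then encode the gaps between
    -- successive values; small gaps dominate, which is exactly what Golomb
    -- codes compress well.
    encodeGaps :: Int -> [Int] -> [Bool]
    encodeGaps m hs = concatMap (golomb m) (zipWith (-) sorted (0 : sorted))
      where sorted = map head (group (sort hs))

    main :: IO ()
    main = putStrLn [if bit then '1' else '0' | bit <- encodeGaps 16 [5, 23, 42, 99, 105]]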