
Kamil Dworakowski wrote:
The per-expression timing, however, showed something useful: 70% of the time was being spent determining membership in the known-words dictionary (a Data.Map). There are about 30,000 items in the dictionary.
Right... Python uses hash tables, while here I have a tree with O(log n) access time. I did not want to use Data.HashTable, as it would pervade my program with IO. The alternative is a perfect hash that never gets changed. This program creates a dictionary at start-up which is then only read from: an ideal application for Data.PerfectHash by Mark Wotton, available on Hackage [3].
PerfectHash is a dictionary keyed by ByteStrings, so I had to migrate the program to use those instead of Strings. Sadly, that alone increased the running time to 59s. Adding PerfectHash then brought the running time down to 39s. I have since done some code restructuring with a view to publishing it, which brought the running time back up to 45s. The code is available on patch-tag [4] for anyone to scrutinize and, hopefully, point out mistakes (it uses big.txt and the misspellings file, both linked from Norvig's page [1]).
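For context, here is a minimal sketch of the membership test that the profile blamed; the names and the word-to-frequency shape are assumptions for illustration, not taken from Kamil's code. With String keys, every Map.member call performs O(log n) full string comparisons.

    import qualified Data.Map as Map

    -- Assumed shape of the known-words dictionary: word -> frequency.
    type Dictionary = Map.Map String Int

    -- The hot path from the profile: test each candidate correction for
    -- membership. Each Map.member call costs O(log n) String comparisons,
    -- and every comparison walks the two strings character by character.
    known :: Dictionary -> [String] -> [String]
    known dict = filter (`Map.member` dict)

    main :: IO ()
    main = do
      let dict = Map.fromList [("the", 80030), ("quick", 95), ("brown", 92)]
      print (known dict ["teh", "quick", "brwon", "the"])  -- ["quick","the"]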
You should also consider using Data.Trie from the bytestring-trie package. It also uses ByteStrings as the keys, but it uses a big-endian patricia tree instead of a hash. This removes Data.Map's duplication of effort for comparing common prefixes of strings. Constructing a Data.Trie can be somewhat slower than constructing a Data.Map ByteString, but the lookup is much faster. If your misspellings file changes infrequently and you're not afraid of utilities, you can pre-compile the Data.Trie and read it in with Data.Binary. If it changes really infrequently and you're not afraid of recompilation, you can use Template Haskell to store the pre-compiled form directly in the executable.
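A rough sketch of that suggestion, assuming a word-to-frequency trie. It uses bytestring-trie's fromList/member/lookup; since Trie may not ship a Binary instance, the pre-compiled form is stored as the association list and the trie is rebuilt on load.

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.ByteString.Char8 as B
    import qualified Data.Trie             as T
    import           Data.Binary           (decodeFile, encodeFile)

    -- Build the known-words trie once; lookups then share work on common
    -- prefixes instead of re-comparing them as Data.Map does.
    mkDict :: [(B.ByteString, Int)] -> T.Trie Int
    mkDict = T.fromList

    -- "Pre-compile" the dictionary: serialize the association list with
    -- Data.Binary (ByteString, Int, pairs and lists all have instances)
    -- and rebuild the trie when loading.
    saveDict :: FilePath -> T.Trie Int -> IO ()
    saveDict path = encodeFile path . T.toList

    loadDict :: FilePath -> IO (T.Trie Int)
    loadDict path = T.fromList <$> decodeFile path

    main :: IO ()
    main = do
      let dict = mkDict [("the", 80030), ("quick", 95), ("brown", 92)]
      print (T.member "quick" dict)   -- True
      print (T.lookup "brwon" dict)   -- Nothing
      saveDict "dict.bin" dict
      dict' <- loadDict "dict.bin"
      print (T.size dict')            -- 3

The Template Haskell variant mentioned above would splice the encoded list into the executable at compile time instead of reading dict.bin at run time.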
Frankly, I am not impressed with the PerfectHash performance; it looks as if it is only about twice as fast as Data.Map. Is this because of the number of elements?
Perfection does have its cost. I haven't played with Data.PerfectHash, but if the hashing function is slow then it will kill your performance. In general hashing will do too much work because it looks at the whole ByteString, which is usually unnecessary (though this extra work can be less than the cost of putting the work off, depending on the dataset).

... Though honestly, most spellcheckers are more interested in optimizing for space rather than for time. For this, you'll want to look into something like Golomb codes (that is, Huffman codes for exponentially distributed infinite alphabets) as were used in the Unix "spell" utility back in the day. Once you've defined a prefix-free code for all possible words, you can encode each word, pack the result into a ByteString (treated as an array of bits), and hand it off to Data.Trie, which will do the right thing.

Depending on the size of your data, you may want to squash the Data.Trie down into an array and use an algorithm rather than the data structure to do your search (just as binary search algorithms on arrays actualize data structures like Data.Map, or data structures like Data.Map reify binary search algorithms, if you prefer).

--
Live well,
~wren
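To make the Golomb-coding idea concrete, here is a small illustrative sketch, not code from the thread: it assumes the words have already been hashed to integers upstream, sorts and deduplicates the hashes, and Golomb-codes the gaps between successive values. The parameter m (16 here, arbitrarily) would be tuned to roughly the mean gap, and the bits are kept as a [Bool] rather than packed into a ByteString to keep the sketch short.

    import Data.List (group, sort)

    -- Golomb code for a non-negative integer n with parameter m: the
    -- quotient in unary (q ones then a zero) followed by the remainder
    -- in truncated binary.
    golomb :: Int -> Int -> [Bool]
    golomb m n = replicate q True ++ [False] ++ truncBin r
      where
        (q, r) = n `divMod` m
        b      = ceiling (logBase 2 (fromIntegral m) :: Double) :: Int
        cutoff = 2 ^ b - m
        truncBin x
          | x < cutoff = toBits (b - 1) x
          | otherwise  = toBits b (x + cutoff)
        toBits k x = [odd (x `div` 2 ^ i) | i <- [k - 1, k - 2 .. 0]]

    -- Sort and deduplicate the word hashes, then encode the gaps between
    -- successive values; small gaps dominate, which is exactly what Golomb
    -- codes compress well.
    encodeGaps :: Int -> [Int] -> [Bool]
    encodeGaps m hs = concatMap (golomb m) (zipWith (-) sorted (0 : sorted))
      where sorted = map head (group (sort hs))

    main :: IO ()
    main = putStrLn [if bit then '1' else '0' | bit <- encodeGaps 16 [5, 23, 42, 99, 105]]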