
Thanks very much Daniel for giving my (amateurish!) exercise such an in-depth look-over. Comments inline below.

On 23/01/2010, at 12:11 AM, Daniel Fischer wrote:
On Friday 22 January 2010 07:51:27, Matthew Phillips wrote:
Hello all,
sorry to bring up an old chestnut, but I’m trying to improve my Haskell-fu by writing a small program, and chose Peter Norvig’s spelling checker as my exercise material (http://norvig.com/spell-correct.html).
While I’ve gotten it working, and found it quite illuminating, I also found it to be very slow. And then I discovered that, of course, others have been here well before me ([1] and [2]). Those discussions were very interesting, but the code they refer to is mostly not available, so the most I’ve gotten out of it so far is that:
(a) I should be using strict left folds and strict Map insertions for the word frequency map (this shaved off about a second: ~5s -> ~4s for a single word on my 2.4GHz MacBook Pro, GHC 6.10.4)
(b) I should probably be using ByteStrings
That does help, but the worst part is building the map. That takes a couple of seconds in Python, too. Just building the map takes 1.95s for Python, 3.6s (including GC) with strict ByteStrings, 4.2s with lazy ByteStrings and 6s with plain Strings here.
I get the Python version running in about 1s, compared to 5.6s for the Haskell:

$ time python spelling.py because
real  0m1.071s
user  0m0.821s
sys   0m0.139s

$ time ./spelling becuase
becuase -> because
real  0m5.589s
user  0m4.554s
sys   0m0.307s

And, strangely, the rewrite you provided (I called it "spelling_df") runs a fair bit slower:

$ time ./spelling_df becuase
becuase -> because
real  0m8.087s
user  0m6.966s
sys   0m0.193s

$ time ./spelling korrekt
korrekt -> correct
real  0m5.970s
user  0m4.885s
sys   0m0.300s

$ time ./spelling_df korrekt
korrekt -> correct
real  0m8.616s
user  0m7.538s
sys   0m0.187s
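For reference, the map-building part after change (a) (plus the switch to ByteStrings) looks roughly like the sketch below. This is from memory rather than the exact code, and "big.txt" is just the name of Norvig's training corpus:

  import qualified Data.ByteString.Char8 as B
  import qualified Data.Map as M
  import Data.Char (isAlpha, toLower)
  import Data.List (foldl')

  -- strict left fold with strict insertion, so no thunks pile up per key
  train :: [B.ByteString] -> M.Map B.ByteString Int
  train = foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty

  -- read the corpus strictly and split it into lower-cased words
  nwords :: IO (M.Map B.ByteString Int)
  nwords = fmap (train . splitWords) (B.readFile "big.txt")
    where
      splitWords = filter (not . B.null)
                 . B.splitWith (not . isAlpha)
                 . B.map toLower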
The code is at [3] (link to version at time of post). Profiling [4]
shows:
$ ./spelling becuase +RTS -p
becuase -> because

$ cat spelling.prof
total time  = 4.02 secs (201 ticks @ 20 ms)
total alloc = 1,544,257,792 bytes (excludes profiling overheads)
COST CENTRE  MODULE  %time  %alloc

train        Main     52.7    19.7
readFile     Main     28.9     8.6
wordsBy      Main     10.9    49.5
toLower      Main      7.0    21.8
...
So it appears that "train" (building the freq map) and "readFile" in "nwords" are the places to hone.
readFile does not appear in my profile.
Apologies, I should have said that I’d inserted some SCC's to try to tease out the cost of readFile (i.e. {-# SCC "readFile" #-}).
If you insert an SCC for updateMap,
where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model
, you'll see that the really bad citizen is updateMap (splitWords is rather bad, too, together they take some 95% of the time in that profile).
Maybe I'm doing this wrong, but I see "splitWords" in spelling_df taking 80% of runtime. Adding SCC's like this:

splitWords = {-# SCC "filter" #-} filter (not . B.null) .
             {-# SCC "splitWith" #-} B.splitWith isNogud .
             {-# SCC "toLower" #-} B.map toLower

gives me:

splitWords  Main  216        1   0.0   0.0  78.6  91.8
 filter     Main  217        1   1.9   3.0  78.6  91.8
  splitWith Main  218        1  28.4  36.8  76.7  88.8
   isNogud  Main  221  6488666   4.2   4.1   4.2   4.1
   toLower  Main  219        1  44.2  47.9  44.2  47.9

i.e. it seems that "splitWith" and "toLower" (!) are the culprits. Why, I have no idea. Am I reading this wrong?
But once you start needing two edits (try korrekt), correct and edits1 start to show up. That shows that Norvig's algorithm isn't really good. With two edit steps, you create a _lot_ of strings you need to look up, far more than there are in the map. That takes time. It'll be better to scan the map for entries with an edit distance (< 3) if you have a good method to check that (http://old.nabble.com/haskell-in-online-contests-td26546989.html contains pointers for that).
Will have a look at that: it looks like it'll be very informative.
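To make sure I understand the idea, the most naive version of "scan the map instead of generating edits" would be something like the sketch below, i.e. a plain single-row Levenshtein run over every key. (This is only my own strawman so I can see what is being proposed; the thread you link to presumably has a much cleverer check than this.)

  import qualified Data.Map as M
  import Data.List (foldl')

  -- textbook one-row dynamic-programming edit distance; no early cutoff
  -- at 2, so this is a baseline for the idea, not an efficient check
  editDistance :: Eq a => [a] -> [a] -> Int
  editDistance a b = last (foldl' step [0 .. length b] a)
    where
      step row@(d : ds) c = scanl next (d + 1) (zip3 b row ds)
        where
          next left (bc, diag, up) =
            minimum [left + 1, up + 1, diag + (if bc == c then 0 else 1)]
      step [] _ = []  -- never reached: the row always has at least one cell

  -- keep every known word within two edits of the query
  candidates :: String -> M.Map String Int -> [String]
  candidates w m = [k | k <- M.keys m, editDistance w k < 3]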
Another thing is
allWords = keysSet wordCounts
Ouch. For each correction, you construct that set anew. Just use member from Data.Map instead of Data.Set.member and look up the words in the map.
Whoops! Just to be clear though: Haskell will memoise the result of "allWords" for a given invocation of "correct"? So this would only make a large difference for multiple corrections? (which I wasn't worrying about for the moment). The change seems to wipe off about 0.2s on average.
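For reference, the kind of change involved looks like this (illustrative definitions only, not my exact code; "known" here filters a set of candidate words against the frequency map wordCounts):

  import qualified Data.Map as M
  import qualified Data.Set as Set

  -- before: rebuilds the whole key set from the map on every call
  -- known wordCounts = Set.filter (`Set.member` M.keysSet wordCounts)

  -- after: look the words up in the map directly
  known :: Ord a => M.Map a Int -> Set.Set a -> Set.Set a
  known wordCounts = Set.filter (`M.member` wordCounts)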
I will look at using Bloom filters or tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6MB file?
No way. But it doesn't seem to, from my GHC's point of view.
Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:

myReadFile  Main  210  1  35.8  8.6  35.8  8.6

i.e. 35.8% of runtime.
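(For completeness, "myReadFile" is just a trivial wrapper so that the cost centre covers only the file read, roughly as below, with B being Data.ByteString.Char8 as before:)

  -- trivial wrapper so the read gets its own cost centre in the profile
  myReadFile :: FilePath -> IO B.ByteString
  myReadFile path = {-# SCC "myReadFile" #-} B.readFile path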
Sorry to dump such a long post on the list — I’ll understand if no one can be bothered rehashing this. But, in summary I’d like to know:
(a) how could I use ByteStrings for this to speed up I/O and reduce memory usage without losing the nice readability?
And thanks for the other small tips (e.g. findWithDefault), and I didn't know you could use "let" the way you did either.

Cheers,

Matthew.