
Thanks very much Daniel for giving my (amateurish!) exercise such an in-depth look-over. Comments inline below.

On 23/01/2010, at 12:11 AM, Daniel Fischer wrote:
On Friday 22 January 2010 07:51:27, Matthew Phillips wrote:
Hello all,
sorry to bring up an old chestnut, but I’m trying to improve my Haskell-fu by writing a small program, and chose Peter Norvig’s spelling checker as my exercise material (http://norvig.com/spell-correct.html).
While I’ve gotten it working, and found it quite illuminating, I also found it to be very slow. And then I discovered that, of course, others have been here well before me ([1] and [2]). Those discussions were very interesting, but the code they refer to is mostly not available, so the most I’ve gotten out of it so far is that:
(a) I should be using strict left folds and strict Map insertions for the word frequency map (this shaved off about a second: ~5s -> ~4s for a single word on my 2.4GHz MacBook Pro, GHC 6.10.4)
(b) I should probably be using ByteStrings
That does help, but the worst part is building the map. That takes a couple of seconds in Python, too. Just building the map takes 1.95s for Python, 3.6s (including GC) with strict ByteStrings, 4.2s with lazy ByteStrings and 6s with plain Strings here.
I get the Python version running in about 1s, compared to 5.6s for the Haskell:

$ time python spelling.py because
real  0m1.071s
user  0m0.821s
sys   0m0.139s

$ time ./spelling becuase
becuase -> because
real  0m5.589s
user  0m4.554s
sys   0m0.307s

And, strangely, the rewrite you provided (I called it "spelling_df") runs a fair bit slower:

$ time ./spelling_df becuase
becuase -> because
real  0m8.087s
user  0m6.966s
sys   0m0.193s

$ time ./spelling korrekt
korrekt -> correct
real  0m5.970s
user  0m4.885s
sys   0m0.300s

$ time ./spelling_df korrekt
korrekt -> correct
real  0m8.616s
user  0m7.538s
sys   0m0.187s
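For reference, the map-building part after change (a) (plus the switch to ByteStrings) looks roughly like the sketch below. This is from memory rather than the exact code, and "big.txt" is just the name of Norvig's training corpus:

  import qualified Data.ByteString.Char8 as B
  import qualified Data.Map as M
  import Data.Char (isAlpha, toLower)
  import Data.List (foldl')

  -- strict left fold with strict insertion, so no thunks pile up per key
  train :: [B.ByteString] -> M.Map B.ByteString Int
  train = foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty

  -- read the corpus strictly and split it into lower-cased words
  nwords :: IO (M.Map B.ByteString Int)
  nwords = fmap (train . splitWords) (B.readFile "big.txt")
    where
      splitWords = filter (not . B.null)
                 . B.splitWith (not . isAlpha)
                 . B.map toLower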
The code is at [3] (link to version at time of post). Profiling [4]
shows:
$ ./spelling becuase +RTS -p
becuase -> because

$ cat spelling.prof
total time  = 4.02 secs (201 ticks @ 20 ms)
total alloc = 1,544,257,792 bytes (excludes profiling overheads)
COST CENTRE  MODULE  %time  %alloc

train        Main     52.7    19.7
readFile     Main     28.9     8.6
wordsBy      Main     10.9    49.5
toLower      Main      7.0    21.8
...
So it appears that "train" (building the freq map) and "readFile" in "nwords" are the places to hone.
readFile does not appear in my profile.
Apologies, I should have said that I’d inserted some SCC's to try to tease out the cost of readFile (i.e. {-# SCC "readFile" #-}).
If you insert an SCC for updateMap,
where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model
, you'll see that the really bad citizen is updateMap (splitWords is rather bad, too, together they take some 95% of the time in that profile).
Maybe I'm doing this wrong, but I see "splitWords" in spelling_df taking 80% of runtime. Adding SCC's like this:

splitWords = {-# SCC "filter" #-} filter (not . B.null) .
             {-# SCC "splitWith" #-} B.splitWith isNogud .
             {-# SCC "toLower" #-} B.map toLower

gives me:

splitWords  Main  216        1   0.0   0.0  78.6  91.8
 filter     Main  217        1   1.9   3.0  78.6  91.8
  splitWith Main  218        1  28.4  36.8  76.7  88.8
   isNogud  Main  221  6488666   4.2   4.1   4.2   4.1
   toLower  Main  219        1  44.2  47.9  44.2  47.9

i.e. it seems that "splitWith" and "toLower" (!) are the culprits. Why, I have no idea. Am I reading this wrong?
But once you start needing two edits (try korrekt), correct and edits1 start to show up. That shows that Norvig's algorithm isn't really good. With two edit steps, you create a _lot_ of strings you need to look up, far more than there are in the map. That takes time. It'll be better to scan the map for entries with an edit distance (< 3) if you have a good method to check that (http://old.nabble.com/haskell-in-online-contests-td26546989.html contains pointers for that).
Will have a look at that: it looks like it'll be very informative.
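To make sure I understand the idea, the most naive version of "scan the map instead of generating edits" would be something like the sketch below, i.e. a plain single-row Levenshtein run over every key. (This is only my own strawman so I can see what is being proposed; the thread you link to presumably has a much cleverer check than this.)

  import qualified Data.Map as M
  import Data.List (foldl')

  -- textbook one-row dynamic-programming edit distance; no early cutoff
  -- at 2, so this is a baseline for the idea, not an efficient check
  editDistance :: Eq a => [a] -> [a] -> Int
  editDistance a b = last (foldl' step [0 .. length b] a)
    where
      step row@(d : ds) c = scanl next (d + 1) (zip3 b row ds)
        where
          next left (bc, diag, up) =
            minimum [left + 1, up + 1, diag + (if bc == c then 0 else 1)]
      step [] _ = []  -- never reached: the row always has at least one cell

  -- keep every known word within two edits of the query
  candidates :: String -> M.Map String Int -> [String]
  candidates w m = [k | k <- M.keys m, editDistance w k < 3]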
Another thing is
allWords = keysSet wordCounts
Ouch. For each correction, you construct that set anew. Just use member from Data.Map instead of Data.Set.member and look up the words in the map.
Whoops! Just to be clear though: Haskell will memoise the result of "allWords" for a given invocation of "correct"? So this would only make a large difference for multiple corrections? (which I wasn't worrying about for the moment). The change seems to wipe off about 0.2s on average.
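For reference, the kind of change involved looks like this (illustrative definitions only, not my exact code; "known" here filters a set of candidate words against the frequency map wordCounts):

  import qualified Data.Map as M
  import qualified Data.Set as Set

  -- before: rebuilds the whole key set from the map on every call
  -- known wordCounts = Set.filter (`Set.member` M.keysSet wordCounts)

  -- after: look the words up in the map directly
  known :: Ord a => M.Map a Int -> Set.Set a -> Set.Set a
  known wordCounts = Set.filter (`M.member` wordCounts)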
I will look at using Bloom filters or tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6MB file?
No way. But it doesn't seem to, from my GHC's point of view.
Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:

myReadFile  Main  210  1  35.8  8.6  35.8  8.6

i.e. 35.8% of runtime.
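(For completeness, "myReadFile" is just a trivial wrapper so that the cost centre covers only the file read, roughly as below, with B being Data.ByteString.Char8 as before:)

  -- trivial wrapper so the read gets its own cost centre in the profile
  myReadFile :: FilePath -> IO B.ByteString
  myReadFile path = {-# SCC "myReadFile" #-} B.readFile path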
Sorry to dump such a long post on the list — I’ll understand if no one can be bothered rehashing this. But, in summary I’d like to know:
(a) how could I use ByteStrings for this to speed up I/O and reduce memory usage without losing the nice readability?
And thanks for the other small tips (e.g. findWithDefault), and I didn't know you could use "let" the way you did either.

Cheers,

Matthew.