
On 24/01/2010, at 10:22 PM, Daniel Fischer wrote: <...>
I think I know what happened here:
$ ghc -fforce-recomp --make matthew -o matthew0
<...>
I habitually compile all code with -O2, unless I have a specific reason not to. I tend to forget that some compile without optimisations. For some kinds of code that makes hardly any difference, for others it makes a big difference.
I used the flags "-funbox-strict-fields -fvia-C -optc-O2", but *not* -O2. Whoops! I could kick myself: I blindly assumed that -optc-O2 would turn on optimisation, but of course that's just the flag for GCC.

$ time ./spelling becuase
becuase -> because
real    0m4.467s
user    0m3.865s
sys     0m0.280s

$ time ./spelling_df becuase
becuase -> because
real    0m2.422s
user    0m2.198s
sys     0m0.081s

Your previous version is close to twice as fast, and now only 2.5 times slower than Python.

<snipped new version of code with toLower removed>

With your suggested changes, your latest version on my machine:

$ time ./spelling_df becuase
becuase -> because
real    0m1.731s
user    0m1.572s
sys     0m0.056s

Now we're getting close: 4.7s -> 2.3s -> 1.7s.
But once you start needing two edits (try "korrekt"), correct and edits1 start to show up. That shows that Norvig's algorithm isn't really good here. With two edit steps, you create a _lot_ of strings you need to look up, far more than there are entries in the map, and that takes time. It would be better to scan the map for entries with an edit distance < 3, if you have a good method for checking that.
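If I follow the suggestion, it amounts to something like this sketch (my own names, not code from either of our programs; a plain Levenshtein rather than anything clever):

```haskell
import qualified Data.Map as Map

-- Plain Levenshtein distance, computed one row at a time.
-- A reference implementation, not tuned for speed.
levenshtein :: String -> String -> Int
levenshtein xs ys = last (foldl step [0 .. length xs] ys)
  where
    step row@(r:rs) y = scanl compute (r + 1) (zip3 xs row rs)
      where
        -- left: current row's previous cell; diag/above: previous row
        compute left (c, diag, above) =
          minimum [above + 1, left + 1, diag + if c == y then 0 else 1]
    step [] _ = error "row is never empty"

-- Instead of generating every string within two edits and looking each
-- one up, scan the map once and keep entries within distance < 3.
closeWords :: String -> Map.Map String Int -> [(String, Int)]
closeWords w m = [ (w', n) | (w', n) <- Map.toList m
                           , levenshtein w w' < 3 ]
```

So for "korrekt" this would do one pass over the ~30k map entries instead of building the huge edits2 set.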
Indeed:

$ time ./nLDBSWSpelling becuase
becuase -> because
2.84user 0.02system 0:02.86elapsed 100%CPU

$ time ./nLDBSWSpelling korrekt
korrekt -> correct
2.83user 0.05system 0:02.88elapsed 100%CPU
I'm not sure I see what you're saying here: do you mean to point out the 2.86 vs. 2.88 seconds elapsed?
Another thing is
allWords = keysSet wordCounts
Ouch. For each correction, you construct that set anew. Just use member from Data.Map instead of Data.Set.member and look up the words in the map.
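Concretely, the fix is just this (a sketch with my own function names; wordCounts is the usual Map from words to counts):

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Before: keysSet builds an entire Set of keys, only to test membership.
knownSlow :: Map.Map String Int -> String -> Bool
knownSlow wordCounts w = Set.member w (Map.keysSet wordCounts)

-- After: ask the Map directly; same O(log n) lookup, no Set constructed.
knownFast :: Map.Map String Int -> String -> Bool
knownFast wordCounts w = Map.member w wordCounts
```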
Whoops! Just to be clear though: Haskell will memoise the result of "allWords" for a given invocation of "correct"?
Yes. But not across different invocations.
So this would only make a large difference for multiple corrections?
Right. But that's the interesting use case, isn't it?
It will be when I get the rest of it working, yes :)
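For the record, the sharing rule we're talking about, as a tiny sketch (toy names of my own):

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- allWords is bound in a where clause, so it is computed at most once
-- per call of checkBoth and shared between the two member tests below --
-- but it is rebuilt from scratch on every fresh call of checkBoth.
checkBoth :: Map.Map String Int -> String -> String -> (Bool, Bool)
checkBoth wordCounts w1 w2 =
    (Set.member w1 allWords, Set.member w2 allWords)
  where
    allWords = Map.keysSet wordCounts
```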
I will look at using Bloom filters or tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6MB file?
No way. But it doesn't seem to, from my GHC's point of view.
Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:
myReadFile Main 210 1 35.8 8.6 35.8 8.6
i.e. 35.8% of runtime.
Can I see the exact code which gives that profile? Somehow, things which shouldn't be must be getting attributed to readFile.
The version at this link has myReadFile split out:

http://github.com/scramjet/spelling/blob/31071edb2166b2bc4d350358900d975228f...

Doing the same to your version has the same result:

myReadFile Main 210 1 45.7 9.6 45.7 9.6

It does seem that the profiler is wrong or misleading somehow.

Two other quick questions:

(1) You added parentheses to "candidates":

candidates = known [word] `or` ((known e1) `or` known_edits2)

The "or"s should short-circuit, so that if "known [word]" is non-empty the others shouldn't be evaluated. I read the above as forcing evaluation of the second "or" first: am I wrong?

(2) You eliminated the "fold" in "correct" in favour of a tail-recursive search in "maxCount": was this for style or performance reasons (or both :)?

Cheers,
Matthew.
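P.S. A tiny laziness sketch I wrote while puzzling over (1); orElse stands in for the `or` above, renamed to avoid clashing with Prelude's or:

```haskell
-- First-non-empty choice, as in the Norvig port.
orElse :: [a] -> [a] -> [a]
orElse [] ys = ys
orElse xs _  = xs

-- Pattern matching only forces the first argument to weak head normal
-- form, so when it is non-empty the second argument -- here a bomb --
-- is never evaluated: the parentheses change grouping, not strictness.
demo :: [String]
demo = ["word"] `orElse` (undefined `orElse` undefined)
```

demo evaluates to ["word"] without touching undefined, which suggests the extra parentheses don't force anything by themselves.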