
On 24/01/2010, at 10:22 PM, Daniel Fischer wrote: <...>
I think I know what happened here:
$ ghc -fforce-recomp --make matthew -o matthew0
<...>
I habitually compile all code with -O2, unless I have a specific reason not to. I tend to forget that some compile without optimisations. For some kinds of code that makes hardly any difference, for others it makes a big difference.
I used the flags "-funbox-strict-fields -fvia-C -optc-O2", but *not* -O2. Whoops! I could kick myself: I blindly assumed that -optc-O2 would turn on optimisation, but of course that's just the flag for GCC.

$ time ./spelling becuase
becuase -> because
real    0m4.467s
user    0m3.865s
sys     0m0.280s

$ time ./spelling_df becuase
becuase -> because
real    0m2.422s
user    0m2.198s
sys     0m0.081s

Your previous version is close to twice as fast, and now only 2.5 times slower than Python.

<snipped new version of code with toLower removed>

With your suggested changes, your latest version on my machine:

$ time ./spelling_df becuase
becuase -> because
real    0m1.731s
user    0m1.572s
sys     0m0.056s

Now we're getting close: 4.7s -> 2.3s -> 1.7s.
But once you start needing two edits (try "korrekt"), correct and edits1 start to show up. That shows that Norvig's algorithm isn't really good here. With two edit steps, you create a _lot_ of strings you need to look up, far more than there are entries in the map, and that takes time. It would be better to scan the map for entries with an edit distance < 3, if you have a good method for checking that.
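If I follow the suggestion, it amounts to something like this sketch (my own names, not code from either of our programs; a plain Levenshtein rather than anything clever):

```haskell
import qualified Data.Map as Map

-- Plain Levenshtein distance, computed one row at a time.
-- A reference implementation, not tuned for speed.
levenshtein :: String -> String -> Int
levenshtein xs ys = last (foldl step [0 .. length xs] ys)
  where
    step row@(r:rs) y = scanl compute (r + 1) (zip3 xs row rs)
      where
        -- left: current row's previous cell; diag/above: previous row
        compute left (c, diag, above) =
          minimum [above + 1, left + 1, diag + if c == y then 0 else 1]
    step [] _ = error "row is never empty"

-- Instead of generating every string within two edits and looking each
-- one up, scan the map once and keep entries within distance < 3.
closeWords :: String -> Map.Map String Int -> [(String, Int)]
closeWords w m = [ (w', n) | (w', n) <- Map.toList m
                           , levenshtein w w' < 3 ]
```

So for "korrekt" this would do one pass over the ~30k map entries instead of building the huge edits2 set.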
Indeed:

$ time ./nLDBSWSpelling becuase
becuase -> because
2.84user 0.02system 0:02.86elapsed 100%CPU

$ time ./nLDBSWSpelling korrekt
korrekt -> correct
2.83user 0.05system 0:02.88elapsed 100%CPU
I'm not sure I see what you're saying here: do you mean to point out the 2.86 vs. 2.88 seconds elapsed?
Another thing is
allWords = keysSet wordCounts
Ouch. For each correction, you construct that set anew. Just use member from Data.Map instead of Data.Set.member and look up the words in the map.
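Concretely, the fix is just this (a sketch with my own function names; wordCounts is the usual Map from words to counts):

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Before: keysSet builds an entire Set of keys, only to test membership.
knownSlow :: Map.Map String Int -> String -> Bool
knownSlow wordCounts w = Set.member w (Map.keysSet wordCounts)

-- After: ask the Map directly; same O(log n) lookup, no Set constructed.
knownFast :: Map.Map String Int -> String -> Bool
knownFast wordCounts w = Map.member w wordCounts
```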
Whoops! Just to be clear though: Haskell will memoise the result of "allWords" for a given invocation of "correct"?
Yes. But not across different invocations.
So this would only make a large difference for multiple corrections?
Right. But that's the interesting use case, isn't it?
It will be when I get the rest of it working, yes :)
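For the record, the sharing rule we're talking about, as a tiny sketch (toy names of my own):

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- allWords is bound in a where clause, so it is computed at most once
-- per call of checkBoth and shared between the two member tests below --
-- but it is rebuilt from scratch on every fresh call of checkBoth.
checkBoth :: Map.Map String Int -> String -> String -> (Bool, Bool)
checkBoth wordCounts w1 w2 =
    (Set.member w1 allWords, Set.member w2 allWords)
  where
    allWords = Map.keysSet wordCounts
```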
I will look at using Bloom filters or tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6MB file?
No way. But it doesn't seem to, from my GHC's point of view.
Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:
myReadFile Main 210 1 35.8 8.6 35.8 8.6
i.e. 35.8% of runtime.
Can I see the exact code which gives that profile? Somehow, things which shouldn't be must be getting attributed to readFile.
The version at this link has myReadFile split out:

http://github.com/scramjet/spelling/blob/31071edb2166b2bc4d350358900d975228f...

Doing the same to your version has the same result:

myReadFile Main 210 1 45.7 9.6 45.7 9.6

It does seem that the profiler is wrong or misleading somehow.

Two other quick questions:

(1) You added parentheses to "candidates":

candidates = known [word] `or` ((known e1) `or` known_edits2)

The "or"s should short-circuit, so that if "known [word]" is non-empty the others shouldn't be evaluated. I read the above as forcing evaluation of the second "or" first: am I wrong?

(2) You eliminated the "fold" in "correct" in favour of a tail-recursive search in "maxCount": was this for style or performance reasons (or both :)?

Cheers,
Matthew.
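P.S. A tiny laziness sketch I wrote while puzzling over (1); orElse stands in for the `or` above, renamed to avoid clashing with Prelude's or:

```haskell
-- First-non-empty choice, as in the Norvig port.
orElse :: [a] -> [a] -> [a]
orElse [] ys = ys
orElse xs _  = xs

-- Pattern matching only forces the first argument to weak head normal
-- form, so when it is non-empty the second argument -- here a bomb --
-- is never evaluated: the parentheses change grouping, not strictness.
demo :: [String]
demo = ["word"] `orElse` (undefined `orElse` undefined)
```

demo evaluates to ["word"] without touching undefined, which suggests the extra parentheses don't force anything by themselves.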