
On Monday, 25 January 2010, 05:34:50, Matthew Phillips wrote:
On 24/01/2010, at 10:22 PM, Daniel Fischer wrote:
<...>
I think I know what happened here:
$ ghc -fforce-recomp --make matthew -o matthew0
<...>
I habitually compile all code with -O2, unless I have a specific reason not to. I tend to forget that some compile without optimisations. For some kinds of code that makes hardly any difference, for others it makes a big difference.
I used the flags "-funbox-strict-fields -fvia-C -optc-O2", but *not* -O2. Whoops! I could kick myself: I blindly assumed that -optc-O2 would turn on optimisation, but of course that's just the flag for GCC.
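For the record, the intended invocation would have been something along these lines (the exact file names here are guesses based on the rest of the thread):

$ ghc -O2 -funbox-strict-fields --make spelling.hs -o spelling

-O2 turns on GHC's own optimiser; -optc-O2 only passes -O2 through to GCC when compiling via C, which does nothing for the Haskell-level optimisations.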
By the way, compiling via C nowadays hardly ever produces faster code than the native code generator (most times I tried since 6.10, if there was a difference, native was faster).
$ time ./spelling becuase
becuase -> because

real    0m4.467s
user    0m3.865s
sys     0m0.280s

$ time ./spelling_df becuase
becuase -> because

real    0m2.422s
user    0m2.198s
sys     0m0.081s
Your previous version is close to twice as fast, and now only 2.5 times slower than Python.
<snipped new version of code with toLower removed>
With your suggested changes, your latest version on my machine:
$ time ./spelling_df becuase
becuase -> because

real    0m1.731s
user    0m1.572s
sys     0m0.056s
Now, we're getting close: 4.7s -> 2.3s -> 1.7s.
But once you start needing two edits (try "korrekt"), correct and edits1 start to show up. That shows that Norvig's algorithm isn't really good. With two edit steps, you create a _lot_ of strings you need to look up, far more than there are in the map, and that takes time. It would be better to scan the map for entries within edit distance (< 3) of the word, if you have a good method to check that.
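For illustration, a bounded edit-distance check of the kind meant here could look roughly like the sketch below. This is plain Levenshtein with a cut-off, not the transposition-aware, lazily evaluated version discussed later in the thread; the names and the cut-off logic are mine.

withinDistance :: Eq a => Int -> [a] -> [a] -> Bool
withinDistance bound xs ys = go [0 .. length xs] ys
  where
    -- after consuming all of ys, the last entry of the row is the distance
    go row []     = last row < bound
    go row (c:cs)
      | all (>= bound) row = False      -- row minima never decrease, so give up early
      | otherwise          = go (next row c) cs
    -- standard dynamic-programming step: build the next row from the previous one
    next row@(n:ns) c = scanl step (n + 1) (zip3 xs row ns)
      where
        step left (x, diag, up) =
          minimum [up + 1, left + 1, diag + fromEnum (x /= c)]
    next [] _ = []

main :: IO ()
main = mapM_ print
  [ withinDistance 3 "korrekt" "correct"   -- True: two substitutions
  , withinDistance 3 "becuase" "because"   -- True: a transposition counts as two plain edits
  , withinDistance 3 "haskell" "because"   -- False
  ]

Scanning the whole map with such a check trades the blow-up of generated candidate strings for one distance computation per known word, each of which stops early for clearly distant words.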
Indeed:

$ time ./nLDBSWSpelling becuase
becuase -> because
2.84user 0.02system 0:02.86elapsed 100%CPU

$ time ./nLDBSWSpelling korrekt
korrekt -> correct
2.83user 0.05system 0:02.88elapsed 100%CPU
Not sure if I see what you're saying here: do you mean to point out the 2.86 vs 2.88 elapsed?
Well, above the code, I had the times for Norvig's algorithm (creating all two-step edits and checking which are known):
And, without profiling:

$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU
Finally, building the set of two-step edits takes longer than the additional lookups:
$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU

$ time ./spellingBSW korrekt
korrekt -> correct
3.50user 0.02system 0:03.52elapsed 100%CPU
vs.
$ time ./spellingBSWL becuase
becuase -> because
2.79user 0.04system 0:02.83elapsed 100%CPU

$ time ./spellingBSWL3 korrekt
korrekt -> correct
3.20user 0.02system 0:03.23elapsed 99%CPU
For "becuase", which is a one-step edit, all take the same time (within normal fluctuation), 2.8x seconds. For the two-step edit "korrekt", the version building Sets takes 3.5 seconds, the version which doesn't build a Set of two-step edits takes 3.2 seconds, *and the version scanning the Map of known words for entries with a Levenshtein distance [modified to account for transpositions] less than 3, takes 2.8y seconds, practically the same time as for the one-step edit*. It becomes more obvious, perhaps, if we test a couple of two-steppers: Lazy Levenshtein: time ./nLDBSWSpelling becrase korrekt halmos territoir gutzenperg becrase -> because korrekt -> correct halmos -> holmes territoir -> territory gutzenperg -> gutenberg 2.94user 0.03system 0:02.97elapsed 100%CPU just something like 0.1 - 0.15 seconds more than for "becuase", that makes 0.02 - 0.04 seconds per two-edit correction. Sets: $ time ./spellingBSW becrase korrekt halmos territoir gutzenperg becrase -> because korrekt -> correct halmos -> holmes territoir -> territory gutzenperg -> gutenberg 7.48user 0.03system 0:07.55elapsed 99%CPU Gewalt geschrien! That takes almost a second per two-edit correction. List: $ time ./spellingBSWL3 becrase korrekt halmos territoir gutzenperg becrase -> because korrekt -> correct halmos -> holmes territoir -> territory gutzenperg -> gutenberg 5.54user 0.04system 0:05.58elapsed 99%CPU Better, but still bad, roughly half a second per correction. Python without psyco: $ time python ./norvig.py becrase korrekt halmos territoir gutzenperg Trained in 1.46993017197 seconds because correct holmes territory gutenberg 3.00user 0.08system 0:03.09elapsed 99%CPU $ time python ./norvig.py becuase Trained in 1.49027395248 seconds because 1.46user 0.08system 0:01.55elapsed 100%CPU about 0.3 seconds per correction and with: $ time python ./norvig.py becrase korrekt halmos territoir gutzenperg Trained in 1.22417902946 seconds because correct holmes territory gutenberg 2.12user 0.09system 0:02.21elapsed 100%CPU $ time python ./norvig.py becuase Trained in 1.17486715317 seconds because 1.14user 0.08system 0:01.23elapsed 99%CPU about 0.2 seconds per correction. (Just for the record, I found out what Python did in the half second I couldn't account for: it looked for matches for "./norvig.py", I forgot that Python passes the name of the script in argv[0] - d'oh)
Another thing is
allWords = keysSet wordCounts
Ouch. For each correction, you construct that set anew. Just use member from Data.Map instead of Data.Set.member and look up the words in the map.
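A minimal sketch of that change, with made-up names (wordCounts, known) standing in for the thread's definitions and ByteString assumed for the word type:

import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B

-- Membership test directly against the frequency map, instead of first
-- building allWords = keysSet wordCounts and using Data.Set.member.
known :: M.Map B.ByteString Int -> [B.ByteString] -> [B.ByteString]
known wordCounts = filter (`M.member` wordCounts)

main :: IO ()
main = do
  let wc = M.fromList [(B.pack "because", 42), (B.pack "correct", 17)]
  print (known wc (map B.pack ["becuase", "because"]))   -- ["because"]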
Whoops! Just to be clear though: Haskell will memoise the result of "allWords" for a given invocation of "correct"?
Yes. But not across different invocations.
So this would only make a large difference for multiple corrections?
Right. But that's the interesting use case, isn't it?
It will be when I get the rest of it working, yes :)
I will look at using Bloom filters or tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6 MB file?
No way. But it doesn't seem to, from my GHC's point of view.
Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:
myReadFile Main 210 1 35.8 8.6 35.8 8.6
i.e. 35.8% of runtime.
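The split-out wrapper is presumably just something of this shape (a reconstruction, not the exact code, which is linked below), so that profiling gives the file read its own cost centre, here assuming the ByteString readFile and a hypothetical corpus file name:

import qualified Data.ByteString.Char8 as B

myReadFile :: FilePath -> IO B.ByteString
myReadFile = B.readFile

main :: IO ()
main = do
  txt <- myReadFile "big.txt"   -- hypothetical corpus file name
  print (B.length txt)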
Can I see the exact code which gives that profile? Somehow, things that shouldn't be are getting attributed to readFile.
The version at this link has myReadFile split out.
http://github.com/scramjet/spelling/blob/31071edb2166b2bc4d350358900d975228fd43b9/spelling.hs
Doing the same to your version has the same result:
myReadFile Main 210 1 45.7 9.6 45.7 9.6
It does seem that the profiler is wrong or misleading somehow.
Strange. Doing that here has no such effect. The only line in the profile (ByteString) where myReadFile occurs is

myReadFile Main 260 1 0.0 0.9 0.0 0.9 0 6507348

(whitespace snipped), i.e. entered once, allocating 6.5 million bytes, taking no measurable time. The same holds both compiled via C and with the native code generator. With String I/O, it costs 4.6% time and 8.1% allocation.
Two other quick questions:
(1) you added parentheses to "candidates":
candidates = known [word] `or` ((known e1) `or` known_edits2)
The "or"'s should short circuit so that if "known [word]" is non-empty, the others shouldn't be evaluated. I read the above as forcing evaluation of the second "or" first: am I wrong?
Au contraire: "Any operator lacking a fixity declaration is assumed to be infixl 9", says the report, so without parentheses it's parsed as

(known [word] `or` known e1) `or` known_edits2

and both `or`s must be evaluated even if word is known. Since `or` is lazy in its second argument, if we associate it as

known [word] `or` (known e1 `or` known_edits2)

then, in case of a known word, we needn't look at the parenthesised part at all. (It will probably not make a measurable difference in running time unless you check a couple of million known words, but still, it feels lazier.) For a local definition, I thought an explicit fixity declaration was overkill.
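A small sketch of that point, assuming `or` has roughly this shape (the real code may work on Data.Set rather than lists; the argument is the same):

-- Keep the first list unless it is empty; lazy in the second argument.
or' :: [a] -> [a] -> [a]
or' xs ys = if null xs then ys else xs

-- With no fixity declaration, a `or` b `or` c parses as the left-associated
-- version below.  Neither version forces the (here deliberately erroring)
-- expensive arguments when the first list is non-empty, but only the
-- right-associated one avoids reducing the inner application at all.
leftAssoc, rightAssoc :: [String]
leftAssoc  = (["because"] `or'` error "known e1") `or'` error "known_edits2"
rightAssoc =  ["because"] `or'` (error "known e1" `or'` error "known_edits2")

main :: IO ()
main = do
  print leftAssoc    -- ["because"], no error raised
  print rightAssoc   -- ["because"], the parenthesised part is never touched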
(2) you eliminated the "fold" in "correct" in favour of a tail-recursive search in "maxCount": was this for style or performance reasons (or both :)?
Performance, kind of. Since the lists we fold over are in fact short, it doesn't really matter, but if they were long, there'd be the risk of unnecessarily allocating lots of pairs. I'm rather confident that with -O2, GHC will eliminate the pairs anyway, but without looking at the core, I'm not entirely sure.

However, in fact it was an unthought-through rewrite, because I had just seen someone get a stack overflow in spite of foldl', due to the laziness of the data constructors [*]. So I made sure that that couldn't happen, without looking at the folded function to see whether it already prevents it. And in fact it does, so using a foldl' if you use lists instead of Sets is fine; with respect to style, even preferable.

[*] Classic example: why will

average xs = sum / len
  where
    (sum, len) = foldl' accum (0, 0) xs
    accum (sm, ln) x = (sm + x, ln + 1)

cause a stack overflow for long lists?
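To spell out the footnote's answer: foldl' only forces the accumulator to weak head normal form, i.e. to the pair constructor, so the sums inside the pair still pile up as unevaluated thunks and blow the stack when they are finally demanded. A version with the fields forced on every step (a generic sketch, not code from the thread) runs in constant space:

{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

average :: [Double] -> Double
average xs = s / fromIntegral n
  where
    (s, n) = foldl' accum (0, 0 :: Int) xs
    -- bang patterns force both components before the next pair is built,
    -- so no chain of (+) thunks can accumulate inside the accumulator
    accum (!sm, !ln) x = (sm + x, ln + 1)

main :: IO ()
main = print (average [1 .. 1000000])   -- 500000.5, in constant space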
Cheers,
Matthew.