
On Monday, 25 January 2010, 05:34:50, Matthew Phillips wrote:
On 24/01/2010, at 10:22 PM, Daniel Fischer wrote:
<...>
I think I know what happened here:
$ ghc -fforce-recomp --make matthew -o matthew0
<...>
I habitually compile all code with -O2, unless I have a specific reason not to. I tend to forget that some compile without optimisations. For some kinds of code that makes hardly any difference, for others it makes a big difference.
I used the flags "-funbox-strict-fields -fvia-C -optc-O2", but *not* -O2. Whoops! I could kick myself: I blindly assumed that -optc-O2 would turn on optimisation, but of course that's just the flag for GCC.
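For the record, the intended invocation would have been something along these lines (the exact file names here are guesses based on the rest of the thread):

$ ghc -O2 -funbox-strict-fields --make spelling.hs -o spelling

-O2 turns on GHC's own optimiser; -optc-O2 only passes -O2 through to GCC when compiling via C, which does nothing for the Haskell-level optimisations.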
By the way, compiling via C nowadays hardly ever produces faster code than the native code generator (most times I tried since 6.10, if there was a difference, native was faster).
$ time ./spelling becuase
becuase -> because

real    0m4.467s
user    0m3.865s
sys     0m0.280s

$ time ./spelling_df becuase
becuase -> because

real    0m2.422s
user    0m2.198s
sys     0m0.081s
Your previous version is close to twice as fast, and now only 2.5 times slower than Python.
<snipped new version of code with toLower removed>
With your suggested changes, your latest version on my machine:
$ time ./spelling_df becuase
becuase -> because

real    0m1.731s
user    0m1.572s
sys     0m0.056s
Now, we're getting close: 4.7s -> 2.3s -> 1.7s.
But once you start needing two edits (try "korrekt"), correct and edits1 start to show up. That shows that Norvig's algorithm isn't really good. With two edit steps, you create a _lot_ of strings you need to look up, far more than there are in the map, and that takes time. It would be better to scan the map for entries within edit distance (< 3) of the word, if you have a good method to check that.
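For illustration, a bounded edit-distance check of the kind meant here could look roughly like the sketch below. This is plain Levenshtein with a cut-off, not the transposition-aware, lazily evaluated version discussed later in the thread; the names and the cut-off logic are mine.

withinDistance :: Eq a => Int -> [a] -> [a] -> Bool
withinDistance bound xs ys = go [0 .. length xs] ys
  where
    -- after consuming all of ys, the last entry of the row is the distance
    go row []     = last row < bound
    go row (c:cs)
      | all (>= bound) row = False      -- row minima never decrease, so give up early
      | otherwise          = go (next row c) cs
    -- standard dynamic-programming step: build the next row from the previous one
    next row@(n:ns) c = scanl step (n + 1) (zip3 xs row ns)
      where
        step left (x, diag, up) =
          minimum [up + 1, left + 1, diag + fromEnum (x /= c)]
    next [] _ = []

main :: IO ()
main = mapM_ print
  [ withinDistance 3 "korrekt" "correct"   -- True: two substitutions
  , withinDistance 3 "becuase" "because"   -- True: a transposition counts as two plain edits
  , withinDistance 3 "haskell" "because"   -- False
  ]

Scanning the whole map with such a check trades the blow-up of generated candidate strings for one distance computation per known word, each of which stops early for clearly distant words.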
Indeed:

$ time ./nLDBSWSpelling becuase
becuase -> because
2.84user 0.02system 0:02.86elapsed 100%CPU

$ time ./nLDBSWSpelling korrekt
korrekt -> correct
2.83user 0.05system 0:02.88elapsed 100%CPU
Not sure if I see what you're saying here: do you mean to point out the 2.86 vs 2.88 elapsed?
Well, above the code, I had the times for Norvig's algorithm (creating all two-step edits and checking which are known):
And, without profiling:

$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU
Finally, building the set of two-step edits takes longer than the additional lookups:
$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU

$ time ./spellingBSW korrekt
korrekt -> correct
3.50user 0.02system 0:03.52elapsed 100%CPU
vs.
$ time ./spellingBSWL becuase
becuase -> because
2.79user 0.04system 0:02.83elapsed 100%CPU

$ time ./spellingBSWL3 korrekt
korrekt -> correct
3.20user 0.02system 0:03.23elapsed 99%CPU
For "becuase", which is a one-step edit, all take the same time (within normal fluctuation), 2.8x seconds. For the two-step edit "korrekt", the version building Sets takes 3.5 seconds, the version which doesn't build a Set of two-step edits takes 3.2 seconds, *and the version scanning the Map of known words for entries with a Levenshtein distance [modified to account for transpositions] less than 3, takes 2.8y seconds, practically the same time as for the one-step edit*. It becomes more obvious, perhaps, if we test a couple of two-steppers: Lazy Levenshtein: time ./nLDBSWSpelling becrase korrekt halmos territoir gutzenperg becrase -> because korrekt -> correct halmos -> holmes territoir -> territory gutzenperg -> gutenberg 2.94user 0.03system 0:02.97elapsed 100%CPU just something like 0.1 - 0.15 seconds more than for "becuase", that makes 0.02 - 0.04 seconds per two-edit correction. Sets: $ time ./spellingBSW becrase korrekt halmos territoir gutzenperg becrase -> because korrekt -> correct halmos -> holmes territoir -> territory gutzenperg -> gutenberg 7.48user 0.03system 0:07.55elapsed 99%CPU Gewalt geschrien! That takes almost a second per two-edit correction. List: $ time ./spellingBSWL3 becrase korrekt halmos territoir gutzenperg becrase -> because korrekt -> correct halmos -> holmes territoir -> territory gutzenperg -> gutenberg 5.54user 0.04system 0:05.58elapsed 99%CPU Better, but still bad, roughly half a second per correction. Python without psyco: $ time python ./norvig.py becrase korrekt halmos territoir gutzenperg Trained in 1.46993017197 seconds because correct holmes territory gutenberg 3.00user 0.08system 0:03.09elapsed 99%CPU $ time python ./norvig.py becuase Trained in 1.49027395248 seconds because 1.46user 0.08system 0:01.55elapsed 100%CPU about 0.3 seconds per correction and with: $ time python ./norvig.py becrase korrekt halmos territoir gutzenperg Trained in 1.22417902946 seconds because correct holmes territory gutenberg 2.12user 0.09system 0:02.21elapsed 100%CPU $ time python ./norvig.py becuase Trained in 1.17486715317 seconds because 1.14user 0.08system 0:01.23elapsed 99%CPU about 0.2 seconds per correction. (Just for the record, I found out what Python did in the half second I couldn't account for: it looked for matches for "./norvig.py", I forgot that Python passes the name of the script in argv[0] - d'oh)
Another thing is
allWords = keysSet wordCounts
Ouch. For each correction, you construct that set anew. Just use member from Data.Map instead of Data.Set.member and look up the words in the map.
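A minimal sketch of that change, with made-up names (wordCounts, known) standing in for the thread's definitions and ByteString assumed for the word type:

import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B

-- Membership test directly against the frequency map, instead of first
-- building allWords = keysSet wordCounts and using Data.Set.member.
known :: M.Map B.ByteString Int -> [B.ByteString] -> [B.ByteString]
known wordCounts = filter (`M.member` wordCounts)

main :: IO ()
main = do
  let wc = M.fromList [(B.pack "because", 42), (B.pack "correct", 17)]
  print (known wc (map B.pack ["becuase", "because"]))   -- ["because"]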
Whoops! Just to be clear though: Haskell will memoise the result of "allWords" for a given invocation of "correct"?
Yes. But not across different invocations.
So this would only make a large difference for multiple corrections?
Right. But that's the interesting use case, isn't it?
It will be when I get the rest of it working, yes :)
I will look at using Bloom filters or tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6 MB file?
No way. But it doesn't seem to, from my GHC's point of view.
Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:
myReadFile Main 210 1 35.8 8.6 35.8 8.6
i.e. 35.8% of runtime.
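The split-out wrapper is presumably just something of this shape (a reconstruction, not the exact code, which is linked below), so that profiling gives the file read its own cost centre, here assuming the ByteString readFile and a hypothetical corpus file name:

import qualified Data.ByteString.Char8 as B

myReadFile :: FilePath -> IO B.ByteString
myReadFile = B.readFile

main :: IO ()
main = do
  txt <- myReadFile "big.txt"   -- hypothetical corpus file name
  print (B.length txt)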
Can I see the exact code which gives that profile? Somehow, things that shouldn't be are getting attributed to readFile.
The version at this link has myReadFile split out.
http://github.com/scramjet/spelling/blob/31071edb2166b2bc4d350358900d975228fd43b9/spelling.hs
Doing the same to your version has the same result:
myReadFile Main 210 1 45.7 9.6 45.7 9.6
It does seem that the profiler is wrong or misleading somehow.
Strange. Doing that here has no such effect. The only line in the profile (ByteString) where myReadFile occurs is

myReadFile Main 260 1 0.0 0.9 0.0 0.9 0 6507348

(whitespace snipped), i.e. entered once, allocating 6.5 million bytes, taking no measurable time. The same holds both compiled via C and with the native code generator. With String I/O, it costs 4.6% time and 8.1% allocation.
Two other quick questions:
(1) you added parentheses to "candidates":
candidates = known [word] `or` ((known e1) `or` known_edits2)
The "or"'s should short circuit so that if "known [word]" is non-empty, the others shouldn't be evaluated. I read the above as forcing evaluation of the second "or" first: am I wrong?
Au contraire: "Any operator lacking a fixity declaration is assumed to be infixl 9", says the report, so without parentheses it's parsed as

(known [word] `or` known e1) `or` known_edits2

and both `or`s must be evaluated even if word is known. Since `or` is lazy in its second argument, if we associate it as

known [word] `or` (known e1 `or` known_edits2)

then, in case of a known word, we needn't look at the parenthesised part at all. (It will probably not make a measurable difference in running time unless you check a couple of million known words, but still, it feels lazier.) For a local definition, I thought an explicit fixity declaration was overkill.
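A small sketch of that point, assuming `or` has roughly this shape (the real code may work on Data.Set rather than lists; the argument is the same):

-- Keep the first list unless it is empty; lazy in the second argument.
or' :: [a] -> [a] -> [a]
or' xs ys = if null xs then ys else xs

-- With no fixity declaration, a `or` b `or` c parses as the left-associated
-- version below.  Neither version forces the (here deliberately erroring)
-- expensive arguments when the first list is non-empty, but only the
-- right-associated one avoids reducing the inner application at all.
leftAssoc, rightAssoc :: [String]
leftAssoc  = (["because"] `or'` error "known e1") `or'` error "known_edits2"
rightAssoc =  ["because"] `or'` (error "known e1" `or'` error "known_edits2")

main :: IO ()
main = do
  print leftAssoc    -- ["because"], no error raised
  print rightAssoc   -- ["because"], the parenthesised part is never touched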
(2) you eliminated the "fold" in "correct" in favour of a tail-recursive search in "maxCount": was this for style or performance reasons (or both :)?
Performance, kind of. Since the lists we fold over are in fact short, it doesn't really matter, but if they were long, there'd be the risk of unnecessarily allocating lots of pairs. I'm rather confident that with -O2, GHC will eliminate the pairs anyway, but without looking at the core, I'm not entirely sure.

However, in fact it was an unthought-through rewrite, because I had just seen someone get a stack overflow in spite of foldl', due to the laziness of the data constructors [*]. So I made sure that that couldn't happen, without looking at the folded function to see whether it already prevents it. And in fact it does, so using a foldl' if you use lists instead of Sets is fine; with respect to style, even preferable.

[*] Classic example: why will

average xs = sum / len
  where
    (sum, len) = foldl' accum (0, 0) xs
    accum (sm, ln) x = (sm + x, ln + 1)

cause a stack overflow for long lists?
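To spell out the footnote's answer: foldl' only forces the accumulator to weak head normal form, i.e. to the pair constructor, so the sums inside the pair still pile up as unevaluated thunks and blow the stack when they are finally demanded. A version with the fields forced on every step (a generic sketch, not code from the thread) runs in constant space:

{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

average :: [Double] -> Double
average xs = s / fromIntegral n
  where
    (s, n) = foldl' accum (0, 0 :: Int) xs
    -- bang patterns force both components before the next pair is built,
    -- so no chain of (+) thunks can accumulate inside the accumulator
    accum (!sm, !ln) x = (sm + x, ln + 1)

main :: IO ()
main = print (average [1 .. 1000000])   -- 500000.5, in constant space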
Cheers,
Matthew.