Spelling checker exercise

Hello all,

sorry to bring up an old chestnut, but I'm trying to improve my Haskell-fu by writing a small program, and chose Peter Norvig's spelling checker as my exercise material (http://norvig.com/spell-correct.html).

While I've gotten it working, and found it quite illuminating, I also found it to be very slow. And then I discovered that, of course, others have been here well before me ([1] and [2]). Those discussions were very interesting, but the code they refer to is mostly not available, so the most I've gotten out of it so far is that:

(a) I should be using strict left folds and strict Map insertions for the word frequency map (this shaved off about a second: ~5s -> ~4s for a single word on my 2.4GHz MacBook Pro, GHC 6.10.4)
(b) I should probably be using ByteStrings
(c) Using Sets for the edit permutations probably isn't worth it (although I found using plain lists made it about a second slower)

(b) is difficult because I've used pattern matching plus list comprehensions to generate the potential edits, and I really like how elegantly it pans out that way. Because ByteStrings are not lists, I can't see a way to keep the current structure and use them.

The code is at [3] (link to version at time of post). Profiling [4] shows:

$ ./spelling becuase +RTS -p
becuase -> because

$ cat spelling.prof
total time  =  4.02 secs   (201 ticks @ 20 ms)
total alloc = 1,544,257,792 bytes  (excludes profiling overheads)

COST CENTRE   MODULE   %time %alloc
train         Main      52.7   19.7
readFile      Main      28.9    8.6
wordsBy       Main      10.9   49.5
toLower       Main       7.0   21.8
...

So it appears that "train" (building the freq map) and "readFile" in "nwords" are the places to hone. I will look at using Bloom filters or tries instead of Data.Map, but I wonder whether readFile should be taking nearly 30% of the run time, even for a 6MB file?

Sorry to dump such a long post on the list -- I'll understand if no one can be bothered rehashing this. But, in summary, I'd like to know:

(a) how could I use ByteStrings for this to speed up I/O and reduce memory usage without losing the nice readability?
(b) should readFile be so slow?
(c) any other tips?

Possibly all my questions could be answered if someone has the code from the old posts.

Cheers,

Matthew.

[1]: http://haskell.markmail.org/search/?q=norvig%20spelling#query:norvig%20spell...
[2]: http://thread.gmane.org/gmane.comp.lang.haskell.cafe/21780
[3]: http://github.com/scramjet/spelling/blob/44fd336ef4f62d49c7087dbc1ffb9c009b7...
[4]: http://github.com/scramjet/spelling/blob/f5146b24b77443c22975ff69fb91aa922cc...

Am Freitag 22 Januar 2010 07:51:27 schrieb Matthew Phillips:
Hello all,
sorry to bring up an old chestnut, but I’m trying to improve my Haskell-fu by writing a small program, and chose Peter Norvig’s spelling checker as my exercise material (http://norvig.com/spell-correct.html).
While I’ve gotten it working, and found it quite illuminating, I also found it to be very slow. And then I discovered that, of course, others have been here well before me ([1] and [2]). Those discussions were very interesting, but the code they refer to is mostly not available, so the most I’ve gotten out of it so far is that:
(a) I should be using strict left folds and strict Map insertions for the word frequency map (this shaved off about a second: ~5s -> ~4s for a single word on my 2.4GHz MacBook Pro, GHC 6.10.4)
(b) I should probably be using ByteStrings
That does help, but the worst part is building the map. That takes a couple of seconds in Python, too. Just building the map takes 1.95s for Python, 3.6s (including GC) with strict ByteStrings, 4.2s with lazy ByteStrings and 6s with plain Strings here. So I'd go with strict ByteStrings, although that takes a little more memory than lazy, but waay less than Strings.
(c) Using Set’s for the edit permutations probably isn’t worth it (although I found using plain lists made it about a second slower)
Might make a difference once you need to take two edit steps on a not very short word.
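For a sense of scale (this follows Norvig's own counting in the article, with the 26-letter alphabet used here): a word of length n has n deletions, n-1 transpositions, 26n replacements and 26(n+1) insertions, so edits1 yields 54n + 25 candidates, and the two-edit set is on the order of the square of that. A tiny illustrative calculation, nothing more:

-- Rough candidate counts, assuming a 26-letter alphabet (illustration only).
edits1Count :: Int -> Int
edits1Count n = n + (n - 1) + 26 * n + 26 * (n + 1)   -- = 54n + 25

edits2Count :: Int -> Int
edits2Count n = edits1Count n ^ 2   -- order of magnitude; the first edit changes the length slightly

For a 7-letter word that is 403 one-edit candidates and roughly 160,000 two-edit ones, far more than the distinct words in the corpus.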
(b) is difficult because I’ve used pattern matching plus list comprehensions to generate the potential edits, and I really like how elegantly it pans out that way. Because ByteStrings are not lists, I can’t see a way to keep the current structure and use them.
Train with ByteStrings, then do the edits on Strings and pack for lookup.
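A minimal sketch of that split (names are illustrative, not Matthew's actual code): the frequency map stays keyed by strict ByteStrings, the edit generation keeps working on Strings, and packing happens only at the lookup boundary.

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M

-- The trained map is keyed by strict ByteStrings...
type Freq = M.Map B.ByteString Int

-- ...but candidate edits are generated as plain Strings and packed only to look up.
knownEdits :: Freq -> [String] -> [B.ByteString]
knownEdits freq ws = [bw | w <- ws, let bw = B.pack w, bw `M.member` freq]

That way the list-comprehension edit code stays exactly as it is.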
The code is at [3] (link to version at time of post). Profiling [4] shows:
$ ./spelling becuase +RTS -p
becuase -> because

$ cat spelling.prof
total time  =  4.02 secs   (201 ticks @ 20 ms)
total alloc = 1,544,257,792 bytes  (excludes profiling overheads)

COST CENTRE   MODULE   %time %alloc
train         Main      52.7   19.7
readFile      Main      28.9    8.6
wordsBy       Main      10.9   49.5
toLower       Main       7.0   21.8
...
So it appears that "train" (building the freq map) and "readFile" in "nwords" are the places to hone.
readFile does not appear in my profile. If you insert an SCC for updateMap,

    where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model

you'll see that the really bad citizen is updateMap (splitWords is rather bad, too; together they take some 95% of the time in that profile).

But once you start needing two edits (try korrekt), correct and edits1 start to show up. That shows that Norvig's algorithm isn't really good. With two edit steps, you create a _lot_ of strings you need to look up, far more than there are in the map. That takes time. It'll be better to scan the map for entries with an edit distance (< 3) if you have a good method to check that (http://old.nabble.com/haskell-in-online-contests-td26546989.html contains pointers for that).

Another thing is

    allWords = keysSet wordCounts

Ouch. For each correction, you construct that set anew. Just use member from Data.Map instead of Data.Set.member and look up the words in the map.
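In code, that last change looks something like this (a sketch following the names used in the thread, not the exact posted source):

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M

-- Before (roughly): allWords = keysSet wordCounts, then Data.Set.member.
-- After: test membership directly in the frequency map; no Set is built.
isKnown :: M.Map B.ByteString Int -> B.ByteString -> Bool
isKnown wordCounts w = w `M.member` wordCounts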
I will look at using Bloom filters or tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6MB file?
No way. But it doesn't seem to, from my GHC's point of view.
Sorry to dump such a long post on the list — I’ll understand if no one can be bothered rehashing this. But, in summary I’d like to know:
(a) how could I use ByteString’s for this to speed up I/O and reduce memory usage without losing the nice readability?
A small rewrite of your code follows. I would have designed it slightly differently had I been using ByteStrings from the beginning; the packing in known and known_edits2 isn't too beautiful.

----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Prelude hiding (or)  -- the local 'or' below would otherwise clash with Prelude.or
import qualified Data.ByteString.Char8 as B
import Data.Char (toLower)
import Data.Map (Map, findWithDefault, insertWith', member)
import qualified Data.Map as Map (empty)
import Data.Set as Set (Set, fromList, toList, fold, null)
import Data.List (inits, tails, foldl')
import System.Environment (getArgs)

dataFile = "big.txt"
alphabet = "abcdefghijklmnopqrstuvwxyz"

splitWords :: B.ByteString -> [B.ByteString]
splitWords = filter (not . B.null) . B.splitWith isNogud . B.map toLower

isNogud :: Char -> Bool
isNogud c = c < 'a' || 'z' < c

train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap Map.empty
  where
    updateMap model word = insertWith' (+) word 1 model

nwords :: IO (Map B.ByteString Int)
nwords = return . train . splitWords =<< B.readFile dataFile

edits1 :: String -> [String]
edits1 s = toList . fromList $ deletes ++ transposes ++ replaces ++ inserts
  where
    deletes    = [a ++ bs | (a, _:bs) <- splits]
    transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits]
    replaces   = [a ++ (c:bs) | (a, _:bs) <- splits, c <- alphabet]
    inserts    = [a ++ (c:b) | (a, b) <- splits, c <- alphabet]
    splits     = zip (inits s) (tails s)

correct :: Map B.ByteString Int -> String -> String
correct wordCounts word = B.unpack . fst $ fold maxCount (B.pack "?", 0) candidates
  where
    candidates :: Set B.ByteString
    candidates = known [word] `or` ((known $ edits1 word) `or` known_edits2 word)

    known_edits2 :: String -> Set B.ByteString
    known_edits2 w = fromList [w3 | w1 <- edits1 w, w2 <- edits1 w1
                                  , let w3 = B.pack w2, w3 `member` wordCounts]

    known :: [String] -> Set B.ByteString
    known ws = fromList [w | w <- map B.pack ws, w `member` wordCounts]

    maxCount :: B.ByteString -> (B.ByteString, Int) -> (B.ByteString, Int)
    maxCount word current@(_, currentMax)
        | count > currentMax = (word, count)
        | otherwise          = current
      where
        count = findWithDefault 1 word wordCounts

or :: Set B.ByteString -> Set B.ByteString -> Set B.ByteString
or a b
    | Set.null a = b
    | otherwise  = a

main :: IO ()
main = do
    args <- getArgs
    wordCounts <- nwords
    mapM_ (printCorrect wordCounts) args
  where
    printCorrect :: Map B.ByteString Int -> String -> IO ()
    printCorrect wordCounts word =
        putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------
(b) should readFile be so slow?
(c) any other tips
Choose a better algorithm for the two-edit case.
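A sketch of the direction suggested above: scan the known words instead of generating every two-edit string. Here `withinDistance2` stands for a bounded edit-distance test like the one discussed later in the thread, and is assumed rather than defined:

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M

-- Keep only known words within edit distance 2 of the input, by scanning
-- the (much smaller) map instead of enumerating every two-edit variant.
closeCandidates :: (B.ByteString -> B.ByteString -> Bool)  -- withinDistance2, assumed
                -> M.Map B.ByteString Int
                -> B.ByteString
                -> [(B.ByteString, Int)]
closeCandidates withinDistance2 wordCounts word =
    [(w, c) | (w, c) <- M.assocs wordCounts, withinDistance2 word w]

With a distance function that cuts off early, this only ever does work proportional to the size of the dictionary, not to the size of the edit neighbourhood.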
Possibly all my questions could be answered if someone has the code from the old posts.
Cheers,
Matthew.

Thanks very much Daniel for giving my (amateurish!) exercise such an in-depth look-over. Comments inline below.

On 23/01/2010, at 12:11 AM, Daniel Fischer wrote:
Am Freitag 22 Januar 2010 07:51:27 schrieb Matthew Phillips:
Hello all,
sorry to bring up an old chestnut, but I’m trying to improve my Haskell-fu by writing a small program, and chose Peter Norvig’s spelling checker as my exercise material (http://norvig.com/spell-correct.html).
While I’ve gotten it working, and found it quite illuminating, I also found it to be very slow. And then I discovered that, of course, others have been here well before me ([1] and [2]). Those discussions were very interesting, but the code they refer to is mostly not available, so the most I’ve gotten out of it so far is that:
(a) I should be using strict left folds and strict Map insertions for the word frequency map (this shaved off about a second: ~5s -> ~4s for a single word on my 2.4GHz MacBook Pro, GHC 6.10.4)
(b) I should probably be using ByteStrings
That does help, but the worst part is building the map. That takes a couple of seconds in Python, too. Just building the map takes 1.95s for Python, 3.6s (including GC) with strict ByteStrings, 4.2s with lazy ByteStrings and 6s with plain Strings here.
I get the Python version running in about 1s, compared to 5.6s for the Haskell:

$ time python spelling.py because
real    0m1.071s
user    0m0.821s
sys     0m0.139s

$ time ./spelling becuase
becuase -> because
real    0m5.589s
user    0m4.554s
sys     0m0.307s

And, strangely, the rewrite you provided (I called it "spelling_df") runs a fair bit slower:

$ time ./spelling_df becuase
becuase -> because
real    0m8.087s
user    0m6.966s
sys     0m0.193s

$ time ./spelling korrekt
korrekt -> correct
real    0m5.970s
user    0m4.885s
sys     0m0.300s

$ time ./spelling_df korrekt
korrekt -> correct
real    0m8.616s
user    0m7.538s
sys     0m0.187s
The code is at [3] (link to version at time of post). Profiling [4]
shows:
$ ./spelling becuase +RTS -p
becuase -> because

$ cat spelling.prof
total time  =  4.02 secs   (201 ticks @ 20 ms)
total alloc = 1,544,257,792 bytes  (excludes profiling overheads)

COST CENTRE   MODULE   %time %alloc
train         Main      52.7   19.7
readFile      Main      28.9    8.6
wordsBy       Main      10.9   49.5
toLower       Main       7.0   21.8
...
So it appears that "train" (building the freq map) and "readFile" in "nwords" are the places to hone.
readFile does not appear in my profile.
Apologies, I should have said that I’d inserted some SCCs to try to tease out the cost of readFile (i.e. {-# SCC "readFile" #-}).
If you insert an SCC for updateMap,
where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model
, you'll see that the really bad citizen is updateMap (splitWords is rather bad, too, together they take some 95% of the time in that profile).
Maybe I'm doing this wrong, but I see "splitWords" in spelling_df taking 80% of runtime. Adding SCC's like this:

splitWords = {-# SCC "filter" #-} filter (not . B.null) .
             {-# SCC "splitWith" #-} B.splitWith isNogud .
             {-# SCC "toLower" #-} B.map toLower

gives me:

splitWords  Main  216        1   0.0   0.0  78.6  91.8
filter      Main  217        1   1.9   3.0  78.6  91.8
splitWith   Main  218        1  28.4  36.8  76.7  88.8
isNogud     Main  221  6488666   4.2   4.1   4.2   4.1
toLower     Main  219        1  44.2  47.9  44.2  47.9

i.e. it seems that "splitWith" and "toLower" (!) are the culprits. Why, I have no idea. Am I reading this wrong?
But once you start needing two edits (try korrekt), correct and edits1 start to show up. That shows that Norvig's algorithm isn't really good. With two edit steps, you create a _lot_ of strings you need to look up, far more than there are in the map. That takes time. It'll be better to scan the map for entries with an edit distance (< 3) if you have a good method to check that (http://old.nabble.com/haskell-in-online-contests-td26546989.html contains pointers for that).
Will have a look at that: it looks like it'll be very informative.
Another thing is
allWords = keysSet wordCounts
Ouch. For each correction, you construct that set anew. Just use member from Data.Map instead of Data.Set.member and look up the words in the map.
Whoops! Just to be clear though: Haskell will memoise the result of "allWords" for a given invocation of "correct"? So this would only make a large difference for multiple corrections? (which I wasn't worrying about for the moment). The change seems to wipe off about 0.2s on average.
I will look at using Bloom filters or tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6MB file?
No way. But it doesn't seem to, from my GHC's point of view.
Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:

myReadFile  Main  210  1  35.8  8.6  35.8  8.6

i.e. 35.8% of runtime.
Sorry to dump such a long post on the list — I’ll understand if no one can be bothered rehashing this. But, in summary I’d like to know:
(a) how could I use ByteString’s for this to speed up I/O and reduce memory usage without losing the nice readability?
And thanks for the other small tips (e.g. findWithDefault), and I didn't know you could use let the way you did either.

Cheers,

Matthew.

Am Sonntag 24 Januar 2010 06:19:58 schrieb Matthew Phillips:
Thanks very much Daniel for giving my (amateurish!) exercise such an in-depth look-over. Comments inline below.
On 23/01/2010, at 12:11 AM, Daniel Fischer wrote:
That does help, but the worst part is building the map. That takes a couple of seconds in Python, too. Just building the map takes 1.95s for Python, 3.6s (including GC) with strict ByteStrings, 4.2s with lazy ByteStrings and 6s with plain Strings here.
Correction: I had taken the total time of a no-argument run as the time it took Python to build the map; adding explicit timing around the map-building says it takes ~1.45 seconds. I'm clueless as to what Python needs the remaining half second for.
I get the Python version running in about 1s, compared to 5.6s for the Haskell:
$ python --version
Python 2.6

With psyco.full():

$ time python ./norvig.py becuase
Trained in 1.16967606544 seconds
./norvig.py because
1.52user 0.08system 0:01.60elapsed 100%CPU

without psyco:

$ time python ./norvig.py becuase
Trained in 1.45706319809 seconds
./norvig.py because
1.95user 0.08system 0:02.03elapsed 100%CPU
$ time python ./norvig.py
Trained in 1.46250891685 seconds
./norvig.py
1.95user 0.09system 0:02.04elapsed 100%CPU
$ time python spelling.py because
real 0m1.071s user 0m0.821s sys 0m0.139s
$ time ./spelling becuase becuase -> because
real 0m5.589s user 0m4.554s sys 0m0.307s
And, strangely, the rewrite you provided (I called it "spelling_df") runs a fair bit slower:
$ time ./spelling_df becuase becuase -> because
real 0m8.087s user 0m6.966s sys 0m0.193s
$ time ./spelling korrekt korrekt -> correct
real 0m5.970s user 0m4.885s sys 0m0.300s
$ time ./spelling_df korrekt korrekt -> correct
real 0m8.616s user 0m7.538s sys 0m0.187s
I think I know what happened here:

$ ghc -fforce-recomp --make matthew -o matthew0
[1 of 1] Compiling Main             ( matthew.hs, matthew.o )
Linking matthew0 ...
$ ghc -O2 -fforce-recomp --make matthew -o matthew2
[1 of 1] Compiling Main             ( matthew.hs, matthew.o )
Linking matthew2 ...
$ time ./matthew0 becuase
becuase -> because
7.07user 0.21system 0:07.28elapsed 99%CPU
$ time ./matthew2 becuase
becuase -> because
6.01user 0.19system 0:06.21elapsed 100%CPU
$ ghc -fforce-recomp --make spellingBS -o spelling0
[1 of 1] Compiling Main             ( spellingBS.hs, spellingBS.o )
Linking spelling0 ...
$ ghc -O2 -fforce-recomp --make spellingBS -o spelling2
[1 of 1] Compiling Main             ( spellingBS.hs, spellingBS.o )
Linking spelling2 ...
$ time ./spelling0 becuase
becuase -> because
9.78user 0.09system 0:09.87elapsed 100%CPU
$ time ./spelling2 becuase
becuase -> because
3.57user 0.03system 0:03.60elapsed 100%CPU

I habitually compile all code with -O2, unless I have a specific reason not to. I tend to forget that some compile without optimisations. For some kinds of code that makes hardly any difference, for others it makes a big difference.

*** Don't even think of using ByteStrings without optimising. ***
readFile does not appear in my profile.
Apologies, I should have said that I’d inserted some SCCs to try to tease out the cost of readFile (i.e. {-# SCC "readFile" #-}).
If you insert an SCC for updateMap,
where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model
, you'll see that the really bad citizen is updateMap (splitWords is rather bad, too, together they take some 95% of the time in that profile).
Maybe I'm doing this wrong, but I see "splitWords" in spelling_df taking 80% of runtime. Adding SCC's like this:
splitWords = {-# SCC "filter" #-} filter (not . B.null) .
             {-# SCC "splitWith" #-} B.splitWith isNogud .
             {-# SCC "toLower" #-} B.map toLower
gives me:
splitWords  Main  216        1   0.0   0.0  78.6  91.8
filter      Main  217        1   1.9   3.0  78.6  91.8
splitWith   Main  218        1  28.4  36.8  76.7  88.8
isNogud     Main  221  6488666   4.2   4.1   4.2   4.1
toLower     Main  219        1  44.2  47.9  44.2  47.9
i.e. it seems that "splitWith" and "toLower" (!) are the culprits. Why, I have no idea.
Am I reading this wrong?
No, you're just compiling it wrong :) If I profile without optimising, I get

Sun Jan 24 11:37 2010 Time and Allocation Profiling Report (Final)

    pspellingBS0 +RTS -P -RTS becuase

total time  = 16.46 secs   (823 ticks @ 20 ms)
total alloc = 4,088,410,184 bytes  (excludes profiling overheads)

COST CENTRE   MODULE   %time %alloc  ticks      bytes
toLower       Main      36.3   45.9    299  468806205
splitWith     Main      30.6   33.9    252  346657565
train         Main      25.8   13.5    212  138466403
isNogud       Main       4.3    3.8     35   38931996
filter        Main       2.1    2.7     17   27466769

which is compatible with your results. With -O2:

Sun Jan 24 11:33 2010 Time and Allocation Profiling Report (Final)

    pspellingBS +RTS -P -RTS becuase

total time  = 5.66 secs   (283 ticks @ 20 ms)
total alloc = 708,686,372 bytes  (excludes profiling overheads)

COST CENTRE   MODULE   %time %alloc  ticks      bytes
updateMap     Main      68.6   76.9    194  136314135
toLower       Main      16.6    0.9     47    1622182
splitWith     Main       6.7   14.3     19   25366835
train         Main       3.5    2.5     10    4362826
filter        Main       2.1    4.4      6    7736998
splitWords    Main       1.8    0.0      5          0

which gives a completely different picture. Still, toLower takes a significant part of the time; we can drastically reduce that by exploiting the fact that we don't need to handle the whole Unicode complexity:

Sun Jan 24 11:53 2010 Time and Allocation Profiling Report (Final)

    pspellingBSW +RTS -P -RTS becuase

total time  = 4.72 secs   (236 ticks @ 20 ms)
total alloc = 708,686,372 bytes  (excludes profiling overheads)

COST CENTRE   MODULE   %time %alloc  ticks      bytes
updateMap     Main      76.3   76.9    180  136314135
splitWith     Main       9.7   14.3     23   25366835
filter        Main       5.1    4.4     12    7736998
train         Main       3.0    2.5      7    4362826
splitWords    Main       2.1    0.0      5          0
isNogud       Main       2.1    0.0      5          0
mkLow         Main       1.7    0.9      4    1622182

And, without profiling:

$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU

Finally, building the set of two-step edits takes longer than the additional lookups:

$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU
$ time ./spellingBSW korrekt
korrekt -> correct
3.50user 0.02system 0:03.52elapsed 100%CPU

vs.

$ time ./spellingBSWL becuase
becuase -> because
2.79user 0.04system 0:02.83elapsed 100%CPU
$ time ./spellingBSWL3 korrekt
korrekt -> correct
3.20user 0.02system 0:03.23elapsed 99%CPU

Which is reached with

----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Prelude hiding (or)  -- the local 'or' below would otherwise clash with Prelude.or
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString as BS
import Data.Bits
import Data.Word (Word8)
import Data.Char (toLower)
import Data.Map (Map, findWithDefault, insertWith', keysSet, empty, member)
import qualified Data.Map as Map (lookup, empty, size)
import Data.Set (toList, fromList)
import Data.List (inits, tails, foldl')
import System.Environment (getArgs)

dataFile = "big.txt"
alphabet = "abcdefghijklmnopqrstuvwxyz"

splitWords :: B.ByteString -> [B.ByteString]
splitWords = {-# SCC "filter" #-} filter (not . BS.null) .
             {-# SCC "splitWith" #-} BS.splitWith isNogud .
             {-# SCC "mkLow" #-} BS.map mkLow

mkLow :: Word8 -> Word8
mkLow x = x .|. 32

isNogud :: Word8 -> Bool
isNogud c = c < 97 || 122 < c

train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap Map.empty
  where
    updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model

nwords :: IO (Map B.ByteString Int)
nwords = (return $!) . train . splitWords =<< B.readFile dataFile

edits1 :: String -> [String]
edits1 s = deletes ++ transposes ++ replaces ++ inserts
  where
    deletes    = [a ++ bs | (a, _:bs) <- splits]
    transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits]
    replaces   = [a ++ (c:bs) | (a, _:bs) <- splits, c <- alphabet]
    inserts    = [a ++ (c:b) | (a, b) <- splits, c <- alphabet]
    splits     = zip (inits s) (tails s)

correct :: Map B.ByteString Int -> String -> String
correct wordCounts word = B.unpack $ maxCount (B.pack "?") 0 candidates
  where
    candidates :: [B.ByteString]
    candidates = known [word] `or` ((known e1) `or` known_edits2)

    e1 :: [String]
    e1 = toList . fromList $ edits1 word

    known_edits2 :: [B.ByteString]
    known_edits2 = [w3 | w1 <- e1, w2 <- edits1 w1
                       , let w3 = B.pack w2, w3 `member` wordCounts]

    known :: [String] -> [B.ByteString]
    known ws = {-# SCC "known" #-} [w | w <- map B.pack ws, w `member` wordCounts]

    maxCount :: B.ByteString -> Int -> [B.ByteString] -> B.ByteString
    maxCount best cmax (word:more)
        | cmax < count = maxCount word count more
        | otherwise    = maxCount best cmax more
      where
        count = findWithDefault 1 word wordCounts
    maxCount best _ _ = best

or :: [B.ByteString] -> [B.ByteString] -> [B.ByteString]
or a b
    | null a    = b
    | otherwise = a

main :: IO ()
main = do
    args <- getArgs
    wordCounts <- nwords
    mapM_ (printCorrect wordCounts) args
  where
    printCorrect :: Map B.ByteString Int -> String -> IO ()
    printCorrect wordCounts word =
        putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------
But once you start needing two edits (try korrekt), correct and edits1 start to show up. That shows that Norvig's algorithm isn't really good. With two edit steps, you create a _lot_ of strings you need to look up, far more than there are in the map. That takes time. It'll be better to scan the map for entries with an edit distance (< 3) if you have a good method to check that
Indeed:

$ time ./nLDBSWSpelling becuase
becuase -> because
2.84user 0.02system 0:02.86elapsed 100%CPU
$ time ./nLDBSWSpelling korrekt
korrekt -> correct
2.83user 0.05system 0:02.88elapsed 100%CPU
(http://old.nabble.com/haskell-in-online-contests-td26546989.html contains pointers for that).
Will have a look at that: it looks like it'll be very informative.
Another thing is
allWords = keysSet wordCounts
Ouch. For each correction, you construct that set anew. Just use member from Data.Map instead of Data.Set.member and look up the words in the map.
Whoops! Just to be clear though: Haskell will memoise the result of "allWords" for a given invocation of "correct"?
Yes. But not across different invocations.
So this would only make a large difference for multiple corrections?
Right. But that's the interesting use case, isn't it?
(which I wasn't worrying about for the moment). The change seems to wipe off about 0.2s on average.
Which is pretty bad for multiple corrections.
I will look at using Bloom filters or tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6MB file?
No way. But it doesn't seem to, from my GHC's point of view.
Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:
myReadFile Main 210 1 35.8 8.6 35.8 8.6
i.e. 35.8% of runtime.
Can I see the exact code which gives that profile? Somehow, things that shouldn't be are getting attributed to readFile.
Sorry to dump such a long post on the list — I’ll understand if no one can be bothered rehashing this. But, in summary I’d like to know:
(a) how could I use ByteString’s for this to speed up I/O and reduce memory usage without losing the nice readability?
And thanks for the other small tips (e.g. findWithDefault), and I didn't know you could use let the way you did either.
Cheers,
Matthew.

On 24/01/2010, at 10:22 PM, Daniel Fischer wrote: <...>
I think I know what happened here:
$ ghc -fforce-recomp --make matthew -o matthew0
<...>
I habitually compile all code with -O2, unless I have a specific reason not to. I tend to forget that some compile without optimisations. For some kinds of code that makes hardly any difference, for others it makes a big difference.
I used the flags "-funbox-strict-fields -fvia-C -optc-O2", but *not* -O2. Whoops! I could kick myself: I blindly assumed that -optc-O2 would turn on optimisation, but of course that's just the flag for GCC.

$ time ./spelling becuase
becuase -> because
real    0m4.467s
user    0m3.865s
sys     0m0.280s

$ time ./spelling_df becuase
becuase -> because
real    0m2.422s
user    0m2.198s
sys     0m0.081s

Your previous version is close to twice as fast, and now only 2.5 times slower than Python.

<snipped new version of code with toLower removed>

With your suggested changes, your latest version on my machine:

$ time ./spelling_df becuase
becuase -> because
real    0m1.731s
user    0m1.572s
sys     0m0.056s

Now, we're getting close: 4.7s -> 2.3s -> 1.7s.
But once you start needing two edits (try korrekt), correct and edits1 start to show up. That shows that Norvig's algorithm isn't really good. With two edit steps, you create a _lot_ of strings you need to look up, far more than there are in the map. That takes time. It'll be better to scan the map for entries with an edit distance (< 3) if you have a good method to check that
Indeed:

$ time ./nLDBSWSpelling becuase
becuase -> because
2.84user 0.02system 0:02.86elapsed 100%CPU
$ time ./nLDBSWSpelling korrekt
korrekt -> correct
2.83user 0.05system 0:02.88elapsed 100%CPU
Not sure if I see what you're saying here: do you mean to point out the 2.86 vs 2.88 elapsed?
Another thing is
allWords = keysSet wordCounts
Ouch. For each correction, you construct that set anew. Just use member from Data.Map instead of Data.Set.member and look up the words in the map.
Whoops! Just to be clear though: Haskell will memoise the result of "allWords" for a given invocation of "correct"?
Yes. But not across different invocations.
So this would only make a large difference for multiple corrections?
Right. But that's the interesting use case, isn't it?
It will be when I get the rest of it working, yes :)
I will look at using Bloom filters or tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6MB file?
No way. But it doesn't seem to, from my GHC's point of view.
Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:
myReadFile Main 210 1 35.8 8.6 35.8 8.6
i.e. 35.8% of runtime.
Can I see the exact code which gives that profile? Somehow, things that shouldn't be are getting attributed to readFile.
The version at this link has myReadFile split out:

http://github.com/scramjet/spelling/blob/31071edb2166b2bc4d350358900d975228f...

Doing the same to your version has the same result:

myReadFile  Main  210  1  45.7  9.6  45.7  9.6

It does seem that the profiler is wrong or misleading somehow.

Two other quick questions:

(1) you added parentheses to "candidates":

    candidates = known [word] `or` ((known e1) `or` known_edits2)

The "or"s should short-circuit so that if "known [word]" is non-empty, the others shouldn't be evaluated. I read the above as forcing evaluation of the second "or" first: am I wrong?

(2) you eliminated the "fold" in "correct" in favour of a tail-recursive search in "maxCount": was this for style or performance reasons (or both :)?

Cheers,

Matthew.

Am Montag 25 Januar 2010 05:34:50 schrieb Matthew Phillips:
On 24/01/2010, at 10:22 PM, Daniel Fischer wrote:
<...>
I think I know what happened here:
$ ghc -fforce-recomp --make matthew -o matthew0
<...>
I habitually compile all code with -O2, unless I have a specific reason not to. I tend to forget that some compile without optimisations. For some kinds of code that makes hardly any difference, for others it makes a big difference.
I used the flags "-funbox-strict-fields -fvia-C -optc-O2", but *not* -O2. Whoops! I could kick myself: I blindly assumed that -optc-O2 would turn on optimisation, but of course that's just the flag for GCC.
By the way, compiling via C nowadays hardly ever produces faster code than the native code generator (most times I tried since 6.10, if there was a difference, native was faster).
$ time ./spelling becuase becuase -> because
real 0m4.467s user 0m3.865s sys 0m0.280s
$ time ./spelling_df becuase becuase -> because
real 0m2.422s user 0m2.198s sys 0m0.081s
Your previous version is close to twice as fast, and now only 2.5 times slower than Python.
<snipped new version of code with toLower removed>
With your suggested changes, your latest version on my machine:
$ time ./spelling_df becuase becuase -> because
real 0m1.731s user 0m1.572s sys 0m0.056s
Now, we're getting close: 4.7s -> 2.3s -> 1.7s.
But once you start needing two edits (try korrekt), correct and edits1 start to show up. That shows that Norvig's algorithm isn't really good. With two edit steps, you create a _lot_ of strings you need to look up, far more than there are in the map. That takes time. It'll be better to scan the map for entries with an edit distance (< 3) if you have a good method to check that
Indeed:

$ time ./nLDBSWSpelling becuase
becuase -> because
2.84user 0.02system 0:02.86elapsed 100%CPU
$ time ./nLDBSWSpelling korrekt
korrekt -> correct
2.83user 0.05system 0:02.88elapsed 100%CPU
Not sure if I see what you're saying here: do you mean to point out the 2.86 vs 2.88 elapsed?
Well, above the code, I had the times for Norvig's algorithm (creating all two-step edits and checking which are known):
And, without profiling:

$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU
Finally, building the set of two-step edits takes longer than the additional lookups:
$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU
$ time ./spellingBSW korrekt
korrekt -> correct
3.50user 0.02system 0:03.52elapsed 100%CPU
vs.
$ time ./spellingBSWL becuase
becuase -> because
2.79user 0.04system 0:02.83elapsed 100%CPU
$ time ./spellingBSWL3 korrekt
korrekt -> correct
3.20user 0.02system 0:03.23elapsed 99%CPU
For "becuase", which is a one-step edit, all take the same time (within normal fluctuation), 2.8x seconds. For the two-step edit "korrekt", the version building Sets takes 3.5 seconds, the version which doesn't build a Set of two-step edits takes 3.2 seconds, *and the version scanning the Map of known words for entries with a Levenshtein distance [modified to account for transpositions] less than 3, takes 2.8y seconds, practically the same time as for the one-step edit*.

It becomes more obvious, perhaps, if we test a couple of two-steppers:

Lazy Levenshtein:

time ./nLDBSWSpelling becrase korrekt halmos territoir gutzenperg
becrase -> because
korrekt -> correct
halmos -> holmes
territoir -> territory
gutzenperg -> gutenberg
2.94user 0.03system 0:02.97elapsed 100%CPU

just something like 0.1 - 0.15 seconds more than for "becuase", that makes 0.02 - 0.04 seconds per two-edit correction.

Sets:

$ time ./spellingBSW becrase korrekt halmos territoir gutzenperg
becrase -> because
korrekt -> correct
halmos -> holmes
territoir -> territory
gutzenperg -> gutenberg
7.48user 0.03system 0:07.55elapsed 99%CPU

Gewalt geschrien! That takes almost a second per two-edit correction.

List:

$ time ./spellingBSWL3 becrase korrekt halmos territoir gutzenperg
becrase -> because
korrekt -> correct
halmos -> holmes
territoir -> territory
gutzenperg -> gutenberg
5.54user 0.04system 0:05.58elapsed 99%CPU

Better, but still bad, roughly half a second per correction.

Python without psyco:

$ time python ./norvig.py becrase korrekt halmos territoir gutzenperg
Trained in 1.46993017197 seconds
because
correct
holmes
territory
gutenberg
3.00user 0.08system 0:03.09elapsed 99%CPU
$ time python ./norvig.py becuase
Trained in 1.49027395248 seconds
because
1.46user 0.08system 0:01.55elapsed 100%CPU

about 0.3 seconds per correction, and with:

$ time python ./norvig.py becrase korrekt halmos territoir gutzenperg
Trained in 1.22417902946 seconds
because
correct
holmes
territory
gutenberg
2.12user 0.09system 0:02.21elapsed 100%CPU
$ time python ./norvig.py becuase
Trained in 1.17486715317 seconds
because
1.14user 0.08system 0:01.23elapsed 99%CPU

about 0.2 seconds per correction.

(Just for the record, I found out what Python did in the half second I couldn't account for: it looked for matches for "./norvig.py", I forgot that Python passes the name of the script in argv[0] - d'oh)
Another thing is
allWords = keysSet wordCounts
Ouch. For each correction, you construct that set anew. Just use member from Data.Map instead of Data.Set.member and look up the words in the map.
Whoops! Just to be clear though: Haskell will memoise the result of "allWords" for a given invocation of "correct"?
Yes. But not across different invocations.
So this would only make a large difference for multiple corrections?
Right. But that's the interesting use case, isn't it?
It will be when I get the rest of it working, yes :)
I will look at using Bloom filters or tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6MB file?
No way. But it doesn't seem to, from my GHC's point of view.
Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:
myReadFile Main 210 1 35.8 8.6 35.8 8.6
i.e. 35.8% of runtime.
Can I see the exact code which gives that profile? Somehow, things which shouldn't must be attributed to readFile.
The version at this link has myReadFile split out.
http://github.com/scramjet/spelling/blob/31071edb2166b2bc4d350358900d975 228fd43b9/spelling.hs
Doing the same to your version has the same result:
myReadFile Main 210 1 45.7 9.6 45.7 9.6
It does seem that the profiler is wrong or misleading somehow.
Strange. Doing that here has no such effect. The only line in the profile (ByteString) where myReadFile occurs is

myReadFile  Main  260  1  0.0  0.9  0.0  0.9  0  6507348

(snipped whitespace): entered once, allocated 6.5 million bytes, took no measurable time. That holds both compiled via C and with the native code generator.

With String I/O, it costs 4.6% time and 8.1% allocation.
Two other quick questions:
(1) you added parentheses to "candidates":
candidates = known [word] `or` ((known e1) `or` known_edits2)
The "or"'s should short circuit so that if "known [word]" is non-empty, the others shouldn't be evaluated. I read the above as forcing evaluation of the second "or" first: am I wrong?
Au contraire, "Any operator lacking a fixity declaration is assumed to be infixl 9" says the report, so without parentheses it's parsed as

    (known [word] `or` known e1) `or` known_edits2

and both 'or's must be evaluated even if word is known. Since 'or' is lazy in its second argument, if we associate it as

    known [word] `or` (known e1 `or` known_edits2)

then, in case of a known word, we needn't look at the parenthesised part at all (it will probably not make a measurable difference in running time unless you check a couple of million known words, but still, it feels lazier). For a local definition, I thought an explicit fixity declaration was overkill.
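To make the fixity point concrete, here is a tiny self-contained illustration (renamed to orElse so it doesn't clash with Prelude.or): with an explicit infixr declaration the un-parenthesised chain associates to the right, and because the second argument is only demanded when the first list is empty, a known word never forces the expensive part.

infixr 2 `orElse`

orElse :: [a] -> [a] -> [a]
orElse a b
    | null a    = b
    | otherwise = a

-- With infixr 2 this parses as  kw `orElse` (ke1 `orElse` ke2),
-- so ke1 and ke2 are never evaluated when kw is non-empty.
demo :: [String]
demo = kw `orElse` ke1 `orElse` ke2
  where
    kw  = ["because"]
    ke1 = error "not demanded while kw is non-empty"
    ke2 = error "not demanded while kw is non-empty"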
(2) you eliminated the "fold" in "correct" in favour of a tail-recursive search in "maxCount": was this for style or performance reasons (or both :)?
Performance, kind of. Since the lists we fold over are in fact short, it doesn't really matter, but if they were long, there'd be the risk of unnecessarily allocating lots of pairs. I'm rather confident that with -O2, GHC will eliminate the pairs anyway, but without looking at the core, I'm not entirely sure. However, in fact it was an unthought-through rewrite: I had just seen someone get a stack overflow in spite of foldl', due to the laziness of the data constructors[*]. So I made sure that couldn't happen, without looking at the folded function to see if that already prevents it. And in fact it does, so using a foldl' if you use lists instead of Sets is fine with respect to style, even preferable.

[*] classic example: why will

average xs = sum / len
  where
    (sum, len) = foldl' accum (0, 0) xs
    accum (sm, ln) x = (sm + x, ln + 1)

cause a stack overflow for long lists?
Cheers,
Matthew.

Just wanted to follow up on this, after getting distracted by less important things.

Thanks to Daniel's suggestions re the ByteString variant, the fairly faithful Haskell translation of Norvig's Python clocks in at about 1.7 times slower than spelling.py, most of which is down to the fact that "train" doesn't have an O(1) hashmap to use. I tried using Data.Trie rather than Data.Map, but that made it slightly slower. A counting Bloom filter might be better, but that's going further than I have time for right now.

So, in "real" usage this simple version would probably be usable, since train would be done once. Of course there are far cleverer algorithms, but the version I have so far, I think, retains the value that Norvig's has as a learning example.

The latest version I came up with is here:

http://github.com/scramjet/spelling/blob/master/spelling.hs

Cheers, and thanks again,

Matthew.

P.S.
Daniel:
[*] classic example: why will
average xs = sum / len
  where
    (sum, len) = foldl' accum (0, 0) xs
    accum (sm, ln) x = (sm + x, ln + 1)
cause a stack overflow for long lists?
You gave a strong hint before this, so I'd guess it's due to the lazy tuple creation in "accum"?

On 25/01/2010, at 5:49 PM, Daniel Fischer wrote:
Am Montag 25 Januar 2010 05:34:50 schrieb Matthew Phillips:
On 24/01/2010, at 10:22 PM, Daniel Fischer wrote:
<...>
I think I know what happened here:
$ ghc -fforce-recomp --make matthew -o matthew0
<...>
I habitually compile all code with -O2, unless I have a specific reason not to. I tend to forget that some compile without optimisations. For some kinds of code that makes hardly any difference, for others it makes a big difference.
I used the flags "-funbox-strict-fields -fvia-C -optc-O2", but *not* -O2. Whoops! I could kick myself: I blindly assumed that -optc-O2 would turn on optimisation, but of course that's just the flag for GCC.
By the way, compiling via C nowadays hardly ever produces faster code than the native code generator (most times I tried since 6.10, if there was a difference, native was faster).
$ time ./spelling becuase becuase -> because
real 0m4.467s user 0m3.865s sys 0m0.280s
$ time ./spelling_df becuase becuase -> because
real 0m2.422s user 0m2.198s sys 0m0.081s
Your previous version is close to twice as fast, and now only 2.5 times slower than Python.
<snipped new version of code with toLower removed>
With your suggested changes, your latest version on my machine:
$ time ./spelling_df becuase becuase -> because
real 0m1.731s user 0m1.572s sys 0m0.056s
Now, we're getting close: 4.7s -> 2.3s -> 1.7s.
But once you start needing two edits (try korrekt), correct and edits1 start to show up. That shows that Norvig's algorithm isn't really good. With two edit steps, you create a _lot_ of strings you need to look up, far more than there are in the map. That takes time. It'll be better to scan the map for entries with an edit distance (< 3) if you have a good method to check that
Indeed:

$ time ./nLDBSWSpelling becuase
becuase -> because
2.84user 0.02system 0:02.86elapsed 100%CPU
$ time ./nLDBSWSpelling korrekt
korrekt -> correct
2.83user 0.05system 0:02.88elapsed 100%CPU
Not sure if I see what you're saying here: do you mean to point out the 2.86 vs 2.88 elapsed?
Well, above the code, I had the times for Norvig's algorithm (creating all two-step edits and checking which are known):
And, without profiling:

$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU
Finally, building the set of two-step edits takes longer than the additional lookups:
$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU
$ time ./spellingBSW korrekt
korrekt -> correct
3.50user 0.02system 0:03.52elapsed 100%CPU
vs.
$ time ./spellingBSWL becuase
becuase -> because
2.79user 0.04system 0:02.83elapsed 100%CPU
$ time ./spellingBSWL3 korrekt
korrekt -> correct
3.20user 0.02system 0:03.23elapsed 99%CPU
For "becuase", which is a one-step edit, all take the same time (within normal fluctuation), 2.8x seconds. For the two-step edit "korrekt", the version building Sets takes 3.5 seconds, the version which doesn't build a Set of two-step edits takes 3.2 seconds, *and the version scanning the Map of known words for entries with a Levenshtein distance [modified to account for transpositions] less than 3, takes 2.8y seconds, practically the same time as for the one-step edit*.
It becomes more obvious, perhaps, if we test a couple of two-steppers:
Lazy Levenshtein:
time ./nLDBSWSpelling becrase korrekt halmos territoir gutzenperg
becrase -> because
korrekt -> correct
halmos -> holmes
territoir -> territory
gutzenperg -> gutenberg
2.94user 0.03system 0:02.97elapsed 100%CPU
just something like 0.1 - 0.15 seconds more than for "becuase", that makes 0.02 - 0.04 seconds per two-edit correction.
Sets:
$ time ./spellingBSW becrase korrekt halmos territoir gutzenperg
becrase -> because
korrekt -> correct
halmos -> holmes
territoir -> territory
gutzenperg -> gutenberg
7.48user 0.03system 0:07.55elapsed 99%CPU
Gewalt geschrien! That takes almost a second per two-edit correction.
List:
$ time ./spellingBSWL3 becrase korrekt halmos territoir gutzenperg
becrase -> because
korrekt -> correct
halmos -> holmes
territoir -> territory
gutzenperg -> gutenberg
5.54user 0.04system 0:05.58elapsed 99%CPU
Better, but still bad, roughly half a second per correction.
Python without psyco:
$ time python ./norvig.py becrase korrekt halmos territoir gutzenperg
Trained in 1.46993017197 seconds
because
correct
holmes
territory
gutenberg
3.00user 0.08system 0:03.09elapsed 99%CPU
$ time python ./norvig.py becuase
Trained in 1.49027395248 seconds
because
1.46user 0.08system 0:01.55elapsed 100%CPU
about 0.3 seconds per correction
and with:
$ time python ./norvig.py becrase korrekt halmos territoir gutzenperg
Trained in 1.22417902946 seconds
because
correct
holmes
territory
gutenberg
2.12user 0.09system 0:02.21elapsed 100%CPU
$ time python ./norvig.py becuase
Trained in 1.17486715317 seconds
because
1.14user 0.08system 0:01.23elapsed 99%CPU
about 0.2 seconds per correction.
(Just for the record, I found out what Python did in the half second I couldn't account for: it looked for matches for "./norvig.py", I forgot that Python passes the name of the script in argv[0] - d'oh)
Another thing is
allWords = keysSet wordCounts
Ouch. For each correction, you construct that set anew. Just use member from Data.Map instead of Data.Set.member and look up the words in the map.
Whoops! Just to be clear though: Haskell will memoise the result of "allWords" for a given invocation of "correct"?
Yes. But not across different invocations.
So this would only make a large difference for multiple corrections?
Right. But that's the interesting use case, isn't it?
It will be when I get the rest of it working, yes :)
I will look at using Bloom filters or tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6MB file?
No way. But it doesn't seem to, from my GHC's point of view.
Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:
myReadFile Main 210 1 35.8 8.6 35.8 8.6
i.e. 35.8% of runtime.
Can I see the exact code which gives that profile? Somehow, things which shouldn't must be attributed to readFile.
The version at this link has myReadFile split out.
http://github.com/scramjet/spelling/blob/31071edb2166b2bc4d350358900d975 228fd43b9/spelling.hs
Doing the same to your version has the same result:
myReadFile Main 210 1 45.7 9.6 45.7 9.6
It does seem that the profiler is wrong or misleading somehow.
Strange. Doing that here has no such effect. The only line in the profile (ByteString) where myReadFile occurs is
myReadFile Main 260 1 0.0 0.9 0.0 0.9 0 6507348
(snipped whitespace): entered once, allocated 6.5 million bytes, took no measurable time.
That holds both compiled via C and with the native code generator.
With String IO, it costs 4.6% time and 8.1% allocation.
Two other quick questions:
(1) you added parentheses to "candidates":
candidates = known [word] `or` ((known e1) `or` known_edits2)
The "or"'s should short circuit so that if "known [word]" is non-empty, the others shouldn't be evaluated. I read the above as forcing evaluation of the second "or" first: am I wrong?
Au contraire, "Any operator lacking a fixity declaration is assumed to be infixl 9" says the report, so without parentheses, it's parsed as
(known [word] `or` known e1) `or` known_edits2
and both 'or's must be evaluated even if word is known. Since 'or' is lazy in its second argument, if we associate it
known [word] `or` (known e1 `or` known_edits2)
, in case of a known word, we needn't look at the parenthesis at all (it will probably not make a measurable difference in running time unless you check a couple of million known words, but still, it feels lazier). For a local definition, I thought an explicit fixity declaration was overkill.
(2) you eliminated the "fold" in "correct" in favour of a tail-recursive search in "maxCount": was this for style or performance reasons (or both :)?
Performance, kind of. Since the lists we fold over are in fact short, it doesn't really matter, but if they were long, there'd be the risk of unnecessarily allocating lots of pairs. I'm rather confident that with -O2, GHC will eliminate the pairs anyway, but without looking at the core, I'm not entirely sure. However, in fact it was an unthought-through rewrite because I just had someone with a stack overflow in spite of foldl' due to the laziness of the data constructors[*]. So I made sure that that couldn't happen, without looking at the folded function to see if that already prevents it. And in fact, it does, so using a foldl' if you use lists instead of Sets is fine, with respect to style, even preferable.
[*] classic example: why will
average xs = sum / len
  where
    (sum, len) = foldl' accum (0, 0) xs
    accum (sm, ln) x = (sm + x, ln + 1)
cause a stack overflow for long lists?
Cheers,
Matthew.

Am Montag 08 Februar 2010 22:58:03 schrieb Matthew Phillips:
P.S.
Daniel:
[*] classic example: why will
average xs = sum / len
  where
    (sum, len) = foldl' accum (0, 0) xs
    accum (sm, ln) x = (sm + x, ln + 1)
cause a stack overflow for long lists?
You gave a strong hint before this, so I'd guess it's due to the lazy tuple creation in "accum"?
Well, the tuples are forced, so in that sense, tuple creation is not lazy. But the components of the tuple are *not* forced, so they become large thunks and in that sense, tuple creation is lazy and that's indeed the cause of the stack overflow. I think you get full marks for this :)
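For completeness, a sketch of the usual fix: force the components as they are made (bang patterns on the accumulator, as the code earlier in this thread already does elsewhere), so no deep chains of (+) thunks build up.

{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

average :: [Double] -> Double
average xs = s / fromIntegral n
  where
    (s, n) = foldl' accum (0, 0 :: Int) xs
    -- The bangs force the running sum and length at every step,
    -- so the pair never carries a long chain of unevaluated additions.
    accum (!sm, !ln) x = (sm + x, ln + 1)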

Matthew Phillips-5 wrote:
I also found it to be very slow.
My variant: http://a-ejeli-tak.livejournal.com/1326.html (Spellchecker in Haskell)

The String version runs in 2.5 sec, the ByteString one in 1.2 sec, for just one word (i.e. just to build the tree), and 8 sec to check an input of 400 words (copied from Norvig's example). I think laziness helps here to avoid unnecessary checks (once the first match is found). I haven't tried it on larger data sets, nor tried to optimize it. Cheated on dictionary parsing though...

Am Dienstag 26 Januar 2010 14:11:06 schrieb Eduard Sergeev:
Matthew Phillips-5 wrote:
I also found it to be very slow.
My variant: http://a-ejeli-tak.livejournal.com/1326.html Spellchecker in Haskell String version runs in 2.5 sec, ByteString in 1.2 sec (just for one word e.g. just to build the tree). 8 sec to check input of 400 words (copied from Norvig's example).
Slower here: time for building the set is approximately equal to that of the frequency map [either String or ByteString], lookup a little slower if one edit is needed, much faster if two are needed (of course). But the lazy Levenshtein distance is much faster again; for the 'tests2' data from Norvig's http://norvig.com/spell.py (400 words),

$ xargs -a tdata.txt time ./nLDBSWSpelling > /dev/null
4.50user 0.03system 0:04.53elapsed 100%CPU
$ time ./esergSpellBS big.txt tdata.txt > /dev/null
28.23user 0.09system 0:28.32elapsed 100%CPU

Surprisingly (?), your plain String version is faster for that than your ByteString version:

$ time ./esergSpellS big.txt tdata.txt > /dev/null
25.07user 0.10system 0:25.18elapsed 99%CPU
I think laziness helps here to avoid unnecessary checks (once the first match is found).
But that's the point, these checks aren't unnecessary (unless the word under inspection is known). You want to propose the most likely correct word. If your input is "arthetic", should you return "aesthetic", just because it's the first of the (at least four) correct words with edit distance 2[*] which is produced by your arbitrary ordering of edit steps or it's the lexicographically smallest? I think you shouldn't.
Haven't tried it on a larger data sets neither tried to optimize it. Cheated on dictionary parsing though...
[*] The others I know are "bathetic", "pathetic" and "arthritic" - without context, I'd go for "arthritic" because I think spelling errors are more common in the middle of a word than at the beginning or at the end, but plain frequency analysis of the corpus suggests "pathetic".

Daniel Fischer-4 wrote:
But that's the point, these checks aren't unnecessary (unless the word under inspection is known). You want to propose the most likely correct word.
I just wanted to rewrite Norvig's original Python code in Haskell :) (maybe I misunderstood the algorithm?). Of course it is far from being able to produce the 'most likely correct' result.

Btw, where can I find the source for this super-fast 'nLDBSWSpelling' variant?

Am Dienstag 26 Januar 2010 16:46:42 schrieb Eduard Sergeev:
Daniel Fischer-4 wrote:
But that's the point, these checks aren't unnecessary (unless the word under inspection is known). You want to propose the most likely correct word.
I just wanted to rewrite original Norvig's Python code in Haskell :) (maybe I misunderstood the algorithm?).
Seems so. NWORDS is the frequency map built from the corpus.

    return max(candidates, key=NWORDS.get)

returns the candidate with the highest value in NWORDS, i.e. the candidate that occurred most often in the corpus (if there are several with the same highest count, I think the one found first is taken; the order in which an iterator traverses a Python set is not specified, IIRC, so it might be any of those).
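For reference, a Haskell counterpart of that Python line, written against the names used in this thread (a sketch, not taken from the posted code); note that maximumBy is partial, so the caller must make sure candidates is non-empty:

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M
import Data.List (maximumBy)
import Data.Ord (comparing)

-- max(candidates, key=NWORDS.get): the candidate seen most often in training.
best :: M.Map B.ByteString Int -> [B.ByteString] -> B.ByteString
best nwords = maximumBy (comparing (\w -> M.findWithDefault 0 w nwords))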
Of course it is far from being able to produce 'most likely correct' result.
Even taking word frequency into account doesn't get really close. You'd have to take into account that some errors are more common than others (e.g. award a penalty for words starting with a different letter, substitution cost should be lower for letters adjacent on common keyboards than for letters far apart, but it should also be lower for letter pairs of similar sound [e <-> i, d <-> t and so on], insertion/deletion cost should be lower for double letters ["diging" is more likely to be a misspelling of "digging" than of "diving", although g and v are neighbours on qwerty and qwertz keyboards], -able <-> -ible confusion is extremely common). It's really hairy. But a combination of edit distance and word frequency is a good start.
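Just to make the idea concrete, a toy sketch of such a weighting; the neighbour table and the cost values are invented for illustration, not taken from any real error model:

-- Cheaper substitutions for keyboard neighbours and similar-sounding letters.
-- Toy weights only; a serious error model would be learned from data.
substCost :: Char -> Char -> Double
substCost a b
    | a == b               = 0
    | near a b || near b a = 0.5   -- adjacent on a QWERTY keyboard
    | similar a b          = 0.5   -- e/i, d/t, c/k and the like
    | otherwise            = 1
  where
    near 'a' c = c `elem` "qwsz"
    near 's' c = c `elem` "awedxz"
    near 'e' c = c `elem` "wsdr"
    near _   _ = False
    similar x y = (x, y) `elem` pairs || (y, x) `elem` pairs
    pairs = [('e','i'), ('d','t'), ('c','k')]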
Btw, where can I find the source for this super-fast 'nLDBSWSpelling' variant?
Nowhere, unless you come over with a sixpack or two ;) It originated from a contest-related question (codechef, www.codechef.com, a fork or similar of SPOJ) at the end of November. To not spoil the contest, I didn't post the code then. When I first mentioned the idea in this thread, I hadn't ported the code to the current setting yet, so I couldn't post it even if I'd wanted to; besides, I didn't want to distract from the topic of porting Norvig's algorithm. But since you ask, and it's been long enough ago (and not directly applicable to the contest), here comes the modified source. I've added comments and a few further improvements; time for the 400 words is now

4.02user 0.04system 0:04.07elapsed 100%CPU

with 2.8s for building the map, so on average 3 milliseconds per correction :D

----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.ByteString.Unsafe (unsafeIndex)
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString as BS
import Data.Char (toLower)
import Data.Map (Map, findWithDefault, insertWith', member, assocs, empty)
import Data.List (inits, tails, foldl')
import System.Environment (getArgs)
import Data.Word (Word8)
import Data.Bits ((.|.))

dataFile = "big.txt"
alphabet = "abcdefghijklmnopqrstuvwxyz"

infixl 9 !
{-# INLINE (!) #-}
(!) :: B.ByteString -> Int -> Word8
(!) = unsafeIndex

{- Lazily calculate Levenshtein distance, cut off at 3,
   modified to have transpositions count as one edit. -}
distance :: B.ByteString -> B.ByteString -> Int
distance start target = go 0 m n
  where
    m = B.length start
    n = B.length target
    go l i j
        {- if number of edits so far + difference of lengths left is
           larger than 2, the total number of edits will be at least 3 -}
        | l+i > j+2 || l+j > i+2 = 3
        {- if start is completely consumed, we need j additional inserts -}
        | i == 0 = l+j
        {- if target is completely consumed, we need i additional deletions -}
        | j == 0 = l+i
        {- no edit nor branch if we look at identical letters -}
        | a == b = go l (i-1) (j-1)
        | otherwise =
            let -- replace
                x = go (l+1) (i-1) (j-1)
                -- insert
                y = go (l+1) i (j-1)
                -- delete
                z = go (l+1) (i-1) j
                -- transpose
                w = go (l+1) (i-2) (j-2)
                -- but only if the letters match
                t | i > 1 && j > 1 && b == start!(i-2) && a == target!(j-2) = w
                  | otherwise = 3
            in case compare i j of
                 -- if there's more of target left than of start, a deletion
                 -- can't give a path of length < 3, since after that we'd
                 -- need at least two inserts
                 LT -> t `seq` x `seq` y `seq` min x (min y t)
                 -- if both remaining segments have the same length,
                 -- we must try all edit steps
                 EQ -> t `seq` x `seq` y `seq` z `seq` min x (min y (min t z))
                 -- if there's more of start left than of target, an
                 -- insert would be pointless
                 GT -> t `seq` x `seq` z `seq` min x (min z t)
      where
        a = start!(i-1)
        b = target!(j-1)

splitWords :: B.ByteString -> [B.ByteString]
splitWords = filter (not . BS.null) . BS.splitWith isNogud . BS.map mkLow

{- quick and dirty toLower for ASCII letters -}
mkLow :: Word8 -> Word8
mkLow x = x .|. 32

{- not a lowercase ASCII letter -}
isNogud :: Word8 -> Bool
isNogud c = c < 97 || 122 < c

{- build map (word -> how often seen) -}
train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap empty
  where
    updateMap model word = insertWith' (+) word 1 model

{- read corpus and build map -}
nwords :: IO (Map B.ByteString Int)
nwords = (return $!) . train . splitWords =<< B.readFile dataFile

{- single edit modifications, don't reproduce original word -}
edits1 :: String -> [String]
edits1 s = deletes ++ transposes ++ replaces ++ inserts
  where
    deletes    = [a ++ bs | (a, _:bs) <- splits]
    transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits, b1 /= b2]
    replaces   = [a ++ (c:bs) | (a, l:bs) <- splits, c <- alphabet, c /= l]
    inserts    = [a ++ (c:b) | (a, b) <- splits, c <- alphabet]
    splits     = zip (inits s) (tails s)

correct :: Map B.ByteString Int -> String -> String
correct wordCounts word
    -- known word, trivial case
    | wrd `member` wordCounts = word
    -- no known single edit modification, so scan corpus
    -- of known words for entries of distance 2
    | null ed1 = mxBy2 qm 0 (assocs wordCounts)
    -- at least one known single edit modification, look for
    -- most frequent of them
    | otherwise = mxBy qm 0 ed1
  where
    wrd = B.pack word
    qm  = B.pack "?"
    -- list of known single edit modifications and their count
    ed1 = [(pw, c) | w <- edits1 word
                   , let { pw = B.pack w
                         ; c = findWithDefault 0 pw wordCounts }
                   , c > 0]
    mxBy w _ [] = B.unpack w
    mxBy w m ((n,c):ps)
        | m < c     = mxBy n c ps   -- new highest count
        | otherwise = mxBy w m ps
    -- if we land here, all known words have a distance of at least 2,
    -- we want the one with distance 2 and the highest count among those
    -- (if there are any), we start with the unknown-marker and count 0
    mxBy2 w _ [] = B.unpack w
    mxBy2 w f ((n,c):ps)
        -- if the new word's count isn't larger than the best we've
        -- found so far, we can discard it immediately;
        -- otherwise, calculate distance, if that's larger than 2,
        -- discard the word, otherwise we've found a new best
        | c <= f || d > 2 = mxBy2 w f ps
        | otherwise       = mxBy2 n c ps
      where
        d = distance wrd n

main :: IO ()
main = do
    args <- getArgs
    wordCounts <- nwords
    mapM_ (printCorrect wordCounts) $ map (map toLower) args
  where
    printCorrect :: Map B.ByteString Int -> String -> IO ()
    printCorrect wordCounts word =
        putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------
participants (3)
- Daniel Fischer
- Eduard Sergeev
- Matthew Phillips