
On Friday 22 January 2010 07:51:27, Matthew Phillips wrote:
> Hello all,
>
> sorry to bring up an old chestnut, but I’m trying to improve my Haskell-fu by writing a small program, and chose Peter Norvig’s spelling checker as my exercise material (http://norvig.com/spell-correct.html).
>
> While I’ve gotten it working, and found it quite illuminating, I also found it to be very slow. And then I discovered that, of course, others have been here well before me ([1] and [2]). Those discussions were very interesting, but the code they refer to is mostly not available, so the most I’ve gotten out of them so far is that:
>
> (a) I should be using strict left folds and strict Map insertions for the word frequency map (this shaved off about a second: ~5s -> ~4s for a single word on my 2.4GHz MacBook Pro, GHC 6.10.4)
> (b) I should probably be using ByteStrings
That does help, but the worst part is building the map, and that takes a couple of seconds in Python, too. Just building the map takes 1.95s in Python, 3.6s (including GC) with strict ByteStrings, 4.2s with lazy ByteStrings, and 6s with plain Strings here. So I'd go with strict ByteStrings: that takes a little more memory than lazy ByteStrings, but far less than Strings.
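For illustration, here's a minimal sketch of the difference (assuming the containers API of that era; insertWith' was later superseded by Data.Map.Strict):

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as Map
import Data.List (foldl')

-- Lazy: insertWith leaves a chain of unevaluated (+) thunks per key.
trainLazy :: [B.ByteString] -> Map.Map B.ByteString Int
trainLazy = foldr (\w m -> Map.insertWith (+) w 1 m) Map.empty

-- Strict: foldl' plus insertWith' forces each count as it is inserted.
trainStrict :: [B.ByteString] -> Map.Map B.ByteString Int
trainStrict = foldl' (\m w -> Map.insertWith' (+) w 1 m) Map.empty

The strict version keeps the counts evaluated as it goes instead of accumulating thunks that only get forced when a count is finally demanded.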
> (c) Using Sets for the edit permutations probably isn’t worth it (although I found using plain lists made it about a second slower)
Might make a difference once you need to take two edit steps on a not very short word.
> (b) is difficult because I’ve used pattern matching plus list comprehensions to generate the potential edits, and I really like how elegantly it pans out that way. Because ByteStrings are not lists, I can’t see a way to keep the current structure and use them.
Train with ByteStrings, then do the edits on Strings and pack for lookup.
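A sketch of what that looks like, reusing the known name from your code (wordCounts is assumed to be the trained map):

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as Map

-- Generate the edits on plain Strings (keeping the pattern-matching list
-- comprehensions), and only pack to a ByteString at the point of lookup.
known :: Map.Map B.ByteString Int -> [String] -> [B.ByteString]
known wordCounts ws = [bw | w <- ws, let bw = B.pack w, bw `Map.member` wordCounts]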
> The code is at [3] (link to version at time of post). Profiling [4] shows:
>
> $ ./spelling becuase +RTS -p
> becuase -> because
> $ cat spelling.prof
>         total time  =         4.02 secs   (201 ticks @ 20 ms)
>         total alloc = 1,544,257,792 bytes  (excludes profiling overheads)
>
> COST CENTRE    MODULE    %time  %alloc
>
> train          Main       52.7    19.7
> readFile       Main       28.9     8.6
> wordsBy        Main       10.9    49.5
> toLower        Main        7.0    21.8
> ...
>
> So it appears that "train" (building the freq map) and "readFile" in "nwords" are the places to hone.
readFile does not appear in my profile. If you insert an SCC for updateMap,

    updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model

you'll see that the really bad citizen is updateMap (splitWords is rather bad, too; together they take some 95% of the time in that profile).

But once you start needing two edits (try korrekt), correct and edits1 start to show up. That shows that Norvig's algorithm isn't really good: with two edit steps, you create a _lot_ of strings you need to look up, far more than there are in the map, and that takes time. It would be better to scan the map for entries within edit distance < 3, if you have a good method for checking that (http://old.nabble.com/haskell-in-online-contests-td26546989.html contains pointers for that).

Another thing:

    allWords = keysSet wordCounts

Ouch. For each correction, you construct that set anew. Just use member from Data.Map instead of Data.Set.member and look the words up in the map directly.
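Concretely, a sketch of that change (knownWord is a hypothetical name; your surrounding code is not shown here):

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as Map

-- Before: rebuilds a Set of every key on each correction.
--     allWords    = keysSet wordCounts
--     knownWord w = w `Set.member` allWords
-- After: look the word up in the map directly.
knownWord :: Map.Map B.ByteString Int -> B.ByteString -> Bool
knownWord wordCounts w = w `Map.member` wordCounts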
> I will look at using Bloom filters or tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6MB file?
No way. But it doesn't seem to, from my GHC's point of view.
> Sorry to dump such a long post on the list -- I’ll understand if no one can be bothered rehashing this. But, in summary, I’d like to know:
>
> (a) how could I use ByteStrings for this to speed up I/O and reduce memory usage without losing the nice readability?
A small rewrite of your code; I would have designed it slightly differently had I been using ByteStrings from the beginning, and the packing in known and known_edits2 isn't too pretty.

----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Prelude hiding (or)  -- we define our own `or` on Sets below
import qualified Data.ByteString.Char8 as B
import Data.Char (toLower)
import Data.Map (Map, findWithDefault, insertWith', member)
import qualified Data.Map as Map (empty)
import Data.Set as Set (Set, fromList, toList, fold, null)
import Data.List (inits, tails, foldl')
import System.Environment (getArgs)

dataFile :: FilePath
dataFile = "big.txt"

alphabet :: String
alphabet = "abcdefghijklmnopqrstuvwxyz"

-- Lower-case the text and split it into words, dropping everything
-- that isn't a letter.
splitWords :: B.ByteString -> [B.ByteString]
splitWords = filter (not . B.null) . B.splitWith isNogud . B.map toLower

isNogud :: Char -> Bool
isNogud c = c < 'a' || 'z' < c

-- Build the word-frequency map with a strict fold and strict insertion.
train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap Map.empty
  where
    updateMap model word = insertWith' (+) word 1 model

nwords :: IO (Map B.ByteString Int)
nwords = return . train . splitWords =<< B.readFile dataFile

-- All strings at edit distance one, generated on plain Strings.
edits1 :: String -> [String]
edits1 s = toList . fromList $ deletes ++ transposes ++ replaces ++ inserts
  where
    deletes    = [a ++ bs         | (a, _:bs)     <- splits]
    transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits]
    replaces   = [a ++ (c:bs)     | (a, _:bs)     <- splits, c <- alphabet]
    inserts    = [a ++ (c:b)      | (a, b)        <- splits, c <- alphabet]
    splits     = zip (inits s) (tails s)

correct :: Map B.ByteString Int -> String -> String
correct wordCounts word = B.unpack . fst $ fold maxCount (B.pack "?", 0) candidates
  where
    candidates :: Set B.ByteString
    candidates = known [word] `or` ((known $ edits1 word) `or` known_edits2 word)

    known_edits2 :: String -> Set B.ByteString
    known_edits2 w = fromList [w3 | w1 <- edits1 w, w2 <- edits1 w1
                                  , let w3 = B.pack w2, w3 `member` wordCounts]

    known :: [String] -> Set B.ByteString
    known ws = fromList [w | w <- map B.pack ws, w `member` wordCounts]

    maxCount :: B.ByteString -> (B.ByteString, Int) -> (B.ByteString, Int)
    maxCount word current@(_, currentMax)
        | count > currentMax = (word, count)
        | otherwise          = current
      where
        count = findWithDefault 1 word wordCounts

-- Take the first non-empty set.
or :: Set B.ByteString -> Set B.ByteString -> Set B.ByteString
or a b
    | Set.null a = b
    | otherwise  = a

main :: IO ()
main = do
    args <- getArgs
    wordCounts <- nwords
    mapM_ (printCorrect wordCounts) args
  where
    printCorrect :: Map B.ByteString Int -> String -> IO ()
    printCorrect wordCounts word =
        putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------
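If you save that as spelling.hs next to big.txt, it builds and runs like your original (the becuase output is the one from your profile run above):

$ ghc -O2 --make spelling.hs
$ ./spelling becuase
becuase -> because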
> (b) should readFile be so slow?
>
> (c) any other tips
Choose a better algorithm for the two-edit case.
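A sketch of the map-scanning idea mentioned above (withinDistance and candidatesWithin2 are hypothetical names, not from the thread): a row-by-row Levenshtein computation with an early cutoff, used to filter the trained map's keys instead of generating every two-edit string.

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as Map

-- True if the Levenshtein distance between a and b is at most k.
-- Standard dynamic programming, one row per character of b, with a
-- cutoff: once every entry in a row exceeds k, later rows can only
-- grow, so the distance cannot come back under the bound.
withinDistance :: Int -> String -> String -> Bool
withinDistance k a b =
    case go [0 .. length a] b of
      Nothing  -> False
      Just row -> last row <= k
  where
    go row []     = Just row
    go row (c:cs)
        | all (> k) row' = Nothing            -- whole row over budget
        | otherwise      = go row' cs
      where
        row' = nextRow row c
    nextRow row@(d:_) c = scanl step (d + 1) (zip3 a row (tail row))
      where
        step left (ca, diag, up)
            | ca == c   = diag                -- match: carry diagonal
            | otherwise = 1 + minimum [left, diag, up]

-- Candidates within two edits of the misspelt word, found by scanning
-- the map's keys rather than looking up all generated two-edit strings.
candidatesWithin2 :: Map.Map B.ByteString Int -> String -> [(B.ByteString, Int)]
candidatesWithin2 wordCounts w =
    [kv | kv@(bw, _) <- Map.toList wordCounts
        , withinDistance 2 w (B.unpack bw)]

For a vocabulary in the tens of thousands this is one linear scan with a cheap, early-terminating bound check per key, whereas the generate-and-look-up approach can produce hundreds of thousands of candidate strings for a longish word.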
> Possibly all my questions could be answered if someone has the code from the old posts.
>
> Cheers,
>
> Matthew.