
On Sunday, 24 January 2010, 06:19:58, Matthew Phillips wrote:
Thanks very much Daniel for giving my (amateurish!) exercise such an in-depth look-over. Comments inline below.
On 23/01/2010, at 12:11 AM, Daniel Fischer wrote:
That does help, but the worst part is building the map. That takes a couple of seconds in Python, too. Just building the map takes 1.95s for Python, 3.6s (including GC) with strict ByteStrings, 4.2s with lazy ByteStrings and 6s with plain Strings here.
Correction: I took the total time of a no-argument run as the time it took Python to build the map. Adding timing around the map-building itself shows it takes ~1.45 seconds. I'm clueless as to what Python needs the remaining half second for.
I get the Python version running in about 1s, compared to 5.6s for the Haskell:
$ python --version
Python 2.6

With psyco.full():

$ time python ./norvig.py becuase
Trained in 1.16967606544 seconds
./norvig.py
because
1.52user 0.08system 0:01.60elapsed 100%CPU

without psyco:

$ time python ./norvig.py becuase
Trained in 1.45706319809 seconds
./norvig.py
because
1.95user 0.08system 0:02.03elapsed 100%CPU

$ time python ./norvig.py
Trained in 1.46250891685 seconds
./norvig.py
1.95user 0.09system 0:02.04elapsed 100%CPU
$ time python spelling.py because
real    0m1.071s
user    0m0.821s
sys     0m0.139s

$ time ./spelling becuase
becuase -> because

real    0m5.589s
user    0m4.554s
sys     0m0.307s
And, strangely, the rewrite you provided (I called it "spelling_df") runs a fair bit slower:
$ time ./spelling_df becuase
becuase -> because

real    0m8.087s
user    0m6.966s
sys     0m0.193s

$ time ./spelling korrekt
korrekt -> correct

real    0m5.970s
user    0m4.885s
sys     0m0.300s

$ time ./spelling_df korrekt
korrekt -> correct

real    0m8.616s
user    0m7.538s
sys     0m0.187s
I think I know what happened here:

$ ghc -fforce-recomp --make matthew -o matthew0
[1 of 1] Compiling Main             ( matthew.hs, matthew.o )
Linking matthew0 ...
$ ghc -O2 -fforce-recomp --make matthew -o matthew2
[1 of 1] Compiling Main             ( matthew.hs, matthew.o )
Linking matthew2 ...
$ time ./matthew0 becuase
becuase -> because
7.07user 0.21system 0:07.28elapsed 99%CPU
$ time ./matthew2 becuase
becuase -> because
6.01user 0.19system 0:06.21elapsed 100%CPU
$ ghc -fforce-recomp --make spellingBS -o spelling0
[1 of 1] Compiling Main             ( spellingBS.hs, spellingBS.o )
Linking spelling0 ...
$ ghc -O2 -fforce-recomp --make spellingBS -o spelling2
[1 of 1] Compiling Main             ( spellingBS.hs, spellingBS.o )
Linking spelling2 ...
$ time ./spelling0 becuase
becuase -> because
9.78user 0.09system 0:09.87elapsed 100%CPU
$ time ./spelling2 becuase
becuase -> because
3.57user 0.03system 0:03.60elapsed 100%CPU

I habitually compile all code with -O2, unless I have a specific reason not to. I tend to forget that some compile without optimisations. For some kinds of code that makes hardly any difference, for others it makes a big difference.

*** Don't even think of using ByteStrings without optimising. ***
readFile does not appear in my profile.
Apologies, I should have said that I'd inserted some SCC's to try to tease out the cost of readFile (i.e. {-# SCC "readFile" #-}).
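(For concreteness, a cost-centre wrapper of the kind described would look roughly like the sketch below. The myReadFile name matches the wrapper mentioned further down in this thread; the code itself is only illustrative, not the actual code from the profiled program.)

import qualified Data.ByteString.Char8 as B

-- Give the readFile call its own cost centre so the profiling report
-- lists its time and allocation separately from its callers.
myReadFile :: FilePath -> IO B.ByteString
myReadFile path = {-# SCC "myReadFile" #-} B.readFile path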
If you insert an SCC for updateMap,
where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model
, you'll see that the really bad citizen is updateMap (splitWords is rather bad too; together they take some 95% of the time in that profile).
Maybe I'm doing this wrong, but I see "splitWords" in spelling_df taking 80% of runtime. Adding SCC's like this:
splitWords = {-# SCC "filter" #-} filter (not . B.null)
           . {-# SCC "splitWith" #-} B.splitWith isNogud
           . {-# SCC "toLower" #-} B.map toLower
gives me:
splitWords   Main   216        1    0.0   0.0   78.6  91.8
filter       Main   217        1    1.9   3.0   78.6  91.8
splitWith    Main   218        1   28.4  36.8   76.7  88.8
isNogud      Main   221  6488666    4.2   4.1    4.2   4.1
toLower      Main   219        1   44.2  47.9   44.2  47.9
i.e. it seems that "splitWith" and "toLower" (!) are the culprits. Why, I have no idea.
Am I reading this wrong?
No, you're just compiling it wrong :) If I profile without optimising, I get

Sun Jan 24 11:37 2010 Time and Allocation Profiling Report (Final)

   pspellingBS0 +RTS -P -RTS becuase

total time  =  16.46 secs  (823 ticks @ 20 ms)
total alloc = 4,088,410,184 bytes  (excludes profiling overheads)

COST CENTRE  MODULE   %time %alloc   ticks      bytes
toLower      Main      36.3   45.9     299  468806205
splitWith    Main      30.6   33.9     252  346657565
train        Main      25.8   13.5     212  138466403
isNogud      Main       4.3    3.8      35   38931996
filter       Main       2.1    2.7      17   27466769

which is compatible with your results. With -O2:

Sun Jan 24 11:33 2010 Time and Allocation Profiling Report (Final)

   pspellingBS +RTS -P -RTS becuase

total time  =   5.66 secs  (283 ticks @ 20 ms)
total alloc = 708,686,372 bytes  (excludes profiling overheads)

COST CENTRE  MODULE   %time %alloc   ticks      bytes
updateMap    Main      68.6   76.9     194  136314135
toLower      Main      16.6    0.9      47    1622182
splitWith    Main       6.7   14.3      19   25366835
train        Main       3.5    2.5      10    4362826
filter       Main       2.1    4.4       6    7736998
splitWords   Main       1.8    0.0       5          0

which gives a completely different picture. Still, toLower takes a significant part of the time, we can drastically reduce that by exploiting the fact that we don't need to handle the whole Unicode complexity:

Sun Jan 24 11:53 2010 Time and Allocation Profiling Report (Final)

   pspellingBSW +RTS -P -RTS becuase

total time  =   4.72 secs  (236 ticks @ 20 ms)
total alloc = 708,686,372 bytes  (excludes profiling overheads)

COST CENTRE  MODULE   %time %alloc   ticks      bytes
updateMap    Main      76.3   76.9     180  136314135
splitWith    Main       9.7   14.3      23   25366835
filter       Main       5.1    4.4      12    7736998
train        Main       3.0    2.5       7    4362826
splitWords   Main       2.1    0.0       5          0
isNogud      Main       2.1    0.0       5          0
mkLow        Main       1.7    0.9       4    1622182

And, without profiling:

$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU

Finally, building the set of two-step edits takes longer than the additional lookups:

$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU
$ time ./spellingBSW korrekt
korrekt -> correct
3.50user 0.02system 0:03.52elapsed 100%CPU

vs.

$ time ./spellingBSWL becuase
becuase -> because
2.79user 0.04system 0:02.83elapsed 100%CPU
$ time ./spellingBSWL3 korrekt
korrekt -> correct
3.20user 0.02system 0:03.23elapsed 99%CPU

Which is reached with

----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString as BS
import Data.Bits
import Data.Word (Word8)
import Data.Char (toLower)
import Data.Map (Map, findWithDefault, insertWith', keysSet, empty, member)
import qualified Data.Map as Map (lookup, empty, size)
import Data.Set (toList, fromList)
import Data.List (inits, tails, foldl')
import System.Environment (getArgs)

dataFile = "big.txt"
alphabet = "abcdefghijklmnopqrstuvwxyz"

splitWords :: B.ByteString -> [B.ByteString]
splitWords = {-# SCC "filter" #-} filter (not . BS.null)
           . {-# SCC "splitWith" #-} BS.splitWith isNogud
           . {-# SCC "mkLow" #-} BS.map mkLow

mkLow :: Word8 -> Word8
mkLow x = x .|. 32

isNogud :: Word8 -> Bool
isNogud c = c < 97 || 122 < c

train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap Map.empty
  where
    updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model

nwords :: IO (Map B.ByteString Int)
nwords = (return $!) . train . splitWords =<< B.readFile dataFile

edits1 :: String -> [String]
edits1 s = deletes ++ transposes ++ replaces ++ inserts
  where
    deletes    = [a ++ bs | (a, _:bs) <- splits]
    transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits]
    replaces   = [a ++ (c:bs) | (a, _:bs) <- splits, c <- alphabet]
    inserts    = [a ++ (c:b) | (a, b) <- splits, c <- alphabet]
    splits     = zip (inits s) (tails s)

correct :: Map B.ByteString Int -> String -> String
correct wordCounts word = B.unpack $ maxCount (B.pack "?") 0 candidates
  where
    candidates :: [B.ByteString]
    candidates = known [word] `or` ((known e1) `or` known_edits2)

    e1 :: [String]
    e1 = toList . fromList $ edits1 word

    known_edits2 :: [B.ByteString]
    known_edits2 = [w3 | w1 <- e1, w2 <- edits1 w1
                       , let w3 = B.pack w2, w3 `member` wordCounts]

    known :: [String] -> [B.ByteString]
    known ws = {-# SCC "known" #-} [w | w <- map B.pack ws, w `member` wordCounts]

    maxCount :: B.ByteString -> Int -> [B.ByteString] -> B.ByteString
    maxCount best cmax (word:more)
      | cmax < count = maxCount word count more
      | otherwise    = maxCount best cmax more
      where
        count = findWithDefault 1 word wordCounts
    maxCount best _ _ = best

    or :: [B.ByteString] -> [B.ByteString] -> [B.ByteString]
    or a b
      | null a    = b
      | otherwise = a

main :: IO ()
main = do
    args <- getArgs
    wordCounts <- nwords
    mapM_ (printCorrect wordCounts) args
  where
    printCorrect :: Map B.ByteString Int -> String -> IO ()
    printCorrect wordCounts word =
        putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------
But once you start needing two edits (try korrekt), correct and edits1 start to show up. That shows that Norvig's algorithm isn't really good. With two edit steps, you create a _lot_ of strings you need to look up, far more than there are in the map. That takes time. It would be better to scan the map for entries within an edit distance (< 3), if you have a good method to check for that.
Indeed:

$ time ./nLDBSWSpelling becuase
becuase -> because
2.84user 0.02system 0:02.86elapsed 100%CPU
$ time ./nLDBSWSpelling korrekt
korrekt -> correct
2.83user 0.05system 0:02.88elapsed 100%CPU
(http://old.nabble.com/haskell-in-online-contests-td26546989.html contains pointers for that).
Will have a look at that: it looks like it'll be very informative.
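(To make the map-scanning idea concrete, here is a minimal sketch: walk the keys of the word-count map and keep those within edit distance 2, using a bounded Levenshtein test. The names withinTwo and bestByDistance are made up for illustration; this is not the nLDBSWSpelling code, and a serious version would prune the distance computation far more aggressively rather than scanning every key in full.)

----------------------------------------------------------------------
module EditScan (bestByDistance) where

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M
import Data.List (maximumBy)
import Data.Ord (comparing)

-- True iff the Levenshtein distance between a and b is at most 2.
-- The length check rules out hopeless pairs cheaply; the rest is the
-- textbook single-row dynamic programme (a real version would abandon
-- a row as soon as its minimum exceeds 2).
withinTwo :: B.ByteString -> B.ByteString -> Bool
withinTwo a b
  | abs (B.length a - B.length b) > 2 = False
  | otherwise = levenshtein (B.unpack a) (B.unpack b) <= 2
  where
    levenshtein xs ys = last (foldl step [0 .. length ys] xs)
      where
        step row@(r:rs) x = scanl calc (r + 1) (zip3 ys row rs)
          where calc left (y, diag, up) =
                  minimum [left + 1, up + 1, diag + fromEnum (y /= x)]
        step [] _ = []

-- Pick the most frequent dictionary word within distance 2 of the input,
-- scanning the map's keys instead of generating every two-step edit.
bestByDistance :: M.Map B.ByteString Int -> B.ByteString -> B.ByteString
bestByDistance wordCounts w
  | w `M.member` wordCounts = w
  | null close              = B.pack "?"
  | otherwise               = fst (maximumBy (comparing snd) close)
  where
    close = [kv | kv@(k, _) <- M.toList wordCounts, withinTwo w k]
----------------------------------------------------------------------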
Another thing is
allWords = keysSet wordCounts
Ouch. For each correction, you construct that set anew. Just use member from Data.Map instead of Data.Set.member and look up the words in the map.
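(Roughly, the change being suggested looks like the following; knownViaSet and knownViaMap are illustration names, the real function is the known in the listing above.)

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M
import qualified Data.Set as S

-- Before (sketch of the original): a whole key set is built and consulted.
knownViaSet :: M.Map B.ByteString Int -> [B.ByteString] -> [B.ByteString]
knownViaSet wordCounts ws = [w | w <- ws, w `S.member` allWords]
  where allWords = S.fromList (M.keys wordCounts)  -- "keysSet wordCounts" in the original

-- After: look the words up in the map directly, no intermediate set.
knownViaMap :: M.Map B.ByteString Int -> [B.ByteString] -> [B.ByteString]
knownViaMap wordCounts ws = [w | w <- ws, w `M.member` wordCounts]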
Whoops! Just to be clear though: Haskell will memoise the result of "allWords" for a given invocation of "correct"?
Yes. But not across different invocations.
So this would only make a large difference for multiple corrections?
Right. But that's the interesting use case, isn't it?
(which I wasn't worrying about for the moment). The change seems to wipe off about 0.2s on average.
Which is pretty bad for multiple corrections.
I will look at using Bloom filters or tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6 MB file?
No way. But it doesn't seem to, from my GHC's point of view.
Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:
myReadFile Main 210 1 35.8 8.6 35.8 8.6
i.e. 35.8% of runtime.
Can I see the exact code which gives that profile? Somehow, things which shouldn't must be attributed to readFile.
Sorry to dump such a long post on the list — I’ll understand if no one can be bothered rehashing this. But, in summary I’d like to know:
(a) how could I use ByteStrings for this to speed up I/O and reduce memory usage without losing the nice readability?
And thanks for the other small tips (e.g. findWithDefault), and I didn't know you could use let the way you did either.
Cheers,
Matthew.