
On Tuesday, 26 January 2010, 16:46:42, Eduard Sergeev wrote:
Daniel Fischer wrote:
But that's the point: these checks aren't unnecessary (unless the word under inspection is known). You want to propose the most likely correct word.
I just wanted to rewrite Norvig's original Python code in Haskell :) (maybe I misunderstood the algorithm?).
Seems so. NWORDS is the frequency map built from the corpus. return max(candidates, key=NWORDS.get) returns the candidate with the highest value in NWORDS, i.e. the candidate that occurred most often in the corpus. (If several share the highest count, I think the one found first is taken; the order in which an iterator traverses a Python set is unspecified, IIRC, so it might be any of those.)
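For comparison, that selection step in Haskell would be something like the sketch below. mostFrequent and its arguments are names made up for illustration, the candidate list is assumed non-empty, and note that Data.List.maximumBy keeps the last of several equally frequent candidates rather than the first:

import qualified Data.Map as M
import Data.List (maximumBy)
import Data.Ord (comparing)

-- candidate that occurred most often in the corpus;
-- candidates not in the map count as 0
mostFrequent :: M.Map String Int -> [String] -> String
mostFrequent freqs = maximumBy (comparing count)
  where
    count w = M.findWithDefault 0 w freqs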
Of course it is far from being able to produce the 'most likely correct' result.
Even taking word frequency into account doesn't get really close. You'd have to account for the fact that some errors are more common than others: award a penalty for words starting with a different letter; make the substitution cost lower for letters adjacent on common keyboards than for letters far apart, but also lower for letter pairs of similar sound (e <-> i, d <-> t and so on); make the insertion/deletion cost lower for double letters ("diging" is more likely to be a misspelling of "digging" than of "diving", although g and v are neighbours on QWERTY and QWERTZ keyboards); and -able <-> -ible confusion is extremely common. It's really hairy. But a combination of edit distance and word frequency is a good start.
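To make one of those ideas concrete, a weighted substitution cost might look like the toy sketch below. The function name, the cost values and the letter tables are all made up for illustration; the tables are tiny samples, not a real keyboard layout or phonetic model:

-- toy weighted substitution cost: cheaper for sound-alike pairs and
-- for neighbouring keys, full price otherwise
substCost :: Char -> Char -> Double
substCost a b
    | a == b                      = 0
    | (a, b) `elemSym` soundAlike = 0.4   -- similar sound: e <-> i, d <-> t, ...
    | (a, b) `elemSym` adjacent   = 0.6   -- neighbouring keys on QWERTY
    | otherwise                   = 1
  where
    -- membership regardless of the order of the pair
    elemSym (x, y) ps = (x, y) `elem` ps || (y, x) `elem` ps
    soundAlike = [('e','i'), ('d','t'), ('s','z')]
    adjacent   = [('q','w'), ('e','r'), ('g','v')]

Plugging something like this into an edit-distance computation means the accumulated cost becomes a Double and the cutoff a fractional threshold, but the recursion structure stays the same.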
Btw, where can I find the source for this super-fast 'nLDBSWSpelling' variant?
Nowhere, unless you come over with a sixpack or two ;) It originated in a contest-related question (codechef, www.codechef.com , a fork of SPOJ or similar) at the end of November. To avoid spoiling the contest, I didn't post the code then. When I first mentioned the idea in this thread, I hadn't yet ported the code to the current setting, so I couldn't post it even if I had wanted to; besides, I didn't want to distract from the topic of porting Norvig's algorithm. But since you ask, and it's been long enough ago (and it's not directly applicable to the contest), here is the modified source. I've added comments and a few further improvements; the time for the 400 words is now 4.02user 0.04system 0:04.07elapsed 100%CPU, of which 2.8s go to building the map, so on average about 3 milliseconds per correction :D

----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.ByteString.Unsafe (unsafeIndex)
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString as BS
import Data.Char (toLower)
import Data.Map (Map, findWithDefault, insertWith', member, assocs, empty)
import Data.List (inits, tails, foldl')
import System.Environment (getArgs)
import Data.Word (Word8)
import Data.Bits ((.|.))

dataFile = "big.txt"

alphabet = "abcdefghijklmnopqrstuvwxyz"

infixl 9 !
{-# INLINE (!) #-}
(!) :: B.ByteString -> Int -> Word8
(!) = unsafeIndex

{- Lazily calculate the Levenshtein distance, cut off at 3,
   modified to count transpositions as one edit. -}
distance :: B.ByteString -> B.ByteString -> Int
distance start target = go 0 m n
  where
    m = B.length start
    n = B.length target
    go l i j
        {- if the number of edits so far plus the difference of the
           remaining lengths is larger than 2, the total number of
           edits will be at least 3 -}
        | l+i > j+2 || l+j > i+2 = 3
        {- if start is completely consumed, we need j additional inserts -}
        | i == 0 = l+j
        {- if target is completely consumed, we need i additional deletions -}
        | j == 0 = l+i
        {- no edit and no branch if we look at identical letters -}
        | a == b = go l (i-1) (j-1)
        | otherwise =
            let -- replace
                x = go (l+1) (i-1) (j-1)
                -- insert
                y = go (l+1) i (j-1)
                -- delete
                z = go (l+1) (i-1) j
                -- transpose
                w = go (l+1) (i-2) (j-2)
                -- but only if the letters match
                t | i > 1 && j > 1 && b == start!(i-2) && a == target!(j-2) = w
                  | otherwise = 3
            in case compare i j of
                 -- if there's more of target left than of start, a deletion
                 -- can't give a path of length < 3, since after that we'd
                 -- need at least two inserts
                 LT -> t `seq` x `seq` y `seq` min x (min y t)
                 -- if both remaining segments have the same length,
                 -- we must try all edit steps
                 EQ -> t `seq` x `seq` y `seq` z `seq` min x (min y (min t z))
                 -- if there's more of start left than of target, an
                 -- insert would be pointless
                 GT -> t `seq` x `seq` z `seq` min x (min z t)
      where
        a = start!(i-1)
        b = target!(j-1)

splitWords :: B.ByteString -> [B.ByteString]
splitWords = filter (not . BS.null) . BS.splitWith isNogud . BS.map mkLow

{- quick and dirty toLower for ASCII letters -}
mkLow :: Word8 -> Word8
mkLow x = x .|. 32

{- not a lowercase ASCII letter -}
isNogud :: Word8 -> Bool
isNogud c = c < 97 || 122 < c

{- build the map (word -> how often it was seen) -}
train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap empty
  where
    updateMap model word = insertWith' (+) word 1 model
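{- Example: mkLow just sets bit 5, so 'A' (65) becomes 'a' (97); any
   byte that still isn't a lowercase ASCII letter afterwards acts as
   a separator, hence

     splitWords (B.pack "The quick-brown FOX!")
       == map B.pack ["the", "quick", "brown", "fox"]
-}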
{- read the corpus and build the map -}
nwords :: IO (Map B.ByteString Int)
nwords = (return $!) . train . splitWords =<< B.readFile dataFile

{- single-edit modifications; these don't reproduce the original word -}
edits1 :: String -> [String]
edits1 s = deletes ++ transposes ++ replaces ++ inserts
  where
    deletes    = [a ++ bs | (a, _:bs) <- splits]
    transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits, b1 /= b2]
    replaces   = [a ++ (c:bs) | (a, l:bs) <- splits, c <- alphabet, c /= l]
    inserts    = [a ++ (c:b) | (a, b) <- splits, c <- alphabet]
    splits     = zip (inits s) (tails s)

correct :: Map B.ByteString Int -> String -> String
correct wordCounts word
    -- known word, trivial case
    | wrd `member` wordCounts = word
    -- no known single-edit modification, so scan the corpus
    -- of known words for entries at distance 2
    | null ed1 = mxBy2 qm 0 (assocs wordCounts)
    -- at least one known single-edit modification; look for
    -- the most frequent of them
    | otherwise = mxBy qm 0 ed1
  where
    wrd = B.pack word
    qm  = B.pack "?"
    -- known single-edit modifications and their counts
    ed1 = [(pw, c) | w <- edits1 word
                   , let { pw = B.pack w
                         ; c  = findWithDefault 0 pw wordCounts }
                   , c > 0]
    mxBy w _ [] = B.unpack w
    mxBy w m ((n,c):ps)
        | m < c     = mxBy n c ps   -- new highest count
        | otherwise = mxBy w m ps
    -- if we land here, all known words are at distance >= 2 from wrd;
    -- we want the one with distance 2 and the highest count among those
    -- (if there are any), so we start with the unknown-marker and count 0
    mxBy2 w _ [] = B.unpack w
    mxBy2 w f ((n,c):ps)
        -- if the new word's count isn't larger than the best we've
        -- found so far, discard it immediately; otherwise calculate
        -- the distance: if that is larger than 2, discard the word,
        -- else we've found a new best
        | c <= f || d > 2 = mxBy2 w f ps
        | otherwise       = mxBy2 n c ps
      where
        d = distance wrd n

main :: IO ()
main = do
    args <- getArgs
    wordCounts <- nwords
    mapM_ (printCorrect wordCounts) $ map (map toLower) args
  where
    printCorrect :: Map B.ByteString Int -> String -> IO ()
    printCorrect wordCounts word =
        putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------
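A quick sanity check of the distance function in GHCi (an illustrative session, assuming the module above is saved as Main.hs):

ghci> :load Main.hs
ghci> distance (B.pack "abcd") (B.pack "acbd")      -- transposition counts as one edit
1
ghci> distance (B.pack "kitten") (B.pack "sitting") -- anything at 3 or more edits is reported as 3
3

The full program compiles with ghc -O2 and expects the big.txt corpus from Norvig's article in the working directory; which corrections it proposes of course depends on that corpus.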