
On Sun, 2007-04-22 at 11:51 -0400, Pete Kazmier wrote:
type WordFreq = M.Map B.ByteString Int
train:: [B.ByteString] -> WordFreq train words = frequencyMap where frequencyMap = foldr incWordCount M.empty words incWordCount w m = M.insertWith (+) w 1 m
Isn't this going to build unevaluate thunks for the counts? This is what frequency counting looks like in some of my old code: -- IM is Data.IntMap index ss = foldl' myupdate IM.empty ss where myupdate m k = let x = 1 + IM.findWithDefault 0 k m in x `seq` IM.insert k x m (edited a bit, but hopefully correct) I.e. use foldl', and look up previous value, succ and seq it, and put it back. Generally, I only seem to want strict Maps. It seems so easy to build a lazy one if you need it (data Lazy a = Lazy a), so I wonder if strict should be the default?
Instead of generating the (huge) list of misspelled words, you could calculate edit distance to each correctly spelled word? With a bound on allowable mistakes, this is a linear time operation using the standard dynamic programming algorithm.
Could you provide additional information on this "standard dynamic programming algorithm?" I'm not familiar with dynamic programming.
Sorry! Dynamic programming just means using memoizing on a recursive function to reduce its run time complexity (although I don't think I've seen that definition anywhere?). A simple example is finding the n'th Fibonacci number by calculating (and referring to) the whole list. For edit distance, you can do something like: data Edit a = Repl a a | Delete a | Insert a align (x:xs) (y:ys) = best [ Repl x y : align xs ys, Insert y : align (x:xs) ys, Delete x : align xs (y:ys)] Where 'best' is some fitness function, selecting the best alignment. Applying this directly gives an exponential algorithm, but since there are only n*m (n = lenght (x:xs), m = length (y:ys) possible ways to pick the remainders in the recursive calls throughout the execution, you might as well store them all in an n*m matrix - which only takes O(nm) time. This is probably the worst explanation of DP so far this century, but if you Google for it, you'll find tons of information. Wikipedia has an entry as well. Anyway: since the number of edits in this case is bounded (to four, since transposition can be modelled by an insertion and a deletion), you know the optimal solution will never stray further than two off the diagonal of the DP matrix, so you only need to caclulate a diagonal band (linear space/time), not the full matrix (quadratic). -k