
Am Samstag 28 November 2009 21:21:20 schrieb vishnu:
this is where I've gotten to. http://moonpatio.com/fastcgi/hpaste.fcgi/view?id=5120#a5120 strangely enough Ive gotten no speedup at all from the substitution cost UArray (though I had to make it Int, Int to deal with digits.).
Converting the characters with letterValue takes time. I'm a little surprised that it takes so much time, though. I would have expected it to still be faster than Map. If you make subArray :: UArray (Char,Char) Int subArray = array (('0','0'),('z'z')) ... you avoid the conversion at the price of a larger array. It's still small enough to have the entire computation data in the cache, so it should be faster. However, the Chars are converted to Ints for array-indexing anyway (I think Char is internally represented as a machine integer [wrapped in a constructor], so this is basically a no-op, even if not, it's going to be much faster than letterValue), so why not avoid the conversions (except once on reading) completely and work with Ints? Change all (UArray Int Char) to (UArray Int Int) and let getArray :: BS.ByteString -> UArray Int Int getArray xs = listArray (1, fromIntegral (BS.length xs)) (map letterValue $ BS.unpack xs) replace substitutionCost (orig ! i) (new ! j) with subArray!(orig!i,new!j) and remove substitutionCost (optional), that's all you need to change in your code - except: please fix the typo -- calculating the Leveishtein distance as described here ...................^^^^^^^^^^^^^^^^^^ :)
But still I wonder if there's something else I missed. Im really curious what lazyness you used to go from 60 to 1.6? I always thought lazyness was automatic and seq made strictness possible.
What you need is a sufficiently lazy *algorithm* to compute (min 3 $ distance orig new). For top speed, you must implement that algorithm sufficiently strictly ;) You might want to read carefully the "Possible improvements" section on WP to get an idea. I'll try to explain without giving too much away to respect the spirit of the codechef challenge. The Levenshtein algorithm for computing the cost of the cheapest editing sequence(s) transforming start (length m) into target (length n) computes the lowest costs for transforming initial sequences of start (length i) to initial sequences of target (length j), i ranging from 0 to m, j from 0 to n, altogether (m+1)*(n+1) costs. The costs for i == 0 or j == 0 are easily determined and if you calculate the costs in an appropriate order, calculating each cost is cheap. We are only interested in whether the cost (distance) is 0, 1, 2 or larger than 2. So whenever we stray more than two steps from the diagonal, we can stop. You approximate that behaviour by writing the value 3 to all cells far enough off the diagonal. But you're still writing (m+1)*(n+1) values/thunks to the array. Since the actual calculation of the costs is cheap, you don't win very much (cuts down execution time by a little more than half - not bad, but much more is possible). Also, you're always walking down the entire diagonal, even if one can see much sooner that the cost is larger than 2. Consider thisends -> herestop The last letters differ, so the cost is one of a) 2+cost (thisend -> heresto) -- substitution (s,p) b) 1+cost (thisends -> heresto) -- insert p c) 1+cost (thisend -> herestop) -- delete s a) last letters differ, another branch adding at least 1 to the cost, so after the second step we know that route leads to a total larger than 2 b) and c) need three steps to ascertain that the total cost exceeds 2 Now for long strings with large Levenshtein distance, this is typical (occasionally you'll encounter identical letters, but that doesn't take much time since it doesn't involve a branching), after three levels of branching, you know the cost exceeds 2, no need to go further. So a properly lazy algorithm stops processing as soon as it's certain that the distance is larger than 2. One way to do it is to calculate the distance using lazy Peano numbers and checking whether it's larger than 2: --------------------------------------------------------------------- data Nat = Zero | Succ Nat n2i :: Nat -> Int n2i (Succ n) = 1 + n2i n n2i _ = 0 i2n :: Int -> Nat i2n 0 = Zero i2n n = Succ (i2n (n-1)) minN :: Nat -> Nat -> Nat minN (Succ m) (Succ n) = Succ (minN m n) minN _ _ = Zero ldistance :: UArray Int Char -> UArray Int Char -> Nat ldistance orig new = minN (Succ (Succ (Succ Zero))) $ go m n where m = snd $ bounds orig n = snd $ bounds new go i j | i == 0 = i2n j | j == 0 = i2n i | a == b = go (i-1) (j-1) | otherwise = let h = costArray!(a,b) x = case h of 1 -> Succ (go (i-1) (j-1)) 2 -> Succ (Succ (go (i-1) (j-1))) y = Succ (go i (j-1)) z = Succ (go (i-1) j) in minN x (minN y z) where a = orig!i b = new!j distance :: UArray Int Char -> UArray Int Char -> Int distance orig new = n2i $ ldistance orig new --------------------------------------------------------------------- Performance still sucks (20s), partly because Nat is slow, partly because I intentionally pessimised (5s without that pessimisation), but it is a sufficiently lazy algorithm (almost, there's still a way to stop even earlier). Now find an efficient way to break early (hint: don't use a lazy datatype, use Int).
thanks Vishnu