
Am Montag 19 April 2010 14:13:53 schrieb Heinrich Apfelmus:
Arnoldo Muller wrote:
I want to generate some hamming distance statistics about a set of strings.
filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl]
[...]
-- function posted in this mailing list hamming2 :: String -> String -> Int hamming2 xs ys = length (filter not (zipWith (==) xs ys))
I am executing these functions millions of times and the bottleneck of my program is in them as explained by running in profiling mode with +RTS -K400M -p -RTS
The costlier function is the hamming distance COST CENTRE MODULE %time %alloc
hamming Distances 66.6 41.9
Another way to look at it is that you shouldn't optimize hamming itself, but rather make sure that it's called less often!
For instance, your expression can be replaced by
filter (/=0) [hammingX x y | (x:xs) <- tails example, y <- xs]
which cuts the total running time in half. It's still quadratic in the length of example . I'm sure there are faster algorithms out there that
If it's likely that there are many repetitions, collect the Strings in a Set/Map (depending on whether you're interested in the count) and call hamming only on the distinct pairs.
can bring it down to O(n log n) if you want.
I don't think so. You can't calculate the Hamming distance of x and z from the distances between x and y and y and z, so you'd have to consider all pairs of distinct strings, wouldn't you?
Regards, Heinrich Apfelmus