
Daniel Fischer wrote:
> Heinrich Apfelmus:
>> For instance, your expression can be replaced by
>>
>>     filter (/=0) [hammingX x y | (x:xs) <- tails example, y <- xs]
>>
>> which cuts the total running time in half. It's still quadratic in the
>> length of example. I'm sure there are faster algorithms out there that
>> can bring it down to O(n log n) if you want.
>
> I don't think so. You can't calculate the Hamming distance of x and z
> from the distances between x and y and y and z, so you'd have to
> consider all pairs of distinct strings, wouldn't you?
And there I was sure about something once, only to see that it's actually really doubtful... ;)

The thing about the Hamming distance is that it's not a black box, so you can't get a lower bound by counting the minimum number of calls to hamming that have to be made. You are essentially arguing that the different Hamming distances are independent, which they are not. Not to mention that there are also "black-box" restrictions like the triangle inequality

    d(x,z) <= d(x,y) + d(y,z)

but that one is probably harmless. In short, the situation is similar to how the O(n*log n) sorting bound does not apply to radix sort.

Still, you are right to question my O(n*log n) claim; so far, my attempts at finding such an algorithm have failed.

More precisely, the goal is to make a histogram of the possible Hamming distances. We need at least O(n*w) time to do that, where n is the number of strings and w their maximum length; after all, we have to "touch" every character. For simplicity, assume that the characters are just one bit each. Furthermore, we can assume that w >= log n, for otherwise there are lots of duplicate strings which can be grouped together. In this notation, the simple algorithm takes O(n^2*w) time.

I did find a straightforward divide-and-conquer algorithm to tally small Hamming distances, but it's not good enough for the whole histogram. Here's the specification:

    countHamming :: Int -> [[Bool]] -> [[Bool]] -> Int
    countHamming d xs ys = length [() | x <- xs, y <- ys, hamming x y == d]

In other words, countHamming d xs ys counts the number of pairings (x,y) whose Hamming distance is exactly d.

Now, the idea is that this can be made faster for small d. For instance, for d == 0, we are essentially just counting the pairs of equal elements of xs and ys. By requiring that xs and ys be sorted, this can be done in linear time:

    countHamming 0 xs ys = ...  -- a variation on  merge xs ys

And for slightly larger d, we can partition the lists by their first bits and continue recursively:

    countHamming _ [] [] = 0
    countHamming d xs ys =
          countHamming (d-1) x0 y1 + countHamming (d-1) x1 y0
        + countHamming  d    x0 y0 + countHamming  d    x1 y1
        where
        (x0,x1) = partitionByHead xs
        (y0,y1) = partitionByHead ys

    partitionByHead xs = (behead True xs, behead False xs)
    behead b xs = [bs | (b':bs) <- xs, b == b']

To estimate the running time, we set n = length (xs ++ ys) and let

    T(d,n) = running time of  countHamming d xs ys

We started with T(0,n) = O(n) and want to discuss the recursive case. The idea is that each list is divided in half, so we get

    T(d,n) = 2*T(d-1,n/2) + 2*T(d,n/2)
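As an aside, to make this concrete, here is a self-contained sketch of the whole scheme. The helper names (countEqSorted, countHammingSpec) are my own inventions for this message, the base cases are only lightly checked, and the d == 0 case assumes that both input lists are sorted, a property which beheading preserves, so it suffices to sort once at the top level:

    import Data.List (group, sort)

    -- plain Hamming distance of two equal-length bit strings
    hamming :: [Bool] -> [Bool] -> Int
    hamming xs ys = length (filter id (zipWith (/=) xs ys))

    -- reference implementation, i.e. the specification above
    countHammingSpec :: Int -> [[Bool]] -> [[Bool]] -> Int
    countHammingSpec d xs ys =
        length [() | x <- xs, y <- ys, hamming x y == d]

    -- Precondition: xs and ys are sorted. Beheading preserves
    -- sortedness, so the precondition holds in all recursive calls.
    countHamming :: Int -> [[Bool]] -> [[Bool]] -> Int
    countHamming d xs ys
        | d < 0              = 0
        | null xs || null ys = 0
        | d == 0             = countEqSorted xs ys  -- the "merge" case
        | all null xs        = 0  -- bits exhausted, but d > 0 remains
        | otherwise          =
              countHamming (d-1) x0 y1 + countHamming (d-1) x1 y0
            + countHamming  d    x0 y0 + countHamming  d    x1 y1
        where
          (x0,x1) = partitionByHead xs
          (y0,y1) = partitionByHead ys

    -- split a list of strings by their first bit, dropping that bit
    partitionByHead :: [[Bool]] -> ([[Bool]], [[Bool]])
    partitionByHead xss = (behead True xss, behead False xss)

    behead :: Bool -> [[Bool]] -> [[Bool]]
    behead b xss = [bs | (b':bs) <- xss, b == b']

    -- count pairs of equal elements across two sorted lists,
    -- in linear time, by merging their runs
    countEqSorted :: Ord a => [a] -> [a] -> Int
    countEqSorted xs ys = go (runs xs) (runs ys)
      where
        runs = map (\g -> (head g, length g)) . group
        go [] _  = 0
        go _  [] = 0
        go a@((x,nx):as) b@((y,ny):bs)
            | x < y     = go as b
            | x > y     = go a  bs
            | otherwise = nx * ny + go as bs

For a full histogram one would sort once and then call this for every distance, say [countHamming d s s | d <- [0..w]] with s = sort example. Note that this counts ordered pairs, so every pair of distinct strings is counted twice, and each string is paired with itself at d == 0.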
From this recurrence, we can calculate
    T(1,n) = 2*T(0,n/2) + 2*T(1,n/2)
           = O(n) + 2*T(1,n/2)    -- remember quicksort!
           = O(n*log n)

    T(2,n) = O(n*log n) + 2*T(2,n/2)
           = O(n*(log n)^2)

and so on, yielding

    T(d,n) = O(n*(log n)^d)

Alas, this can be used to search a dictionary while accounting for spelling errors, but it's no good for calculating a full histogram, because it becomes prohibitively expensive when d ~ w/2 ~ (log n)/2.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com