
Daniel Fischer wrote:
> Heinrich Apfelmus:
>> For instance, your expression can be replaced by
>>
>>     filter (/=0) [hammingX x y | (x:xs) <- tails example, y <- xs]
>>
>> which cuts the total running time in half. It's still quadratic in the
>> length of example. I'm sure there are faster algorithms out there that
>> can bring it down to O(n log n) if you want.
>
> I don't think so. You can't calculate the Hamming distance of x and z
> from the distances between x and y and y and z, so you'd have to
> consider all pairs of distinct strings, wouldn't you?
And there I was sure about something once, only to see that it's actually really doubtful... ;)

The thing about the Hamming distance is that it's not a black box, so you can't get a lower bound by counting the minimum number of calls to hamming that have to be made. You are essentially arguing that the different Hamming distances are independent, which they are not. Not to mention that there are also "black-box" restrictions like the triangle inequality

    d(x,z) <= d(x,y) + d(y,z)

but that one is probably harmless. In short, the situation is similar to how the O(n*log n) sorting bound does not apply to radix sort.

Still, you are right to question my O(n*log n) claim; so far, my attempts at finding such an algorithm have failed.

More precisely, the goal is to make a histogram of the possible Hamming distances. We need at least O(n*w) time to do that, where n is the number of strings and w their maximum length; after all, we have to "touch" every character. For simplicity, assume that the characters are just one bit each. Furthermore, we can assume that w >= log n, for otherwise there are lots of duplicate strings which can be grouped together. In this notation, the simple algorithm takes O(n^2*w) time.

I did find a straightforward divide-and-conquer algorithm to tally small Hamming distances, but it's not good enough for the whole histogram. Here's the specification:

    countHamming :: Int -> [[Bool]] -> [[Bool]] -> Int
    countHamming d xs ys = length [() | x <- xs, y <- ys, hamming x y == d]

In other words, countHamming d xs ys counts the number of pairings (x,y) whose Hamming distance is exactly d.

Now, the idea is that this can be made faster for small d. For instance, for d == 0, we are essentially just counting the pairs of equal elements of xs and ys. By requiring that xs and ys be sorted, this can be done in linear time:

    countHamming 0 xs ys = ...  -- a variation on  merge xs ys

And for slightly larger d, we can partition the lists by their first bits and continue recursively:

    countHamming _ [] [] = 0
    countHamming d xs ys =
          countHamming (d-1) x0 y1 + countHamming (d-1) x1 y0
        + countHamming  d    x0 y0 + countHamming  d    x1 y1
        where
        (x0,x1) = partitionByHead xs
        (y0,y1) = partitionByHead ys

    partitionByHead xs = (behead True xs, behead False xs)
    behead b xs = [bs | (b':bs) <- xs, b == b']

To estimate the running time, we set n = length (xs ++ ys) and let

    T(d,n) = running time of  countHamming d xs ys

We started with T(0,n) = O(n) and want to discuss the recursive case. The idea is that each list is divided in half, so we get

    T(d,n) = 2*T(d-1,n/2) + 2*T(d,n/2)
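As an aside, to make this concrete, here is a self-contained sketch of the whole scheme. The helper names (countEqSorted, countHammingSpec) are my own inventions for this message, the base cases are only lightly checked, and the d == 0 case assumes that both input lists are sorted, a property which beheading preserves, so it suffices to sort once at the top level:

    import Data.List (group, sort)

    -- plain Hamming distance of two equal-length bit strings
    hamming :: [Bool] -> [Bool] -> Int
    hamming xs ys = length (filter id (zipWith (/=) xs ys))

    -- reference implementation, i.e. the specification above
    countHammingSpec :: Int -> [[Bool]] -> [[Bool]] -> Int
    countHammingSpec d xs ys =
        length [() | x <- xs, y <- ys, hamming x y == d]

    -- Precondition: xs and ys are sorted. Beheading preserves
    -- sortedness, so the precondition holds in all recursive calls.
    countHamming :: Int -> [[Bool]] -> [[Bool]] -> Int
    countHamming d xs ys
        | d < 0              = 0
        | null xs || null ys = 0
        | d == 0             = countEqSorted xs ys  -- the "merge" case
        | all null xs        = 0  -- bits exhausted, but d > 0 remains
        | otherwise          =
              countHamming (d-1) x0 y1 + countHamming (d-1) x1 y0
            + countHamming  d    x0 y0 + countHamming  d    x1 y1
        where
          (x0,x1) = partitionByHead xs
          (y0,y1) = partitionByHead ys

    -- split a list of strings by their first bit, dropping that bit
    partitionByHead :: [[Bool]] -> ([[Bool]], [[Bool]])
    partitionByHead xss = (behead True xss, behead False xss)

    behead :: Bool -> [[Bool]] -> [[Bool]]
    behead b xss = [bs | (b':bs) <- xss, b == b']

    -- count pairs of equal elements across two sorted lists,
    -- in linear time, by merging their runs
    countEqSorted :: Ord a => [a] -> [a] -> Int
    countEqSorted xs ys = go (runs xs) (runs ys)
      where
        runs = map (\g -> (head g, length g)) . group
        go [] _  = 0
        go _  [] = 0
        go a@((x,nx):as) b@((y,ny):bs)
            | x < y     = go as b
            | x > y     = go a  bs
            | otherwise = nx * ny + go as bs

For a full histogram one would sort once and then call this for every distance, say [countHamming d s s | d <- [0..w]] with s = sort example. Note that this counts ordered pairs, so every pair of distinct strings is counted twice, and each string is paired with itself at d == 0.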
From this recurrence, we can calculate
    T(1,n) = 2*T(0,n/2) + 2*T(1,n/2)
           = O(n) + 2*T(1,n/2)    -- remember quicksort!
           = O(n*log n)

    T(2,n) = O(n*log n) + 2*T(2,n/2)
           = O(n*(log n)^2)

and so on, yielding

    T(d,n) = O(n*(log n)^d)

Alas, this can be used to search a dictionary while accounting for spelling errors, but it's no good for calculating a full histogram, because it becomes prohibitively expensive when d ~ w/2 ~ (log n)/2.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com