[Haskell-cafe] Optimizing nearest-k code

24 May 2018

Dear Haskell Cafe,

Given a collection of sets and a particular target set, I want to find the
k sets that are nearest to the target set, where distance is the Hamming
distance (the size of the symmetric difference).

I am using the following code:

import Data.List
import qualified Data.Set as Set

nearest_k :: Ord a => Int -> [(Set.Set a, v)] -> Set.Set a -> [(Set.Set a, v)]
nearest_k k bs b = take k bs' where
    bs' = sortOn (hamming b) bs

hamming :: Ord a => Set.Set a -> (Set.Set a, v) -> Int
hamming x (y, _) = hamming_distance x y

hamming_distance :: Ord a => Set.Set a -> Set.Set a -> Int
hamming_distance xs ys =
    Set.size (Set.difference xs ys) + Set.size (Set.difference ys xs)

subsets :: [a] -> [[a]]
subsets []  = [[]]
subsets (x:xs) = subsets xs ++ map (x:) (subsets xs)

int_lists :: [[Int]]
int_lists = subsets [1..20]

values :: [(Set.Set Int, Int)]
values = map f (zip [1..] int_lists) where
    f (i, x) = (Set.fromList x, i)

test :: [(Set.Set Int, Int)]
test = nearest_k 8 values (Set.fromList [1,2,3])

----

This works ok for the test above (with sets of ints), but is rather slow in
my actual application (in which the sets are large sets of ground atoms of
first-order logic). Is there some major optimization I should be doing
here?
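
One direction that might be worth profiling, given only as a sketch: the
size of the symmetric difference equals |xs| + |ys| - 2 * |xs ∩ ys|, and
Set.size is O(1), so the distance can be computed with a single
intersection instead of two differences. The names hammingDistance' and
nearestK' below are made up for this illustration. Whether this helps with
large sets of ground atoms is an assumption that would need measuring,
since the cost of comparing atoms may well dominate.

```haskell
import Data.List (sortOn)
import qualified Data.Set as Set

-- Size of the symmetric difference, computed as |xs| + |ys| - 2 * |xs ∩ ys|.
-- This builds one intermediate set (the intersection) rather than two.
hammingDistance' :: Ord a => Set.Set a -> Set.Set a -> Int
hammingDistance' xs ys =
    Set.size xs + Set.size ys - 2 * Set.size (Set.intersection xs ys)

-- Same interface as nearest_k, but using the alternative distance.
-- sortOn evaluates the key once per element, so each distance is
-- computed only once.
nearestK' :: Ord a => Int -> [(Set.Set a, v)] -> Set.Set a -> [(Set.Set a, v)]
nearestK' k bs b = take k (sortOn (hammingDistance' b . fst) bs)
```

With the definitions from the code above, nearestK' 8 values (Set.fromList
[1,2,3]) should select the same entries as test, because both distance
functions return the same number for every pair of sets.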

thanks,
Richard

Richard Evans