Re: [Haskell-cafe] FGL/Haskell and Hierarchical Clustering/dendograms

24 Dec 2009

      For completeness, you might then do the actual clustering something like:

------------------------------------------------------------------------
import Data.Tree
import Data.List
import Data.Function

-- ... code from before ...

cluster :: Ord cost
        => (a -> b)
        -> (a -> cost)
        -> Tree a -> Cluster (cost,[b])
cluster proj cost t =
    -- List can't be empty since Tree can't.
  let o:os = sortBy (compare `on` fst)
            . flatten
            . agglom proj cost
            $ t
  in foldl' cons (One o) os

data Cluster a
  = One a
  | Many [Cluster a]
  deriving(Eq,Ord,Read,Show)

instance Functor Cluster where
  fmap f (One a) = One (f a)
  fmap f (Many cs) = Many ((fmap . fmap) f cs)

cons :: Cluster a -> a -> Cluster a
cons c a = Many [c,One a]

{-
ghci> let c = cluster fst snd t

ghci> :t c
c :: Cluster (Cost, [Id])
ghci> c
Many [Many [Many [One (0,[4]),One (1,[3,1])],One (3,[2])],One (12,[5])]

ghci> :t fmap snd c
fmap snd c :: Cluster [Id]
ghci> fmap snd c
Many [Many [Many [One [4],One [3,1]],One [2]],One [5]]

ghci> :t fmap fst c
fmap fst c :: Cluster Cost
ghci> fmap fst c
Many [Many [Many [One 0,One 1],One 3],One 12]
-}
-------------------------------------------------------------------------------

Matt

On 12/23/09, Matt Morrow  wrote:
...
Hi Nikolas,
Interesting problem. I'd do something like the following, where
the initial spanning tree from you example (re-tree-ified) is:
{-
ghci> :t t
t :: Tree (Id, Cost)
g
ghci> ppT t
(4,0)
|
+- (3,1)
|  |
|  `- (1,1)
|
`- (2,3)
   |
   `- (5,12)
-}
and which results in the tree:
{-
ghci> let s = agglom fst snd t
ghci> :t s
s :: Tree (Cost, [Id])
ghci> ppT s
(0,[4])
|
+- (1,[3,1])
|
`- (3,[2])
   |
   `- (12,[5])
-}
which can then be flattened/etc as needed by further steps of the algo.
The code for `agglom':
-----------------------------------------------------------------------------
import Data.Tree
import Data.List
type Id = Int
type Cost = Int
t :: Tree (Id,Cost)
t = Node (4,0)
      [Node (3,1) [Node (1,1) []]
      ,Node (2,3) [Node (5,12) []]]
ppT :: Show a => Tree a -> IO ()
ppT = putStrLn . drawTree . fmap show
-- | Compress the incoming @Tree a@ with @accumEq@.
agglom :: Eq cost
       => (a -> b)
       -> (a -> cost)
       -> Tree a -> Tree (cost,[b])
agglom proj cost = go
  where accum = accumEq proj cost
        go (Node a []) = Node (cost a,[proj a]) []
        go (Node a ts) = let b = proj a
                             c = cost a
                             (bs,ss) = accum c ts
                          in Node (c,b:bs) (fmap go ss)
-- | Repeatedly @splitEq@, and return a pair
-- whose /first/ element is a list of the projected
-- @b@s from those root values along paths from
-- the roots of the trees in the incoming @[Tree a]@
-- which have @cost@ equal to the third function parameter,
-- and whose /second/ element is the (concatenation of the)
-- list(s) gotten from each of the @splitEq@ calls.
accumEq :: Eq cost
        => (a -> b)
        -> (a -> cost) -> cost
        -> [Tree a] -> ([b],[Tree a])
accumEq proj cost c = go [] []
  where split ts = splitEq proj cost c ts
        go xs ys [] = (xs,ys)
        go xs ys ts = let (eqs,neqs) = split ts
                      in case eqs of
                          []-> ([],ts)
                          _ -> let (bs,tss) = unzip eqs
                                in go (bs++xs)
                                      (neqs++ys)
                                      (concat tss)
-- | Split the incoming trees into
--  (1) a @[(b,Tree a)]@ of each @b@ is the
-- @proj@ected value from an @a@ where
-- the @cost@ of that @a@ is equal to
-- the third function parameter, and (2)
-- the members of the incoming @[Tree a]@
-- whose roots' costs are /not/ equal to
-- the third function parameter.
splitEq :: Eq cost
        => (a -> b)
        -> (a -> cost) -> cost
        -> [Tree a] -> ([(b,[Tree a])],[Tree a])
splitEq proj cost c = foldl' go ([],[])
  where go (!eqs,!neqs)
           t@(Node a ts)
          | c==cost a = ((proj a,ts):eqs,neqs)
          | otherwise = (eqs,t:neqs)
-----------------------------------------------------------------------------
Cheers,
Matt
On 12/23/09, Nikolas Borrel-Jensen  wrote:
...
Hi! I have some trouble implementing single-linkage clustering algorithm
by
using a minimum-spanning tree, so I would appreciate if some of you could
give me some advise.
I am implementing a single-linkage clustering algorithm, and my approach
is
to use minimum spanning trees for that task. I am using the library FGL (
http://web.engr.oregonstate.edu/~erwig/fgl/haskell/), and I have managed
to
compute a minimum spanning tree from an arbitrary fully connected graph
with
5 nodes. I get [ [(4,0) ] , [ (3,1) , (4,0) ] , [ (1,1) , (3,1) , (4,0) ]
,
[ (2,3) , (4,0) ] , [ (5,12) , (2,3) , (4,0) ] ], which is the root path
tree of the minimum spanning tree created by the function msTreeAt.
...
From that I would create a dendrogram. [ (1,1) , (3,1) , (4,0) ]  is
telling
that node 1,3 and 4 has the same cost, namely cost 1. Therefore these are
merged at level 1. At level 1 we now have 3 clusters: (1,3,4), 2 and 5.
Now
the second lowest should be merged, that is 2 and 4. BUT because 4 is
already merged in the cluster (1,3,4), we should merge (1,3,4) and 2 at
level 3 (because the cost is 3). Now at level 3 we have 2 clusters,
(1,2,3,4) and 5. Now we merge the last one at level 12: (1,2,3,4,5), and
we
are finished.
I have very hard to see, how this could be done efficiently without
pointers
(as in C). I have thought of just saving the nodes from the start of the
root path, and traversing it, but a lot of searching should be done all
the
time.
Can you please give me some advise on that?
Kind regards
Nikolas Borrel-Jensen
Computer Science
University Of Copenhagen