FGL/Haskell and Hierarchical Clustering/dendograms

From that I would create a dendrogram. [ (1,1) , (3,1) , (4,0) ] is telling
Hi! I have some trouble implementing single-linkage clustering algorithm by using a minimum-spanning tree, so I would appreciate if some of you could give me some advise. I am implementing a single-linkage clustering algorithm, and my approach is to use minimum spanning trees for that task. I am using the library FGL ( http://web.engr.oregonstate.edu/~erwig/fgl/haskell/), and I have managed to compute a minimum spanning tree from an arbitrary fully connected graph with 5 nodes. I get [ [(4,0) ] , [ (3,1) , (4,0) ] , [ (1,1) , (3,1) , (4,0) ] , [ (2,3) , (4,0) ] , [ (5,12) , (2,3) , (4,0) ] ], which is the root path tree of the minimum spanning tree created by the function msTreeAt. that node 1,3 and 4 has the same cost, namely cost 1. Therefore these are merged at level 1. At level 1 we now have 3 clusters: (1,3,4), 2 and 5. Now the second lowest should be merged, that is 2 and 4. BUT because 4 is already merged in the cluster (1,3,4), we should merge (1,3,4) and 2 at level 3 (because the cost is 3). Now at level 3 we have 2 clusters, (1,2,3,4) and 5. Now we merge the last one at level 12: (1,2,3,4,5), and we are finished. I have very hard to see, how this could be done efficiently without pointers (as in C). I have thought of just saving the nodes from the start of the root path, and traversing it, but a lot of searching should be done all the time. Can you please give me some advise on that? Kind regards Nikolas Borrel-Jensen Computer Science University Of Copenhagen

Nikolas Borrel-Jensen
I have very hard to see, how this could be done efficiently without pointers (as in C). I have thought of just saving the nodes from the start of the root path, and traversing it, but a lot of searching should be done all the time.
I must admit I didn't follow your examples. But when I implemented single linkage clustering, I maintained a list of current clusters. Each cluster held a Set of its nodes, and traversing the list of edges from least cost to greatest, the clusters containing the end points of each edge was identified, and, if different, merged. It's probably possible to do it more efficiently, but I don't think it's too bad. -k -- If I haven't seen further, it is by standing in the footprints of giants

Hi Nikolas,
Interesting problem. I'd do something like the following, where
the initial spanning tree from you example (re-tree-ified) is:
{-
ghci> :t t
t :: Tree (Id, Cost)
g
ghci> ppT t
(4,0)
|
+- (3,1)
| |
| `- (1,1)
|
`- (2,3)
|
`- (5,12)
-}
and which results in the tree:
{-
ghci> let s = agglom fst snd t
ghci> :t s
s :: Tree (Cost, [Id])
ghci> ppT s
(0,[4])
|
+- (1,[3,1])
|
`- (3,[2])
|
`- (12,[5])
-}
which can then be flattened/etc as needed by further steps of the algo.
The code for `agglom':
-----------------------------------------------------------------------------
import Data.Tree
import Data.List
type Id = Int
type Cost = Int
t :: Tree (Id,Cost)
t = Node (4,0)
[Node (3,1) [Node (1,1) []]
,Node (2,3) [Node (5,12) []]]
ppT :: Show a => Tree a -> IO ()
ppT = putStrLn . drawTree . fmap show
-- | Compress the incoming @Tree a@ with @accumEq@.
agglom :: Eq cost
=> (a -> b)
-> (a -> cost)
-> Tree a -> Tree (cost,[b])
agglom proj cost = go
where accum = accumEq proj cost
go (Node a []) = Node (cost a,[proj a]) []
go (Node a ts) = let b = proj a
c = cost a
(bs,ss) = accum c ts
in Node (c,b:bs) (fmap go ss)
-- | Repeatedly @splitEq@, and return a pair
-- whose /first/ element is a list of the projected
-- @b@s from those root values along paths from
-- the roots of the trees in the incoming @[Tree a]@
-- which have @cost@ equal to the third function parameter,
-- and whose /second/ element is the (concatenation of the)
-- list(s) gotten from each of the @splitEq@ calls.
accumEq :: Eq cost
=> (a -> b)
-> (a -> cost) -> cost
-> [Tree a] -> ([b],[Tree a])
accumEq proj cost c = go [] []
where split ts = splitEq proj cost c ts
go xs ys [] = (xs,ys)
go xs ys ts = let (eqs,neqs) = split ts
in case eqs of
[]-> ([],ts)
_ -> let (bs,tss) = unzip eqs
in go (bs++xs)
(neqs++ys)
(concat tss)
-- | Split the incoming trees into
-- (1) a @[(b,Tree a)]@ of each @b@ is the
-- @proj@ected value from an @a@ where
-- the @cost@ of that @a@ is equal to
-- the third function parameter, and (2)
-- the members of the incoming @[Tree a]@
-- whose roots' costs are /not/ equal to
-- the third function parameter.
splitEq :: Eq cost
=> (a -> b)
-> (a -> cost) -> cost
-> [Tree a] -> ([(b,[Tree a])],[Tree a])
splitEq proj cost c = foldl' go ([],[])
where go (!eqs,!neqs)
t@(Node a ts)
| c==cost a = ((proj a,ts):eqs,neqs)
| otherwise = (eqs,t:neqs)
-----------------------------------------------------------------------------
Cheers,
Matt
On 12/23/09, Nikolas Borrel-Jensen
Hi! I have some trouble implementing single-linkage clustering algorithm by using a minimum-spanning tree, so I would appreciate if some of you could give me some advise.
I am implementing a single-linkage clustering algorithm, and my approach is to use minimum spanning trees for that task. I am using the library FGL ( http://web.engr.oregonstate.edu/~erwig/fgl/haskell/), and I have managed to compute a minimum spanning tree from an arbitrary fully connected graph with 5 nodes. I get [ [(4,0) ] , [ (3,1) , (4,0) ] , [ (1,1) , (3,1) , (4,0) ] , [ (2,3) , (4,0) ] , [ (5,12) , (2,3) , (4,0) ] ], which is the root path tree of the minimum spanning tree created by the function msTreeAt.
From that I would create a dendrogram. [ (1,1) , (3,1) , (4,0) ] is telling that node 1,3 and 4 has the same cost, namely cost 1. Therefore these are merged at level 1. At level 1 we now have 3 clusters: (1,3,4), 2 and 5. Now the second lowest should be merged, that is 2 and 4. BUT because 4 is already merged in the cluster (1,3,4), we should merge (1,3,4) and 2 at level 3 (because the cost is 3). Now at level 3 we have 2 clusters, (1,2,3,4) and 5. Now we merge the last one at level 12: (1,2,3,4,5), and we are finished.
I have very hard to see, how this could be done efficiently without pointers (as in C). I have thought of just saving the nodes from the start of the root path, and traversing it, but a lot of searching should be done all the time.
Can you please give me some advise on that?
Kind regards
Nikolas Borrel-Jensen Computer Science University Of Copenhagen

For completeness, you might then do the actual clustering something like:
------------------------------------------------------------------------
import Data.Tree
import Data.List
import Data.Function
-- ... code from before ...
cluster :: Ord cost
=> (a -> b)
-> (a -> cost)
-> Tree a -> Cluster (cost,[b])
cluster proj cost t =
-- List can't be empty since Tree can't.
let o:os = sortBy (compare `on` fst)
. flatten
. agglom proj cost
$ t
in foldl' cons (One o) os
data Cluster a
= One a
| Many [Cluster a]
deriving(Eq,Ord,Read,Show)
instance Functor Cluster where
fmap f (One a) = One (f a)
fmap f (Many cs) = Many ((fmap . fmap) f cs)
cons :: Cluster a -> a -> Cluster a
cons c a = Many [c,One a]
{-
ghci> let c = cluster fst snd t
ghci> :t c
c :: Cluster (Cost, [Id])
ghci> c
Many [Many [Many [One (0,[4]),One (1,[3,1])],One (3,[2])],One (12,[5])]
ghci> :t fmap snd c
fmap snd c :: Cluster [Id]
ghci> fmap snd c
Many [Many [Many [One [4],One [3,1]],One [2]],One [5]]
ghci> :t fmap fst c
fmap fst c :: Cluster Cost
ghci> fmap fst c
Many [Many [Many [One 0,One 1],One 3],One 12]
-}
-------------------------------------------------------------------------------
Matt
On 12/23/09, Matt Morrow
Hi Nikolas,
Interesting problem. I'd do something like the following, where the initial spanning tree from you example (re-tree-ified) is:
{- ghci> :t t t :: Tree (Id, Cost) g ghci> ppT t (4,0) | +- (3,1) | | | `- (1,1) | `- (2,3) | `- (5,12) -}
and which results in the tree:
{- ghci> let s = agglom fst snd t ghci> :t s s :: Tree (Cost, [Id]) ghci> ppT s (0,[4]) | +- (1,[3,1]) | `- (3,[2]) | `- (12,[5]) -}
which can then be flattened/etc as needed by further steps of the algo.
The code for `agglom':
----------------------------------------------------------------------------- import Data.Tree import Data.List
type Id = Int type Cost = Int
t :: Tree (Id,Cost) t = Node (4,0) [Node (3,1) [Node (1,1) []] ,Node (2,3) [Node (5,12) []]]
ppT :: Show a => Tree a -> IO () ppT = putStrLn . drawTree . fmap show
-- | Compress the incoming @Tree a@ with @accumEq@. agglom :: Eq cost => (a -> b) -> (a -> cost) -> Tree a -> Tree (cost,[b]) agglom proj cost = go where accum = accumEq proj cost go (Node a []) = Node (cost a,[proj a]) [] go (Node a ts) = let b = proj a c = cost a (bs,ss) = accum c ts in Node (c,b:bs) (fmap go ss)
-- | Repeatedly @splitEq@, and return a pair -- whose /first/ element is a list of the projected -- @b@s from those root values along paths from -- the roots of the trees in the incoming @[Tree a]@ -- which have @cost@ equal to the third function parameter, -- and whose /second/ element is the (concatenation of the) -- list(s) gotten from each of the @splitEq@ calls. accumEq :: Eq cost => (a -> b) -> (a -> cost) -> cost -> [Tree a] -> ([b],[Tree a]) accumEq proj cost c = go [] [] where split ts = splitEq proj cost c ts go xs ys [] = (xs,ys) go xs ys ts = let (eqs,neqs) = split ts in case eqs of []-> ([],ts) _ -> let (bs,tss) = unzip eqs in go (bs++xs) (neqs++ys) (concat tss)
-- | Split the incoming trees into -- (1) a @[(b,Tree a)]@ of each @b@ is the -- @proj@ected value from an @a@ where -- the @cost@ of that @a@ is equal to -- the third function parameter, and (2) -- the members of the incoming @[Tree a]@ -- whose roots' costs are /not/ equal to -- the third function parameter. splitEq :: Eq cost => (a -> b) -> (a -> cost) -> cost -> [Tree a] -> ([(b,[Tree a])],[Tree a]) splitEq proj cost c = foldl' go ([],[]) where go (!eqs,!neqs) t@(Node a ts) | c==cost a = ((proj a,ts):eqs,neqs) | otherwise = (eqs,t:neqs) -----------------------------------------------------------------------------
Cheers, Matt
On 12/23/09, Nikolas Borrel-Jensen
wrote: Hi! I have some trouble implementing single-linkage clustering algorithm by using a minimum-spanning tree, so I would appreciate if some of you could give me some advise.
I am implementing a single-linkage clustering algorithm, and my approach is to use minimum spanning trees for that task. I am using the library FGL ( http://web.engr.oregonstate.edu/~erwig/fgl/haskell/), and I have managed to compute a minimum spanning tree from an arbitrary fully connected graph with 5 nodes. I get [ [(4,0) ] , [ (3,1) , (4,0) ] , [ (1,1) , (3,1) , (4,0) ] , [ (2,3) , (4,0) ] , [ (5,12) , (2,3) , (4,0) ] ], which is the root path tree of the minimum spanning tree created by the function msTreeAt.
From that I would create a dendrogram. [ (1,1) , (3,1) , (4,0) ] is telling that node 1,3 and 4 has the same cost, namely cost 1. Therefore these are merged at level 1. At level 1 we now have 3 clusters: (1,3,4), 2 and 5. Now the second lowest should be merged, that is 2 and 4. BUT because 4 is already merged in the cluster (1,3,4), we should merge (1,3,4) and 2 at level 3 (because the cost is 3). Now at level 3 we have 2 clusters, (1,2,3,4) and 5. Now we merge the last one at level 12: (1,2,3,4,5), and we are finished.
I have very hard to see, how this could be done efficiently without pointers (as in C). I have thought of just saving the nodes from the start of the root path, and traversing it, but a lot of searching should be done all the time.
Can you please give me some advise on that?
Kind regards
Nikolas Borrel-Jensen Computer Science University Of Copenhagen

Thank you very much for your reply! I have been looking at the code, and there are two problems, as I can see. First, trying with the example t1 :: Tree (Id, Cost) t1 = Node (4,0) [Node (3,2) [Node (1,12) []] ,Node (2,3) [Node (5,1) [Node (6,2) [Node (7,2) [] ]]]] printed as (4,0) | +- (3,2) | | | `- (1,12) | `- (2,3) | `- (5,1) | `- (6,2) | `- (7,2) your function 'cluster fst snd t1' returns Many [Many [Many [Many [Many [One (0,[4]),One (1,[5])],One (2,[3])],One (2,[6,7])],One (3,[2])],One (12,[1])] I can't see how this representation is giving the hierarchical clusters. The example above should resolve into level 1: [[(2,3),(5,1)],[(6,2)],[(7,2)],[(4,0)], [(3,2)], [(1,12)]] level 2: [[(2,3),(5,1),(6,2),(7,2)], [(4,0),(3,2)], [(1,12)]] level 3: [[(2,3),(5,1),(6,2),(7,2),(4,0),(3,2)], [(1,12)]] level 4 (or (cost) level 12): [[(2,3),(5,1),(6,2),(7,2),(4,0),(3,2),(1,12)]] By doing it this way, we cluster all nodes connected with edges less than or equal x at (cost) level x. Clearly, we can have level 1: [[(1,1),(2,1)],[(3,1),(4,1)],...] if the edges between [(1,1),(2,1)] and [(3,1),(4,1)] are greater than 1. Second, I don't think it is trivial to tree-i-fy the root path tree. I have done the function treeifyMST, which surely isn't efficient, since the list encounteredNodes is traversed as many times as the number of nodes (a binary search tree would be more efficient). But more important, the tree isn't correct, since each path is connected at the root of the tree. Example (LRTree Int): [ [(1,0)],[(5,1),(1,0) ], [(2,2),(1,0)] , [(3,3),(2,2),(1,0)] , [(4,4),(2,2),(1,0)] ] -> [ [(5,1),(1,0) ] , [(3,3),(2,2),(1,0)] , [(4,4),(2,2),(1,0)] ] In my code, all 3 paths are branching at the root (1,0), but should for the last two paths branch at node (2,2). How should I cope with that in an efficient way? I wonder if if it is easier to implement it from the ground using the approach given at http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.htm... --------------------------------------------------------------------- --TO DO: now all paths are connected at the root of the tree. Should be patched at the right places inside the tree. The search in the list encounteredNodes is not efficient. treeifyMST :: LRTree Int -> Tree (Id,Cost) treeifyMST rootpathtree = let (LP rpt:rpts) = rootpathtree root = head rpt revrootpathtree = reverse rootpathtree in Node root (constructTree [] revrootpathtree) where constructTree :: [Int] -> LRTree Int -> [Tree (Id,Cost)] constructTree encounteredNodes (LP x:[]) = [] constructTree encounteredNodes (LP x:xs) = let path1 = x !! 0 path2 = x !! 1 id1 = fst path1 id2 = fst path2 in case (L.find (==id1) encounteredNodes) of -- because we have encountered an already processed id, we can skip this sublist Just _ -> constructTree (id2:encounteredNodes) xs -- new id, meaning that we have encountered a new path Nothing -> let lenpath = length x revpath = reverse $ take (lenpath-1) x tree = listToNode revpath in tree:constructTree (id2:encounteredNodes) xs constructTree _ _ = [] listToNode (p:ps:[]) = Node p [Node ps []] listToNode (p:ps) = Node p [listToNode ps] --------------------------------------------------------------------- Nikolas
participants (3)
-
Ketil Malde
-
Matt Morrow
-
Nikolas Borrel-Jensen