Re: [Haskell-cafe] Finding longest common prefixes in a list

20 Jan 2012

      On 20/01/12 18:45, Gwern Branwen wrote:
...
Recently I wanted to sort through a large folder of varied files and
figure out what is a 'natural' folder to split out, where natural
means something like>4 files with the same prefix.
My idea for an algorithm would be: build a trie for the input strings, 
and then look for the deepest subtries with more than one child.

For example, a trie containing the strings
   chorus-kiminoshiranaimonogatari.ogg
   chorus-mrmusic.ogg
   choucho-lastnightgoodnight.ogg

looks like:
  <root>  (3 items)
   c       (3 items)
    h       (3 items)
     o       (3 items)
      r       (2 items)
       u       (2 items)
        s       (2 items)
         -       (2 items)
          k       (1 item)
           i       (1 item)
            minoshiranaimonogatari.ogg
          m       (1 item)
           r       (1 item)
            music.ogg
      u       (1 item)
       c       (1 item)
        ho-lastnightgoodnight.ogg
Where actually the lines with more than one character are also subtrees 
of subtrees of subtrees.

Here is some example code (untested):

import qualified Data.Map as Map

-- A trie datatype
data Trie a = Trie { numLeafs, numDescendant :: !Int
                    , children :: Map.Map a (Trie a) }

-- The empty trie
empty :: Trie a
empty = Trie 0 0 Map.empty

-- A trie that contains a single string
singleton :: Ord a => [a] -> Trie a
singleton [] = Trie 1 1 Map.empty
singleton (x:xs) = Trie 0 1 (Map.singleton x (singleton xs)

-- Merge two tries
merge :: Ord a => Trie a -> Trie a -> Trie a
merge (Trie l d c) (Trie l' d' c')
     = Trie (l+l') (d+d') (Map.unionWith merge c c')

fromList :: Ord a => [[a]] -> Trie a
fromList = foldr merge empty . map singleton

toList :: Ord a => Trie a -> [[a]]
toList (Trie l _ c)
     = replicate l []
     ++ [ x:xs | (x,t) <- Map.toList c, xs <- toList t ]

data CommonPrefix a = Prefix { prefix :: [a], names :: Trie a }

atLeastThisManyDescendants :: Int -> Trie a -> [CommonPrefix a]
atLeastThisManyDescendants minD trie@(Trie l d t)
     | d < minD = []
     | null forChildren = [Prefix [] trie]
     | otherwise = forChildren
   where
     forChildren = [ Prefix (x:pfx) names
                   | (x,t) <- Map.toList c
                   , Prefix pfx names <- atLeastThisManyDescendants n t ]

Twan

Re: [Haskell-cafe] Finding longest common prefixes in a list

Twan van Laarhoven