
On 20/01/12 18:45, Gwern Branwen wrote:
Recently I wanted to sort through a large folder of varied files and figure out what is a 'natural' folder to split out, where natural means something like>4 files with the same prefix.
My idea for an algorithm would be: build a trie for the input strings, and then look for the deepest subtries with more than one child. For example, a trie containing the strings chorus-kiminoshiranaimonogatari.ogg chorus-mrmusic.ogg choucho-lastnightgoodnight.ogg looks like: <root> (3 items) c (3 items) h (3 items) o (3 items) r (2 items) u (2 items) s (2 items) - (2 items) k (1 item) i (1 item) minoshiranaimonogatari.ogg m (1 item) r (1 item) music.ogg u (1 item) c (1 item) ho-lastnightgoodnight.ogg Where actually the lines with more than one character are also subtrees of subtrees of subtrees. Here is some example code (untested): import qualified Data.Map as Map -- A trie datatype data Trie a = Trie { numLeafs, numDescendant :: !Int , children :: Map.Map a (Trie a) } -- The empty trie empty :: Trie a empty = Trie 0 0 Map.empty -- A trie that contains a single string singleton :: Ord a => [a] -> Trie a singleton [] = Trie 1 1 Map.empty singleton (x:xs) = Trie 0 1 (Map.singleton x (singleton xs) -- Merge two tries merge :: Ord a => Trie a -> Trie a -> Trie a merge (Trie l d c) (Trie l' d' c') = Trie (l+l') (d+d') (Map.unionWith merge c c') fromList :: Ord a => [[a]] -> Trie a fromList = foldr merge empty . map singleton toList :: Ord a => Trie a -> [[a]] toList (Trie l _ c) = replicate l [] ++ [ x:xs | (x,t) <- Map.toList c, xs <- toList t ] data CommonPrefix a = Prefix { prefix :: [a], names :: Trie a } atLeastThisManyDescendants :: Int -> Trie a -> [CommonPrefix a] atLeastThisManyDescendants minD trie@(Trie l d t) | d < minD = [] | null forChildren = [Prefix [] trie] | otherwise = forChildren where forChildren = [ Prefix (x:pfx) names | (x,t) <- Map.toList c , Prefix pfx names <- atLeastThisManyDescendants n t ] Twan