Finding longest common prefixes in a list

Recently I wanted to sort through a large folder of varied files and figure out what is a 'natural' folder to split out, where natural means something like >4 files with the same prefix. (This might be author, genre, subject, whatever I felt was important when I was naming the file.) Now usually I name files with hyphens as the delimiters like the hypothetical '1998-wadler-monads.pdf', and it would be easy to write a stdin/stdout filter to break Strings on hyphens and sort by whatever is most common. But this is rather hardwired, can I solve the more general problem of finding the longest common prefixes, whatever they are? This turns out to be much more difficult than simply finding 'the' longest common prefix (which is usually ""). I found an algorithm of sorts at http://stackoverflow.com/a/6634624 but it was easier described than implemented. Eventually I wrote what I *think* is a correct program, but it's definitely of the write-only sort. Perhaps people have better implementations somewhere? I saw a lot of discussion of tries, but I didn't go that route. The code, followed by an example: #!/usr/bin/env runhaskell import Data.List (intercalate, isPrefixOf, nub, sort) main :: IO () main = interact (unlines . intercalate [""] . chunkFiles . lines ) -- basic algorithm from http://stackoverflow.com/a/6634624 chunkFiles :: Ord a => [[a]] -> [[[a]]] chunkFiles f = map (\(_,b) -> filter (isPrefixOf b) f) $ sort $ map (\x -> (countPrefixes x f,x)) (e $ bar f) sharedPrefixes :: Ord a => [[a]] -> [a] sharedPrefixes [] = [] sharedPrefixes s = foldr1 sp2 s where sp2 l1 l2 = map fst . takeWhile (uncurry (==)) $ zip l1 l2 traverse :: Ord a => [[a]] -> [[a]] traverse [] = [] traverse x = sharedPrefixes (take 2 x) : traverse (drop 1 x) bar :: Ord a => [[a]] -> [[a]] bar = nub . sort . traverse . sort countPrefixes :: (Ord a) => [a] -> [[a]] -> Int countPrefixes x xs = length $ filter (x `isPrefixOf`) xs e :: Eq a => [[a]] -> [[a]] e y = map fst $ filter snd $ map (\x -> (x, (==) 1 $ length . filter id $ map (x `isPrefixOf`) y)) y {- Example input from `ls`: chorus-kiminoshiranaimonogatari.ogg chorus-mrmusic.ogg choucho-lastnightgoodnight.ogg dylanislame-aikotoba.ogg electriclove-エレクトリック・ラブ-korskremix.ogg gumi-bacon8-justhangingaround.ogg gumi-iapologizetoyou.ogg gumi-montblanc.ogg gumi-mozaikrole.ogg gumi-ハッピーシンセサイザ.ogg gumi-showasengirl.ogg gumi-sweetfloatflatsスイートフロートアパート.ogg gumi-timewarpedafterchoppingmystagbeetle.ogg gumi-オリジナル曲-付きホシメグリ.ogg gumi-ミクオリジナル親友.ogg kaito-byakkoyano.ogg kaito-flowertail.ogg kasaneteto-tam-ochamekinou重音テト吹っ切れたおちゃめ機能.ogg len-crime-timetosaygoodbye.ogg len-fire◎flower.ogg len-ponponpon.ogg lily-prototype.ogg luka-apolxcore-waitingforyou.ogg luka-dimトロイ.ogg luka-dion-myheartwillgoon.ogg luka-dirgefilozofio-dirgeasleepinjesus.ogg luka-アゴアニキ-doubelariatダブルラリアット.ogg luka-emon-heartbeats.ogg luka-emonloid3-ハローハロー.ogg luka-everybreathyoutake.ogg luka-オリジナル-garden.ogg luka-justbefriends.ogg lukameiko-gemini.ogg luka-milkyway.ogg luka-やみくろ-かいぎ.ogg luka-tic-tick.ogg luka-torinouta.ogg luka-zeijakukei-shounenshoujo.ogg luka-勝手にアニメ-nologic-作ってみた.ogg luka-駄目人間.ogg meiko-artemis-awake.ogg miku-9ronicleプラチナ.ogg miku-acolorlinkingworld-この世界の下で.ogg miku-acolorlinkingworld-青い花.ogg miku-a+jugos-lullabyforkindness.ogg miku-akayaka-beacon.ogg miku-akayakap-sunrise.ogg miku-aoihana.ogg miku-arabianresponse.ogg miku-avtechno-tear.ogg miku-こえをきかせてcicci.ogg miku-cleantears-remind2011natsu-greenhillzonecrystiararemix.ogg miku-cleantears-remind2011natsu-夏影summerwindremix.ogg miku-clocklockworks.ogg miku-dancedancevol2-runner.ogg miku-daniwellp-chaoticuniverse.ogg miku-dixieflatline-shinonomescrumble.ogg miku-electricloveエレクトリックラヴ.ogg miku-elegumitokyo-kissmebaby.ogg miku-galaxyodyssey-cryingirl.ogg miku-galaxyodyssey-galaxyspacelines.ogg miku-hakamairi.ogg miku-haruna.ogg miku-heartshooter.ogg miku-hoshikuzutokakera.ogg miku-innes.ogg miku-innocence初音ミク.ogg miku-jemappelle-motion-likeyou.ogg miku-jemappelle-motion-ohwell.ogg miku-jevannip-myfavoritesummer.ogg miku-kakokyuudance-過呼吸ダンス.ogg miku-kz-packaged.ogg miku-kz-tellyourworld.ogg miku-lastscene.ogg miku-lostmemories付き-初音ミク.ogg miku-lovelyday.ogg miku-いいわけlove_song.ogg mikulukagumi-prayfor.ogg miku-maple-初音ミク楓-オリジナル曲.ogg miku-more1.5.ogg miku-m@rk-eklosion.ogg miku-m@rk-kirch.ogg miku-nana-ボーナストラック-ハッピー般若コア.ogg miku-nekomimiswitch.ogg miku-nightrainbow.ogg miku-noyounome.ogg miku-むかしむかしのきょうのぼくオリジナル.ogg miku-pandolistp-neverendinghammertime.ogg miku-ジラートP-birthdayofeden-deepsleep.ogg miku-ジラートP-birthdayofeden-水中読書.ogg miku-plustellia-dear.ogg miku-plustellia-壁の彩度-crazygirl.ogg miku-plustellia-壁の彩度-discoradio.ogg miku-ぽわぽわP-ストロボライト.ogg miku-rabbitforgets.ogg miku-re:package-lastnightgoodnight.ogg miku-re:package-ourmusic.ogg miku-re:package-sutorobonaitsu.ogg miku-rollinggirl.ogg miku-ryo-メルト-melt.ogg miku-senseiniitteyaro.ogg miku-sevencolors-レモネード.ogg miku-shoukinosatadenia.ogg miku-stratosphere.ogg miku-supernova.ogg miku-tam-lastnightgoodnight.ogg miku-tanatofobia.ogg miku-thearmyforyourenvy-スーパー・ノヴァ.ogg miku-theendlesslove.ogg miku-tinyparadise-snowflake.ogg miku-tinyparadise-tinyparadise.ogg miku-unfragment.ogg miku-worldismine-ルドイズマイン.ogg miku-yakiimo.ogg miku-文学少年の憂鬱-オリジナル.ogg miku-カラフルポップビートオリジナル曲.ogg miku-杯本選life.ogg miku-杯本選初音ミクどういうことなのダンス.ogg miku-般若心経beautyfloor-buddhamix.ogg miku-般若心経ポップ.ogg niconicochorus-blackrockshooter.ogg niconicochorus-justbefriends.ogg rin-dixieflatline-gemini.ogg rin-elegumitokyo-二人、恋してgirlsside.ogg rin-helloworld.ogg rin-jutenija.ogg rin-lastnightgoodnight.ogg rin-ripples-evergreen.ogg rin-っ´ω`c.ogg rollinggirl-piano.ogg seeu-gagain-따라리라ddadada.ogg utau-雪歌ユフbeyondオリジナル曲.ogg yuki-discochocolatheque.ogg yuki-shouwasenhosiga^ru.ogg yuki-shouwasenhosiga^ru.ogg Example output from `ls | ~/lcp.hs`: chorus-kiminoshiranaimonogatari.ogg chorus-mrmusic.ogg gumi-montblanc.ogg gumi-mozaikrole.ogg gumi-showasengirl.ogg gumi-sweetfloatflatsスイートフロートアパート.ogg kaito-byakkoyano.ogg kaito-flowertail.ogg luka-emon-heartbeats.ogg luka-emonloid3-ハローハロー.ogg luka-tic-tick.ogg luka-torinouta.ogg miku-acolorlinkingworld-この世界の下で.ogg miku-acolorlinkingworld-青い花.ogg miku-akayaka-beacon.ogg miku-akayakap-sunrise.ogg miku-cleantears-remind2011natsu-greenhillzonecrystiararemix.ogg miku-cleantears-remind2011natsu-夏影summerwindremix.ogg miku-dancedancevol2-runner.ogg miku-daniwellp-chaoticuniverse.ogg miku-electricloveエレクトリックラヴ.ogg miku-elegumitokyo-kissmebaby.ogg miku-galaxyodyssey-cryingirl.ogg miku-galaxyodyssey-galaxyspacelines.ogg miku-hakamairi.ogg miku-haruna.ogg miku-innes.ogg miku-innocence初音ミク.ogg miku-jemappelle-motion-likeyou.ogg miku-jemappelle-motion-ohwell.ogg miku-kz-packaged.ogg miku-kz-tellyourworld.ogg miku-lostmemories付き-初音ミク.ogg miku-lovelyday.ogg miku-m@rk-eklosion.ogg miku-m@rk-kirch.ogg miku-plustellia-壁の彩度-crazygirl.ogg miku-plustellia-壁の彩度-discoradio.ogg miku-senseiniitteyaro.ogg miku-sevencolors-レモネード.ogg miku-tam-lastnightgoodnight.ogg miku-tanatofobia.ogg miku-thearmyforyourenvy-スーパー・ノヴァ.ogg miku-theendlesslove.ogg miku-tinyparadise-snowflake.ogg miku-tinyparadise-tinyparadise.ogg miku-ジラートP-birthdayofeden-deepsleep.ogg miku-ジラートP-birthdayofeden-水中読書.ogg miku-杯本選life.ogg miku-杯本選初音ミクどういうことなのダンス.ogg miku-般若心経beautyfloor-buddhamix.ogg miku-般若心経ポップ.ogg niconicochorus-blackrockshooter.ogg niconicochorus-justbefriends.ogg len-crime-timetosaygoodbye.ogg len-fire◎flower.ogg len-ponponpon.ogg luka-dimトロイ.ogg luka-dion-myheartwillgoon.ogg luka-dirgefilozofio-dirgeasleepinjesus.ogg miku-re:package-lastnightgoodnight.ogg miku-re:package-ourmusic.ogg miku-re:package-sutorobonaitsu.ogg miku-nana-ボーナストラック-ハッピー般若コア.ogg miku-nekomimiswitch.ogg miku-nightrainbow.ogg miku-noyounome.ogg rin-dixieflatline-gemini.ogg rin-elegumitokyo-二人、恋してgirlsside.ogg rin-helloworld.ogg rin-jutenija.ogg rin-lastnightgoodnight.ogg rin-ripples-evergreen.ogg rin-っ´ω`c.ogg -} -- gwern http://www.gwern.net

On 20/01/12 18:45, Gwern Branwen wrote:
Recently I wanted to sort through a large folder of varied files and figure out what is a 'natural' folder to split out, where natural means something like>4 files with the same prefix.
My idea for an algorithm would be: build a trie for the input strings, and then look for the deepest subtries with more than one child. For example, a trie containing the strings chorus-kiminoshiranaimonogatari.ogg chorus-mrmusic.ogg choucho-lastnightgoodnight.ogg looks like: <root> (3 items) c (3 items) h (3 items) o (3 items) r (2 items) u (2 items) s (2 items) - (2 items) k (1 item) i (1 item) minoshiranaimonogatari.ogg m (1 item) r (1 item) music.ogg u (1 item) c (1 item) ho-lastnightgoodnight.ogg Where actually the lines with more than one character are also subtrees of subtrees of subtrees. Here is some example code (untested): import qualified Data.Map as Map -- A trie datatype data Trie a = Trie { numLeafs, numDescendant :: !Int , children :: Map.Map a (Trie a) } -- The empty trie empty :: Trie a empty = Trie 0 0 Map.empty -- A trie that contains a single string singleton :: Ord a => [a] -> Trie a singleton [] = Trie 1 1 Map.empty singleton (x:xs) = Trie 0 1 (Map.singleton x (singleton xs) -- Merge two tries merge :: Ord a => Trie a -> Trie a -> Trie a merge (Trie l d c) (Trie l' d' c') = Trie (l+l') (d+d') (Map.unionWith merge c c') fromList :: Ord a => [[a]] -> Trie a fromList = foldr merge empty . map singleton toList :: Ord a => Trie a -> [[a]] toList (Trie l _ c) = replicate l [] ++ [ x:xs | (x,t) <- Map.toList c, xs <- toList t ] data CommonPrefix a = Prefix { prefix :: [a], names :: Trie a } atLeastThisManyDescendants :: Int -> Trie a -> [CommonPrefix a] atLeastThisManyDescendants minD trie@(Trie l d t) | d < minD = [] | null forChildren = [Prefix [] trie] | otherwise = forChildren where forChildren = [ Prefix (x:pfx) names | (x,t) <- Map.toList c , Prefix pfx names <- atLeastThisManyDescendants n t ] Twan

On Fri, Jan 20, 2012 at 1:57 PM, Twan van Laarhoven
Here is some example code (untested):
Well, you're right that it doesn't work. I tried to fix the crucial function, 'atLeastThisManyDescendants', but it's missing something because varying parts doesn't much affect the results when I try it out on example input - it either returns everything or nothing, it seems: atLeastThisManyDescendants :: Int -> Trie a -> [CommonPrefix a] atLeastThisManyDescendants minD trie@(Trie l d t') | d < minD = [] | null forChildren = [Prefix [] trie] | otherwise = forChildren where forChildren = [ Prefix (x:pfx) nms | (x,t) <- Map.toList t' , Prefix pfx nms <- atLeastThisManyDescendants l t ] -- gwern http://www.gwern.net

On 2012-01-20 23:44, Gwern Branwen wrote:
On Fri, Jan 20, 2012 at 1:57 PM, Twan van Laarhoven
wrote: Here is some example code (untested):
Well, you're right that it doesn't work. I tried to fix the crucial function, 'atLeastThisManyDescendants', but it's missing something because varying parts doesn't much affect the results when I try it out on example input - it either returns everything or nothing, it seems: atLeastThisManyDescendants :: Int -> Trie a -> [CommonPrefix a] atLeastThisManyDescendants minD trie@(Trie l d t') | d < minD = [] | null forChildren = [Prefix [] trie] | otherwise = forChildren where forChildren = [ Prefix (x:pfx) nms | (x,t) <- Map.toList t' , Prefix pfx nms <- atLeastThisManyDescendants l t ]
It should be "atLeastThisManyDescendants minD t", minD is a threshold for the minimum numer of descendants, and it stays the same in the recursive call. That's what you get for not testing your code :) With the correct function I get a result like: *Main> mapM_ (print . prefix) $ atLeastThisManyDescendants 4 test1 "gumi-" "luka-" "miku-a" "miku-h" "miku-m" "miku-n" "miku-p" "miku-r" "miku-s" "miku-t" "rin-" Notice that there are lots of "miku-X" prefixes found. This is probably not what you want. What exactly do you want the algorithm to do? For example, "" is obviously a prefix of every string, but it is not very long. On the other hand, each string is a prefix of itself, but that prefix is shared by only one string (usually). By the way, the sort and compare adjacent pairs approach corresponds to "atLeastThisManyDescendants 2". Twan

On Sat, Jan 21, 2012 at 8:18 AM, Twan van Laarhoven
Notice that there are lots of "miku-X" prefixes found. This is probably not what you want. What exactly do you want the algorithm to do? For example, "" is obviously a prefix of every string, but it is not very long. On the other hand, each string is a prefix of itself, but that prefix is shared by only one string (usually).
By the way, the sort and compare adjacent pairs approach corresponds to "atLeastThisManyDescendants 2".
Ah, now the code makes sense to me. It's longer, but it is a heck of a lot more principled and readable, so I'm happy to replace my version with yours. It's not too hard to convert it into a CLI filter with optional depth (default of 2, replicating original behavior): import qualified Data.Map as Map import System.Environment (getArgs) import Data.List (sortBy) import Data.Ord (comparing) main :: IO () main = do arg <- getArgs let n = if null arg then 2 else read (head arg) :: Int interact (unlines . chunk n . lines) chunk :: Int -> [String] -> [String] chunk n = map prefix . sortByLength . atLeastThisManyDescendants n . fromList where sortByLength :: [CommonPrefix Char] -> [CommonPrefix Char] sortByLength = sortBy (comparing (numDescendant . names)) ..... And the results seem kosher (printing just the prefixes is probably the best idea, but wouldn't be too hard to switch to printing full filenames - just filter the original file list with the extracted prefix from each CommonPrefix): $ ls music/vocaloid/| runhaskell lcp.hs 5 miku-s miku-t miku-r rin- miku-a gumi- luka- $ ls music/vocaloid/| runhaskell lcp.hs 4 miku-h miku-m miku-n miku-p miku-s miku-t miku-r rin- miku-a gumi- luka- $ ls music/vocaloid/| runhaskell lcp.hs # with 2 chorus- gumi-mo gumi-s kaito- luka-emon luka-t miku-acolorlinkingworld- miku-akayaka miku-cleantears-remind2011natsu- miku-dan miku-ele miku-galaxyodyssey- miku-ha miku-inn miku-jemappelle-motion- miku-kz- miku-lo miku-m@rk- miku-plustellia-壁の彩度- miku-ro miku-se miku-ta miku-the miku-tinyparadise- miku-ジラートP-birthdayofeden- miku-杯本選 miku-般若心経 niconicochorus- yuki- len- luka-di miku-re:package- miku-n rin- -- gwern http://www.gwern.net
participants (2)
-
Gwern Branwen
-
Twan van Laarhoven