
On 03/14/2012 01:04 PM, Chaddaï Fouché wrote:
> In my own tests it wasn't very fast: 10 min to check a 25 GB hierarchy of music files, using 13 MB of memory at most. Though I must admit that I didn't try to find similar tools to compare against, so I'm not too certain what normal running times look like.

I finally got it running after updating the Haskell Platform from source to resolve some dependencies, and now it works like a charm. Runtime hasn't been an issue for me so far and memory consumption seems to be at an acceptable level, so it's a very good start - thanks again.
I took the liberty of modifying the output a little to suit my needs - maybe a future reader will find it helpful, too. It's attached below.

Best,
Michael

--
module Main where

import Data.Conduit.Filesystem (traverse)
import qualified Data.Conduit.List as CL
import Data.Conduit
import Data.Digest.Pure.MD5 (MD5Digest)
import Crypto.Conduit (hashFile)
import qualified Data.Map as M
import qualified Filesystem.Path.CurrentOS as FP
import System.Environment

-- Walk the directory tree, hash every file and return each digest
-- that is shared by more than one path.
duplicates :: FilePath -> IO [(MD5Digest,[FilePath])]
duplicates dir = runResourceT $ do
    md5s <- traverse False (FP.decodeString dir)
            $$ CL.mapM process
            =$ CL.fold buildMap M.empty
    return . M.assocs . M.filter ((>1).length) $ md5s
  where
    -- Hash a single file and pair the digest with its path.
    process :: FP.FilePath -> IO (MD5Digest, FilePath)
    process fp = do
        let strFp = FP.encodeString fp
        md5 <- hashFile strFp
        return (md5,strFp)
    -- Append the path to the list already stored under this digest.
    buildMap m (md5,fp) = M.insertWith' (flip (++)) md5 [fp] m

-- Print each duplicated digest followed by the paths that share it.
main = do
    [dir] <- getArgs
    putStrLn . unlines . map (\(md5,paths) -> (show md5) ++ "-->\n" ++ (unlines paths)) =<< duplicates dir
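For a future reader who only wants to see the grouping step in isolation, here is a minimal sketch of it without the conduit and hashing parts. It is not part of the attached program: the name groupDuplicates and the sample digests and paths are made up for illustration, and plain strings stand in for MD5Digest. It uses M.fromListWith from Data.Map.Strict rather than M.insertWith', which as far as I know is no longer available in recent versions of containers.

```haskell
module Main where

import qualified Data.Map.Strict as M

-- Hypothetical helper (not in the attached program): collect the paths
-- that share a key and keep only the keys seen more than once.
-- fromListWith calls the combining function as (f new old), so
-- flip (++) appends new paths after the ones already stored.
groupDuplicates :: Ord k => [(k, FilePath)] -> [(k, [FilePath])]
groupDuplicates =
    M.assocs
  . M.filter ((> 1) . length)
  . M.fromListWith (flip (++))
  . map (\(k, p) -> (k, [p]))

main :: IO ()
main = do
    -- Made-up sample data; the strings stand in for real MD5 digests.
    let sample =
          [ ("aaa", "a/one.mp3")
          , ("bbb", "a/two.mp3")
          , ("aaa", "b/one-copy.mp3")
          ]
    mapM_ print (groupDuplicates sample)
    -- prints ("aaa",["a/one.mp3","b/one-copy.mp3"])
```

The key type only needs an Ord instance, so the same function should work unchanged on the (MD5Digest, FilePath) pairs produced by the attached program, assuming MD5Digest has one (it is used as a Map key there).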