
On Tue, Mar 13, 2012 at 4:34 PM, Michael Schober
Hi again,
I'm still trying to figure out my earlier problem regarding the directory tree and functions on the files.
What I'm really trying to do sounds simple enough: find duplicate files for a complete directory, given a root. The original approach I had in mind was to compute every file's checksum, which I could feed into a map of (checksums,[filepaths]).
Is there a problem with the solution I gave you ?
On 03/10/2012 01:48 PM, Chaddaï Fouché wrote:
Note that you solution isn't very "functional-like", but rather imperative. On the other hand, making it more functional in this particular case come with its own brand of subtle difficulties.
I now wonder how to achieve my goal properly in Haskell. Obviously I have some sequential parts of my program (reading directory trees is I/O and therefore monadic), but it seems to me that could also be some functional part as well (pairing files with checksums, pairing checksums with common filepaths, filtering only checksums with more than one filepath, maybe even sorting, etc.).
How would you go about to do it? Which parts of the program should be monadic, which functional? Would you create a custom-build monad or would you use build-ins (which ones?)?
I would really like to have some pointers how an experienced Haskell developer would go about such a thing, since I want to acquire a Haskell feel of programming rather than keep on programming in my old ways, now in Haskell. I would also be happy to be simply pinpointed to other applications which source code are exemplary, but not too huge to dig through in a sensible amount of time. (I'm currently trying to work through the Darcs source code, but that seems a bit too overkill.)
An "experienced" Haskell programmer would probably reuse some of the "streaming" solutions on Hackage, combine it with md5 and a map :
module Main where
import Data.Conduit.Filesystem (traverse) import qualified Data.Conduit.List as CL import Data.Conduit
import Data.Digest.Pure.MD5 (MD5Digest) import Crypto.Conduit (hashFile)
import qualified Data.Map as M import qualified Filesystem.Path.CurrentOS as FP import System.Environment
duplicates :: FilePath -> IO [[FilePath]] duplicates dir = runResourceT $ do md5s <- traverse False (FP.decodeString dir) $$ CL.mapM process =$ CL.fold buildMap M.empty return . M.elems . M.filter ((>1).length) $ md5s where process :: FP.FilePath -> IO (MD5Digest, FilePath) process fp = do let strFp = FP.encodeString fp md5 <- hashFile strFp return (md5,strFp) buildMap m (md5,fp) = M.insertWith' (flip (++)) md5 [fp] m
main = do [dir] <- getArgs print =<< duplicates dir
This may be a bit too much and there's a bit of "noise" introduced by the two FilePath type (the standard one which is a synonym for String, and the one used in Filesystem.Path and filesystem-conduit which was introduced to better cope with some exotic filepath and encoding in filesystems), on the other hand, this is not too inefficient (though it may be improved by using another structure than Map, and maybe by using nano-md5 rather than pureMD5) and the memory used is only proportional to the number of files in the directory which should be ok. -- Jedaï