Re: [Haskell-beginners] Processing a list of files the Haskell way

13 Mar 2012


      On Tue, Mar 13, 2012 at 4:34 PM, Michael Schober  wrote:
...
Hi again,
I'm still trying to figure out my earlier problem regarding the directory
tree and functions on the files.
What I'm really trying to do sounds simple enough: find duplicate files for
a complete directory, given a root. The original approach I had in mind was
to compute every file's checksum, which I could feed into a map of
(checksums,[filepaths]).
Is there a problem with the solution I gave you ?
...
On 03/10/2012 01:48 PM, Chaddaï Fouché wrote:
...
Note that you solution isn't very "functional-like", but rather
imperative. On the other hand, making it more functional in this
particular case come with its own brand of subtle difficulties.
I now wonder how to achieve my goal properly in Haskell. Obviously I have
some sequential parts of my program (reading directory trees is I/O and
therefore monadic), but it seems to me that could also be some functional
part as well (pairing files with checksums, pairing checksums with common
filepaths, filtering only checksums with more than one filepath, maybe even
sorting, etc.).
How would you go about to do it? Which parts of the program should be
monadic, which functional? Would you create a custom-build monad or would
you use build-ins (which ones?)?
I would really like to have some pointers how an experienced Haskell
developer would go about such a thing, since I want to acquire a Haskell
feel of programming rather than keep on programming in my old ways, now in
Haskell. I would also be happy to be simply pinpointed to other applications
which source code are exemplary, but not too huge to dig through in a
sensible amount of time. (I'm currently trying to work through the Darcs
source code, but that seems a bit too overkill.)
An "experienced" Haskell programmer would probably reuse some of the
"streaming" solutions on Hackage, combine it with md5 and a map :
...
module Main where
import Data.Conduit.Filesystem (traverse)
import qualified Data.Conduit.List as CL
import Data.Conduit
import Data.Digest.Pure.MD5 (MD5Digest)
import Crypto.Conduit (hashFile)
import qualified Data.Map as M
import qualified Filesystem.Path.CurrentOS as FP
import System.Environment
duplicates :: FilePath -> IO [[FilePath]]
duplicates dir = runResourceT $ do
  md5s <- traverse False (FP.decodeString dir) $$ CL.mapM process =$ CL.fold buildMap M.empty
  return . M.elems . M.filter ((>1).length) $ md5s
    where
      process :: FP.FilePath -> IO (MD5Digest, FilePath)
      process fp = do
        let strFp = FP.encodeString fp
        md5 <- hashFile strFp
        return (md5,strFp)
      buildMap m (md5,fp) = M.insertWith' (flip (++)) md5 [fp] m
main = do
  [dir] <- getArgs
  print =<< duplicates dir
This may be a bit too much and there's a bit of "noise" introduced by
the two FilePath type (the standard one which is a synonym for String,
and the one used in Filesystem.Path and filesystem-conduit which was
introduced to better cope with some exotic filepath and encoding in
filesystems), on the other hand, this is not too inefficient (though
it may be improved by using another structure than Map, and maybe by
using nano-md5 rather than pureMD5) and the memory used is only
proportional to the number of files in the directory which should be
ok.

-- 
Jedaï