Processing a list of files the Haskell way

Michael Schober

10 Mar 2012 10 Mar '12

11:55 a.m.

Hi everyone, I'm currently trying to solve a problem in which I have to process a long list of files, more specifically I want to compute MD5 checksums for all files. I have code which lists me all the files and holds it in the following data structure: data DirTree = FileNode FilePath | DirNode FilePath [DirTree] I tried the following: -- calculates MD5 sums for all files in a dirtree addChecksums :: DirTree -> IO [(DirTree,MD5Digest)] addChecksums dir = addChecksums' [dir] where addChecksums' :: [DirTree] -> IO [(DirTree,MD5Digest)] addChecksums' [] = return [] addChecksums' (f@(FileNode fp):re) = do bytes <- BL.readFile fp rest <- addChecksums' re return ((f,md5 bytes):rest) addChecksums' ((DirNode fp filelist):re) = do efiles <- addChecksums' filelist rest <- addChecksums' re return $ efiles ++ rest This works fine, but only for a small number of files. If I try it on a big directory tree, the memory gets junked up and it aborts with an error message telling me that there are too many open files. So I guess, I have to sequentialize the code a little bit more. But at the same time, I want to keep it as functional as possible and I don't want to write C-like code. What would be the Haskell way to do something like this? Thanks for all the input, Michael

Show replies by date

edgar klerks

10 Mar 10 Mar

12:32 p.m.

Hi Michael, Your code has a very C-like feel to it. I would first separate the reading of the directory structure and the files and the walk over the tree. Something like this: data DirTree = FileNode FilePath | DirNode FilePath [DirTree] walkDirTree :: (FilePath -> a) -> DirTree -> [a] walkDirTree f (FileNode fp) = [f fp] walkDirTree f (DirNode fp fs) = f fp : (fs >>= (walkDirTree f)) I know this isn't what you need, I didn't read your solution properly when I wrote it, but it is a useful hint. The separation of the pure part and the IO part of your program is important. The problem of the open files is another beast. You are using lazy bytestrings. Lazy bytestrings can keep the file descriptor open as long as you haven't read all the bytes. I suspect you need to add some strictness to your program. You can try to use strict bytestrings. Or use seq to evaluate the md5 thunks earlier in the program execution. Greets, Edgar On 3/10/12, Michael Schober wrote:

...

Hi everyone,

I'm currently trying to solve a problem in which I have to process a long list of files, more specifically I want to compute MD5 checksums for all files.

I have code which lists me all the files and holds it in the following data structure:

data DirTree = FileNode FilePath | DirNode FilePath [DirTree]

I tried the following:

-- calculates MD5 sums for all files in a dirtree addChecksums :: DirTree -> IO [(DirTree,MD5Digest)] addChecksums dir = addChecksums' [dir] where addChecksums' :: [DirTree] -> IO [(DirTree,MD5Digest)] addChecksums' [] = return [] addChecksums' (f@(FileNode fp):re) = do bytes <- BL.readFile fp rest <- addChecksums' re return ((f,md5 bytes):rest) addChecksums' ((DirNode fp filelist):re) = do efiles <- addChecksums' filelist rest <- addChecksums' re return $ efiles ++ rest

This works fine, but only for a small number of files. If I try it on a big directory tree, the memory gets junked up and it aborts with an error message telling me that there are too many open files.

So I guess, I have to sequentialize the code a little bit more. But at the same time, I want to keep it as functional as possible and I don't want to write C-like code.

What would be the Haskell way to do something like this?

Thanks for all the input, Michael

_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners

Chaddaï Fouché

12:48 p.m.

On Sat, Mar 10, 2012 at 12:55 PM, Michael Schober wrote:

...

Hi everyone,

I'm currently trying to solve a problem in which I have to process a long list of files, more specifically I want to compute MD5 checksums for all files.

I have code which lists me all the files and holds it in the following data structure:

data DirTree = FileNode FilePath | DirNode FilePath [DirTree]

I tried the following:

-- calculates MD5 sums for all files in a dirtree addChecksums :: DirTree -> IO [(DirTree,MD5Digest)] addChecksums dir = addChecksums' [dir] where addChecksums' :: [DirTree] -> IO [(DirTree,MD5Digest)] addChecksums' [] = return [] addChecksums' (f@(FileNode fp):re) = do bytes <- BL.readFile fp rest <- addChecksums' re return ((f,md5 bytes):rest)

You're not computing the md5 sums before you have done the same for all other files in the directory... And since you're being lazy you don't even compute it _at all_ before you ask for it leter in your program. If readFile wasn't lazy, you would need to keep all the contents of those files in memory until after addChecksums is completely finished (which would be a big problem in itself), but since readFile is lazy, those file aren't read either until you need their content. But they're still opened, so you get a lot of opened handle you don't close, and opened handle are a limited resource in any OS so... What you need to do is computing the md5 sums as soon as you see the file and before you do anything else, so :

...

addChecksums' (f@(FileNode fp):re) = do bytes <- BL.readFile fp let !md5sum = md5 bytes rest <- addChecksums' re return ((f,md5sum):rest)

The ! before md5sum indicates that this let-binding should be immediately computed rather than deferred until needed which is the norm for let-binding. Don't forget to add {-# LANGUAGE BangPattern #-} at the beginning of your file. Since the file is read to its end by md5, the handle is automatically closed, so you shouldn't have the same problem. Note that you solution isn't very "functional-like", but rather imperative. On the other hand, making it more functional in this particular case come with its own brand of subtle difficulties. -- Jedaï

Michael Schober

13 Mar 13 Mar

3:34 p.m.

Hi again, I'm still trying to figure out my earlier problem regarding the directory tree and functions on the files. What I'm really trying to do sounds simple enough: find duplicate files for a complete directory, given a root. The original approach I had in mind was to compute every file's checksum, which I could feed into a map of (checksums,[filepaths]). On 03/10/2012 01:48 PM, Chaddaï Fouché wrote:

...

Note that you solution isn't very "functional-like", but rather imperative. On the other hand, making it more functional in this particular case come with its own brand of subtle difficulties.

I now wonder how to achieve my goal properly in Haskell. Obviously I have some sequential parts of my program (reading directory trees is I/O and therefore monadic), but it seems to me that could also be some functional part as well (pairing files with checksums, pairing checksums with common filepaths, filtering only checksums with more than one filepath, maybe even sorting, etc.). How would you go about to do it? Which parts of the program should be monadic, which functional? Would you create a custom-build monad or would you use build-ins (which ones?)? I would really like to have some pointers how an experienced Haskell developer would go about such a thing, since I want to acquire a Haskell feel of programming rather than keep on programming in my old ways, now in Haskell. I would also be happy to be simply pinpointed to other applications which source code are exemplary, but not too huge to dig through in a sensible amount of time. (I'm currently trying to work through the Darcs source code, but that seems a bit too overkill.) Thanks for any comments and thoughts. Best, Michael

Chaddaï Fouché

6:47 p.m.

On Tue, Mar 13, 2012 at 4:34 PM, Michael Schober wrote:

...

Hi again,

I'm still trying to figure out my earlier problem regarding the directory tree and functions on the files.

What I'm really trying to do sounds simple enough: find duplicate files for a complete directory, given a root. The original approach I had in mind was to compute every file's checksum, which I could feed into a map of (checksums,[filepaths]).

Is there a problem with the solution I gave you ?

...

On 03/10/2012 01:48 PM, Chaddaï Fouché wrote:

...
Note that you solution isn't very "functional-like", but rather imperative. On the other hand, making it more functional in this particular case come with its own brand of subtle difficulties.

I now wonder how to achieve my goal properly in Haskell. Obviously I have some sequential parts of my program (reading directory trees is I/O and therefore monadic), but it seems to me that could also be some functional part as well (pairing files with checksums, pairing checksums with common filepaths, filtering only checksums with more than one filepath, maybe even sorting, etc.).

How would you go about to do it? Which parts of the program should be monadic, which functional? Would you create a custom-build monad or would you use build-ins (which ones?)?

I would really like to have some pointers how an experienced Haskell developer would go about such a thing, since I want to acquire a Haskell feel of programming rather than keep on programming in my old ways, now in Haskell. I would also be happy to be simply pinpointed to other applications which source code are exemplary, but not too huge to dig through in a sensible amount of time. (I'm currently trying to work through the Darcs source code, but that seems a bit too overkill.)

An "experienced" Haskell programmer would probably reuse some of the "streaming" solutions on Hackage, combine it with md5 and a map :

...

module Main where

import Data.Conduit.Filesystem (traverse) import qualified Data.Conduit.List as CL import Data.Conduit

import Data.Digest.Pure.MD5 (MD5Digest) import Crypto.Conduit (hashFile)

import qualified Data.Map as M import qualified Filesystem.Path.CurrentOS as FP import System.Environment

duplicates :: FilePath -> IO [[FilePath]] duplicates dir = runResourceT $ do md5s <- traverse False (FP.decodeString dir) $$ CL.mapM process =$ CL.fold buildMap M.empty return . M.elems . M.filter ((>1).length) $ md5s where process :: FP.FilePath -> IO (MD5Digest, FilePath) process fp = do let strFp = FP.encodeString fp md5 <- hashFile strFp return (md5,strFp) buildMap m (md5,fp) = M.insertWith' (flip (++)) md5 [fp] m

main = do [dir] <- getArgs print =<< duplicates dir

This may be a bit too much and there's a bit of "noise" introduced by the two FilePath type (the standard one which is a synonym for String, and the one used in Filesystem.Path and filesystem-conduit which was introduced to better cope with some exotic filepath and encoding in filesystems), on the other hand, this is not too inefficient (though it may be improved by using another structure than Map, and maybe by using nano-md5 rather than pureMD5) and the memory used is only proportional to the number of files in the directory which should be ok. -- Jedaï

Michael Schober

14 Mar 14 Mar

10:44 a.m.

On 03/13/2012 07:47 PM, Chaddaï Fouché wrote:

...

Is there a problem with the solution I gave you ? No and yes. No in the sense that it most certainly did solve the open files problem. Yes in the sense that it still consumed a huge amount of memory and, more important to me, left me feeling that I was certainly doing something conceptually wrong. That was the reason for my repost, since I wanted to learn the 'right' (or at least more suitable) concept.

...

An "experienced" Haskell programmer would probably reuse some of the "streaming" solutions on Hackage, combine it with md5 and a map : This seems to be the 'Haskell-way' solution I was looking for. I'm still in the progress of digging through a lot of material I found ([1-3], links below for other interested readers), but I think that this will be very helpful in the future with similar projects as well.

I have yet to test the code you send me last time, but I will give some feedback of it as soon as I get the chance. Thank you very much! Michael References: [1] The original conduit article I found: http://www.yesodweb.com/book/conduit [2] Conduits seem to be a development progress originating from enumerators/iteratee. Therefore, it seems to be a good idea to read up on those prior: http://www.yesodweb.com/book/enumerator [3] There's also an article in The Monad.Reader, issue 16 about Iteratee: http://themonadreader.wordpress.com/2010/05/12/issue-16/

Chaddaï Fouché

12:04 p.m.

Le 14 mars 2012 11:44, "Michael Schober" a écrit :

...

...
An "experienced" Haskell programmer would probably reuse some of the "streaming" solutions on Hackage, combine it with md5 and a map :

This seems to be the 'Haskell-way' solution I was looking for. I'm still

in the progress of digging through a lot of material I found ([1-3], links below for other interested readers), but I think that this will be very helpful in the future with similar projects as well.

...

I have yet to test the code you send me last time, but I will give some

feedback of it as soon as I get the chance.

...

In my own tests it wasn't very fast : 10min to check a 25GB hierarchy of music files, using 13MB of memory maximum. Though I must admit that I didn't try to find similar tools to compare so I'm not too certain of normal performance times.

...

Thank you very much! Michael

References: [1] The original conduit article I found: http://www.yesodweb.com/book/conduit [2] Conduits seem to be a development progress originating from

enumerators/iteratee. Therefore, it seems to be a good idea to read up on those prior:

...

http://www.yesodweb.com/book/enumerator [3] There's also an article in The Monad.Reader, issue 16 about Iteratee: http://themonadreader.wordpress.com/2010/05/12/issue-16/

Good reading ! Though understanding them is not always easy, basic usage of iteratee/conduit/... is not too hard and often deliver pretty nice performance improvements. -- Jedaï

Michael Schober

15 Mar 15 Mar

11:14 a.m.

On 03/14/2012 01:04 PM, Chaddaï Fouché wrote:

...

In my own tests it wasn't very fast : 10min to check a 25GB hierarchy of music files, using 13MB of memory maximum. Though I must admit that I didn't try to find similar tools to compare so I'm not too certain of normal performance times. I finally got it running after a Haskell-platform update from source to resolve some dependencies, but now it works like a charm. Runtime isn't an issue for me so far and memory consumption seems to be in acceptable levels, so it's a very good start - thanks again.

I took the liberty to modify the output a little bit to my needs - maybe a future reader will find it helpful, too. It's attached below. Best, Michael -- module Main where import Data.Conduit.Filesystem (traverse) import qualified Data.Conduit.List as CL import Data.Conduit import Data.Digest.Pure.MD5 (MD5Digest) import Crypto.Conduit (hashFile) import qualified Data.Map as M import qualified Filesystem.Path.CurrentOS as FP import System.Environment duplicates :: FilePath -> IO [(MD5Digest,[FilePath])] duplicates dir = runResourceT $ do md5s <- traverse False (FP.decodeString dir) $$ CL.mapM process =$ CL.fold buildMap M.empty return . M.assocs . M.filter ((>1).length) $ md5s where process :: FP.FilePath -> IO (MD5Digest, FilePath) process fp = do let strFp = FP.encodeString fp md5 <- hashFile strFp return (md5,strFp) buildMap m (md5,fp) = M.insertWith' (flip (++)) md5 [fp] m main = do [dir] <- getArgs putStrLn . unlines . map (\(md5,paths) -> (show md5) ++ "-->\n" ++ (unlines paths)) =<< duplicates dir

Yawar Amin

18 Mar 18 Mar

5:43 p.m.

Hi Michael, Michael Schober writes:

...

[...] I took the liberty to modify the output a little bit to my needs - maybe a future reader will find it helpful, too. It's attached below.

I kind of played around with your example a little bit and wondered if it could be implemented in terms of just the basic Haskell Platform modules and functions. So as an exercise I rolled my own directory traversal and duplicate finder functions. This is what I came up with: - walkDirWith: walks a given directory with a given function that takes a Handle to any (unknown type) value, and returns association lists of paths and the unknown type values. - filePathMap: I think roughly analogous to your duplicates function. - main: In the third line of the main function, I use hFileSize as an example of a function that takes a Handle to an IO value, in this case IO Integer. A hash function could easily be put in here. The last line pretty-prints the Map in a tree-like format. import System.IO import System.Environment (getArgs) import System.Directory ( doesDirectoryExist , getDirectoryContents) import Control.Monad (mapM) import Control.Applicative ((<$>)) import System.FilePath (()) import qualified Data.Map as M walkDirWith :: FilePath -> (Handle -> IO r) -> IO [(r, FilePath)] -> IO [(r, FilePath)] walkDirWith path f walkList = do isDir <- doesDirectoryExist path if isDir then do paths <- getDirectoryContents path concat <$> mapM (\p -> walkDirWith (path p) f walkList) [p | p <- paths, p /= ".", p /= ".."] else do rValue <- withFile path ReadMode f ((:) (rValue, path)) <$> walkList filePathMap :: Ord r => [(r, FilePath)] -> M.Map r [FilePath] filePathMap pathPairs = foldl (\theMap (r, path) -> M.insertWith' (++) r [path] theMap) M.empty pathPairs main :: IO () main = do [dir] <- getArgs fileSizes <- walkDirWith dir hFileSize $ return [] putStr . M.showTree $ filePathMap fileSizes Obviously there's no right or wrong way to do it, but I'm wondering what you think. Regards, Yawar

Lorenzo Bolla

21 Mar 21 Mar

5:13 p.m.

What about something as simple as this? import Control.Monad (forM) import System.Directory (doesDirectoryExist, getDirectoryContents) import System.FilePath (()) import qualified Data.ByteString as B import Data.Digest.OpenSSL.MD5 (md5sum) import qualified Data.Map as M getRecursiveContents :: FilePath -> IO [FilePath] getRecursiveContents topdir = do names <- getDirectoryContents topdir let properNames = filter (`notElem` [".", ".."]) names paths <- forM properNames $ \name -> do let path = topdir name isDirectory <- doesDirectoryExist path if isDirectory then getRecursiveContents path else return [path] return (concat paths) getMD5 :: FilePath -> IO String getMD5 file = md5sum `fmap` B.readFile file main :: IO () main = do files <- getRecursiveContents "." md5s <- sequence $ map getMD5 files let m = M.fromListWith (++) $ zip md5s [[f] | f <- files] putStrLn $ M.showTree m The biggest part is the "getRecursiveContent", shamelessly stolen from RWH. L. On Sun, Mar 18, 2012 at 5:43 PM, Yawar Amin wrote:

...

Hi Michael,

Michael Schober writes:

...
[...] I took the liberty to modify the output a little bit to my needs - maybe a future reader will find it helpful, too. It's attached below.

I kind of played around with your example a little bit and wondered if it could be implemented in terms of just the basic Haskell Platform modules and functions. So as an exercise I rolled my own directory traversal and duplicate finder functions. This is what I came up with:

- walkDirWith: walks a given directory with a given function that takes a Handle to any (unknown type) value, and returns association lists of paths and the unknown type values.

- filePathMap: I think roughly analogous to your duplicates function.

- main: In the third line of the main function, I use hFileSize as an example of a function that takes a Handle to an IO value, in this case IO Integer. A hash function could easily be put in here. The last line pretty-prints the Map in a tree-like format.

import System.IO import System.Environment (getArgs) import System.Directory ( doesDirectoryExist , getDirectoryContents) import Control.Monad (mapM) import Control.Applicative ((<$>)) import System.FilePath (()) import qualified Data.Map as M

walkDirWith :: FilePath -> (Handle -> IO r) -> IO [(r, FilePath)] -> IO [(r, FilePath)] walkDirWith path f walkList = do isDir <- doesDirectoryExist path if isDir then do paths <- getDirectoryContents path concat <$> mapM (\p -> walkDirWith (path p) f walkList) [p | p <- paths, p /= ".", p /= ".."] else do rValue <- withFile path ReadMode f ((:) (rValue, path)) <$> walkList

filePathMap :: Ord r => [(r, FilePath)] -> M.Map r [FilePath] filePathMap pathPairs = foldl (\theMap (r, path) -> M.insertWith' (++) r [path] theMap) M.empty pathPairs

main :: IO () main = do [dir] <- getArgs fileSizes <- walkDirWith dir hFileSize $ return [] putStr . M.showTree $ filePathMap fileSizes

Obviously there's no right or wrong way to do it, but I'm wondering what you think.

Regards,

Yawar

_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners

4862

Age (days ago)

4873

Last active (days ago)

List overview

Download

9 comments

5 participants

participants (5)

Chaddaï Fouché
edgar klerks
Lorenzo Bolla
Michael Schober
Yawar Amin