problem with System.Directory.Tree

Hello All,

I want to build a program which will recursively scan a directory and
build an md5sum for all the files. The intent is to do something similar
to unison but more specific to my requirements. I am having trouble in
the initial part of building the md5sums.

I did some digging around and found that "System.Directory.Tree" is a
very close match for what I want to do. In fact, after a little poking
around I could do exactly what I wanted.

,----
| import Control.Monad (liftM)
| import System.Directory.Tree
| import System.Directory
| import Data.Digest.Pure.MD5
| import qualified Data.ByteString.Lazy.Char8 as L
|
| calcMD5 =
|     readDirectoryWith (\x-> liftM md5 (L.readFile x))
`----

This works perfectly for small directories. readDirectoryWith is already
defined in the library and is exactly what we want.

,----
| *Main> calcMD5 "/home/mitra/Desktop/"
|
| "/home/mitra" :/ Dir {name = "Desktop", contents = [File {name =
| "060_LocalMirror_Workflow.t.10.2.62.9.log", file =
| f687ad04bc64674134e55c9d2a06902a},File {name = "cmd_run", file =
| 6f334f302b5c0d2028adeff81bf2a0d9},File {name = "cmd_run~",
`----

However, whenever I give it something more challenging it gets into
trouble.

,----
| *Main> calcMD5 "/home/mitra/laptop/"
| *** Exception: /home/mitra/laptop/ell/calc-2.02f/calc.info-27:
| openFile: resource exhausted (Too many open files)
| *Main>
`----

If I understand what is happening, it seems to be doing all the opens
before consuming them via md5. This works fine for small directories,
but for any practical setup the number of open files could potentially
be very large. I tried forcing the md5 evaluation in the hope that the
file descriptor would be freed once the entire file is read. That did
not help, either because I could not get it right or because there is
something more subtle I am missing.

I also had a look at the code in module "System.Directory.Tree" and
although it gave me some understanding of how it works I am no closer to
a solution.

regards
--
Anand Mitra

On Monday 07 June 2010 14:06:22, Anand Mitra wrote:
,----
| calcMD5 =
|     readDirectoryWith (\x-> liftM md5 (L.readFile x))
`----

Does

    calcMD5 = readDirectoryWith (\x -> do
        txt <- L.readFile x
        return $! md5 txt)

help?
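
A minimal sketch of why forcing with `$!` can help here, assuming the leak comes from lazy `L.readFile` keeping each handle open until its contents are consumed. The `checksum` function is a toy stand-in for `md5` (an assumption, so the example needs only the bytestring package), and `lazyDigest`, `strictDigest` and `digestAll` are hypothetical names.

```haskell
import qualified Data.ByteString.Lazy as L

-- Toy stand-in for md5 (assumption): like a real digest, it has to
-- consume the whole input before it can produce a result.
checksum :: L.ByteString -> Int
checksum = L.foldl' (\acc w -> (acc * 31 + fromIntegral w) `mod` 1000003) 7

-- Lazy variant: returns an unevaluated thunk, so the file opened by
-- L.readFile stays open until somebody later demands the result.
lazyDigest :: FilePath -> IO Int
lazyDigest path = fmap checksum (L.readFile path)

-- Strict variant: ($!) forces the checksum before returning, which reads
-- the file to the end and lets the lazy reader close its handle.
strictDigest :: FilePath -> IO Int
strictDigest path = do
    txt <- L.readFile path
    return $! checksum txt

-- Digest many files one after another without accumulating open handles.
digestAll :: [FilePath] -> IO [Int]
digestAll = mapM strictDigest
```

With `lazyDigest` in place of `strictDigest`, `digestAll` would open every file before any of them is read, which matches the "Too many open files" symptom described above.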

Hi Daniel,
That works just as intended, Thanks.
On Tue, Jun 8, 2010 at 1:31 AM, Daniel Fischer wrote:
Does
    calcMD5 = readDirectoryWith (\x -> do
        txt <- L.readFile x
        return $! md5 txt)
help?
_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners

I did something similar where I built up an md5sum of all the files in
a directory for comparing whether two directories were identical (I
was cleaning up some server storage). One difference is that I only
read the first 4096 bytes of the file, because if files are going to
differ they will likely differ in those bytes (and definitely would in
my case), and that is the default page read size as I recall, so even
if you use hGet handle 512, the system still reads 4096 bytes into
memory anyway, so why not use them.
I think I had a similar problem to yours with open file handles until
I used `withFile` from System.IO. This handy function took care of
closing up file resources for me so I wouldn't have a ton of open file
handles. My getFileHash function is as follows:

    getFileHash :: FilePath -> IO (Maybe String)
    getFileHash path =
        (do
            contents <- withFile path ReadMode (\h -> hGet h 4096)
            return . Just $! md5sum contents)
        `catch` (\e -> printFileError e
                       >> return Nothing)

printFileError is just a function for printing out pretty errors
related to files.
You can see that it reads some contents of the file through withFile
and then md5sums them. I have the $! to force evaluation so it will
compute as we go; otherwise it builds a huge tree of sums waiting to
be computed before computing the result for display at the root.
There are other $! operators in the tree operations to collapse at that
level, and now the program runs in constant memory space.
--
Drew Haven
drew.haven@gmail.com
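
A self-contained sketch of the `withFile` pattern described above, assuming a strict `Data.ByteString.hGet` and a caller-supplied digest function in place of `md5sum`. The names `readPrefix` and `hashPrefix` are hypothetical, and errors are simply mapped to `Nothing` rather than printed.

```haskell
import Control.Exception (IOException, try)
import qualified Data.ByteString as B
import System.IO (IOMode (ReadMode), withFile)

-- Read at most n bytes strictly. withFile closes the handle as soon as
-- the inner action finishes, whether it succeeds or throws.
readPrefix :: Int -> FilePath -> IO (Either IOException B.ByteString)
readPrefix n path = try (withFile path ReadMode (\h -> B.hGet h n))

-- Apply a digest function to the first 4096 bytes of a file. ($!) forces
-- the digest immediately so no chain of thunks builds up across files.
hashPrefix :: (B.ByteString -> d) -> FilePath -> IO (Maybe d)
hashPrefix digest path = do
    r <- readPrefix 4096 path
    case r of
        Left _   -> return Nothing
        Right bs -> return . Just $! digest bs
```

Because `B.hGet` is strict, the bytes are already in memory when `withFile` closes the handle; a lazy read inside `withFile` would not be safe in the same way.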

Hi Again,
Thanks to the help from this group I have got past my first problem and now
stuck on another. I have succeeded in building a DirTree using library
"System.Directory.Tree" with properties of md5sum and modification time. I
have also built the parts that can compare two such trees and find the
files that have changed.
I was trying to serialize the tree so that it can be saved for later
identification of changed files. The module "Tree" incidentally derives Show
but not Read, and hence I cannot read the serialized file back. Searching a
bit, I decided to use the infrastructure in "Data.Binary" to do the job for
me.
As soon as I started this I realized that I would have to modify the Tree.hs
module. This was required because DirTree does not derive Typeable and
Data, which is required for it to be serialized via "Data.Binary". After
patching the Tree module to derive Typeable and Data I get the
following error.
System/Directory/Tree.hs:95:47:
    No instance for (Data Exception)
      arising from the 'deriving' clause of a data type declaration
                   at System/Directory/Tree.hs:95:47-50
    Possible fix:
      add an instance declaration for (Data Exception)
      or use a standalone 'deriving instance' declaration instead,
        so you can specify the instance context yourself
    When deriving the instance for (Data (DirTree a))
Failed, modules loaded: BinaryDerive.
Secondly, I suspect that I could have derived it without having to modify the
original module source. The compilation error does give a hint about
"standalone deriving instance" but trying stuff at
http://www.haskell.org/ghc/docs/6.12.2/html/users_guide/deriving.html did
not help me much.
In short, what is the simplest way I can serialize "System.Directory.Tree"
using Binary? Is there a better alternative to Binary for serialization?
Are there solutions to the problem I have outlined above, or is my approach
incorrect? Is it possible to add the deriving for datatype DirTree without
modifying the module?
regards
--
Anand Mitra

Hello Anand

System.Directory.Tree is not a good candidate for serializing directly. As
you have found out, one of the constructors (Failed) carries an IOException
which cannot be serialized (IOExceptions may contain file handles, which are
inherently runtime values).

Data.Binary is usually the best option for serialization. In your case, you
won't be able to make a 1-1 mapping between a runtime DirTree and its on-disk
representation, as you'll have to work out what to do about 'Failed'. Maybe
you would want to serialize only the good constructors, Dir and File,
instead.

As you will have to coerce the data type a bit, I'd recommend using
Data.Binary to write the serialization, but rather than make a Binary
instance (put and get) for DirTree, give the serialize and deserialize
functions characteristic names, e.g. serializeGoodTree /
deserializeGoodTree.

Best wishes

Stephen
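
One way the serializeGoodTree / deserializeGoodTree idea could look, as an unofficial sketch. The `DirTree` type below is a local stand-in (an assumption, with the error reduced to a String) so the example depends only on the binary package; the helper names `isGood`, `putGoodTree` and `getGoodTree`, the tag bytes, and the choice to drop Failed nodes silently are all illustrative.

```haskell
import Control.Monad (replicateM)
import Data.Binary (Binary, get, put)
import Data.Binary.Get (Get, getWord8, runGet)
import Data.Binary.Put (Put, putWord8, runPut)
import qualified Data.ByteString.Lazy as L

-- Local stand-in for System.Directory.Tree's DirTree (assumption).
data DirTree a
    = Dir    { name :: String, contents :: [DirTree a] }
    | File   { name :: String, file :: a }
    | Failed { name :: String, err :: String }
    deriving (Show, Eq)

isGood :: DirTree a -> Bool
isGood (Failed _ _) = False
isGood _            = True

-- Write a tag byte per constructor; Failed children are filtered out.
putGoodTree :: Binary a => DirTree a -> Put
putGoodTree (Dir n cs) = do
    putWord8 0
    put n
    let good = filter isGood cs
    put (length good)
    mapM_ putGoodTree good
putGoodTree (File n a)   = putWord8 1 >> put n >> put a
putGoodTree (Failed _ _) = return ()  -- not reached via serializeGoodTree

getGoodTree :: Binary a => Get (DirTree a)
getGoodTree = do
    tag <- getWord8
    case tag of
        0 -> do
            n  <- get
            k  <- get :: Get Int
            cs <- replicateM k getGoodTree
            return (Dir n cs)
        1 -> do
            n <- get
            a <- get
            return (File n a)
        _ -> fail "getGoodTree: unknown constructor tag"

-- Nothing when the root itself is Failed, since nothing good remains.
serializeGoodTree :: Binary a => DirTree a -> Maybe L.ByteString
serializeGoodTree t
    | isGood t  = Just (runPut (putGoodTree t))
    | otherwise = Nothing

deserializeGoodTree :: Binary a => L.ByteString -> DirTree a
deserializeGoodTree = runGet getGoodTree
```

The round trip is deliberately lossy: a tree read back from disk equals the original with its Failed nodes removed, which is why these are named functions rather than a Binary instance.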

Hi,
Continuing on my project I have hit another hurdle.
To summarize, this is what I have been trying to do: using
System.Directory.Tree to traverse a directory tree and build a DirTree of
type Props, where Props is

    data Props = Prop { md5sum :: MD5Digest, modTime :: ClockTime,
                        filenam :: FilePath }
               | Blank deriving (Show, Eq, Typeable) -- , Data

I am trying to capture the md5sum and time. Next I wanted to serialize this
structure to disk for comparison at a later invocation.
I got stuck trying to serialize DirTree using Data.Binary and encode because
DirTree has a field of type Exception which does not derive Binary/Data,
which is required for serialization. I felt that using "deriving instance"
could solve the problem but I don't understand that mechanism well enough to
solve my problem. However, an alternative was suggested and this is what I
did. It is inelegant but sort of works.
I defined an alternate private DirTreeW which mirrors the DirTree except that
the Exception is converted to a String. The data type is defined below

    data DirTreeW a = DirW    { name     :: FileName,
                                contents :: [DirTreeW a] }
                    | FileW   { name :: FileName,
                                file :: a }
                    | FailedW { name :: FileName,
                                err  :: String }
                      deriving (Show, Eq, Typeable, Data)

I then wrote a function to convert a DirTree to this type.

    convert (Dir a b)    = DirW a $ map convert b
    convert (File a b)   = FileW a b
    convert (Failed a e) = FailedW a "some error"

This version of DirTree I can serialize. However a DirTreeW of type Props
fails to serialize because of the MD5Digest and ClockTime. I tried deriving
Props from Data but that fails with the following error.
*BinaryDerive Data.Binary Data.Digest.Pure.MD5> :r
[2 of 2] Compiling Main ( dir-recurse.hs, interpreted )
Ok, modules loaded: BinaryDerive, Main.
*Main Data.Binary Data.Digest.Pure.MD5> :r
[2 of 2] Compiling Main ( dir-recurse.hs, interpreted )
dir-recurse.hs:16:56:
    No instances for (Data MD5Digest, Data ClockTime)
      arising from the 'deriving' clause of a data type declaration
                   at dir-recurse.hs:16:56-59
    Possible fix:
      add an instance declaration for (Data MD5Digest, Data ClockTime)
      or use a standalone 'deriving instance' declaration instead,
        so you can specify the instance context yourself
    When deriving the instance for (Data Props)
Failed, modules loaded: BinaryDerive.
It seems to suggest that MD5Digest and ClockTime do not derive Data and
hence I cannot derive Data for Props. Please suggest a way forward. I
would appreciate pointers on exactly what the purpose of "deriving instance"
is. I found the relevant references in the GHC compiler reference but they
are far too cryptic and I cannot understand them.
I am attaching the full source in case you need to examine it.
Hoping to see useful suggestions soon.
regards
--
Anand Mitra
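
For the "deriving instance" question, a small sketch of what standalone deriving does, using a hypothetical `Stamp` type in place of a library type. It lets you request a derived instance separately from the data declaration, possibly in a different module, but as far as I know GHC still needs the type's constructors in scope, so it would not be expected to work for an abstract type such as MD5Digest.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE StandaloneDeriving #-}

import Data.Data (Data, showConstr, toConstr)

-- Hypothetical type standing in for one declared elsewhere without any
-- deriving clause; its constructor is visible here.
data Stamp = Stamp Integer Integer

-- Standalone deriving: the instances are requested after the fact,
-- separately from the data declaration itself.
deriving instance Show Stamp
deriving instance Eq Stamp
deriving instance Data Stamp

-- Uses the derived Data instance to look up the constructor's name.
constructorName :: Stamp -> String
constructorName = showConstr . toConstr
```

A derived Data instance also needs Data instances for every field type, which is the requirement the error message above is reporting for MD5Digest and ClockTime.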

Hi Anand

MD5Digest is an abstract type (the constructor is not exported from its
module) but it is an instance of Binary. ClockTime (from System.Time) is not
an instance of Binary but it does export its constructor. Neither is an
instance of Data.Data.

So I would hand-craft an instance of Binary for the Props datatype rather
than try to first make them instances of Data.

The code will be something like this; as I don't have MD5 installed it is
unchecked:

    instance Binary Props where
      put (Prop md5 tim name) = do { putWord8 0   -- number each constructor
                                   ; put md5      -- MD5Digest has a Binary instance
                                   ; putTOD tim
                                   ; put name }
      put Blank               = do { putWord8 1 } -- number each constructor

      get = do { typ <- getWord8                  -- get the constructor tag...
               ; case typ of
                   0 -> getProp
                   1 -> return Blank }

    getProp :: Get Props
    getProp = do { md5 <- get
                 ; tim <- getTOD
                 ; name <- get
                 ; return (Prop md5 tim name) }

    -- ClockTime doesn't have a Binary instance
    -- but it has a single constructor
    --
    -- > TOD Integer Integer
    --
    -- and Integer has a Binary instance, so I
    -- would make auxiliaries for put and get:

    putTOD :: ClockTime -> Put
    putTOD (TOD a b) = do { put a ; put b }

    getTOD :: Get ClockTime
    getTOD = do { a <- get; b <- get; return (TOD a b) }
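
A compilable variant of the same tag-byte approach, offered as a sketch. To keep it self-contained, `ClockTime` and `Props` are redefined locally as stand-ins and the digest is held as a plain String (both assumptions); with the real types, `put`/`get` on the MD5Digest field would use that type's own Binary instance instead.

```haskell
import Data.Binary (Binary (..), decode, encode)
import Data.Binary.Get (Get, getWord8)
import Data.Binary.Put (Put, putWord8)

-- Stand-in for System.Time's ClockTime (assumption).
data ClockTime = TOD Integer Integer
    deriving (Show, Eq)

-- Stand-in for Props; the digest is a String here (assumption).
data Props
    = Prop { md5sum :: String, modTime :: ClockTime, filenam :: FilePath }
    | Blank
    deriving (Show, Eq)

putTOD :: ClockTime -> Put
putTOD (TOD a b) = put a >> put b

getTOD :: Get ClockTime
getTOD = do
    a <- get
    b <- get
    return (TOD a b)

instance Binary Props where
    -- One tag byte per constructor, followed by its fields in order.
    put (Prop d t n) = putWord8 0 >> put d >> putTOD t >> put n
    put Blank        = putWord8 1
    get = do
        tag <- getWord8
        case tag of
            0 -> do
                d <- get
                t <- getTOD
                n <- get
                return (Prop d t n)
            1 -> return Blank
            _ -> fail "Props: unknown constructor tag"

-- Encode and immediately decode again, e.g. for a quick sanity check.
roundTrip :: Props -> Props
roundTrip = decode . encode
```

The extra `_` branch makes a corrupted or unexpected tag byte fail with a message instead of a pattern-match error.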

All the advice and help I got on my difficulties till now have been
very useful. My current problem is a little weird and I can't
figure out what is happening.
I have been able to get the serialization working with DirTree based
on the suggestions I have received till now. I have a function calcMD5
which, given a FilePath, will traverse the entire tree calculating the
checksum of each file it encounters. The resultant structure is
serializable by encode. But when I do an encodeFile to store the result
to a file I get nothing.
,----
| *Main> calcMD5 "/tmp/tmp"
| AncTree "/tmp" (DirW {name = "tmp", contents = [FileW {name = "passwd",
| file = Prop {md5sum = f54e7cef69973cecdce3c923da2f9222, modTime = Tue Jul 6
| 07:18:16 IST 2010, filenam = "/tmp/tmp/passwd"}}]})
|
| *Main> liftM encode $ calcMD5 "/tmp/tmp"
| Chunk
| "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\EOT/tmp\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\ETXtmp\NUL\NUL\NUL\NUL\NUL\NUL\NUL\SOH\SOH\NUL\NUL\NUL\NUL\NUL\NUL\NUL\ACKpasswd\NUL\245N|\239i\151<\236\220\227\201#\218/\146\"\NULL2\139`\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\SI/tmp/tmp/passwd"
| Empty
`----
Clearly the encoding is working.
But if I try to use encodeFile to write it to a file, the file is not
created.
,----
| *Main> liftM (encodeFile "/tmp/tmp-list") $ calcMD5 "/tmp/tmp"
|
| $ ls /tmp/tmp-list
| ls: cannot access /tmp/tmp-list: No such file or directory
`----
I tried a few other things, like converting the ByteString from encode
into a String using show and then doing a writeFile on
it. Unfortunately none of them worked.
regards
--
Anand Mitra

On Tuesday 06 July 2010 04:01:58, Anand Mitra wrote:
,----
| *Main> liftM (encodeFile "/tmp/tmp-list") $ calcMD5 "/tmp/tmp"
`----

calcMD5 :: IO something

I suppose? Then

    liftM (encodeFile "/tmp/tmp-list") $ calcMD5 "/tmp/tmp"

has type IO (IO ()), and executing that only evaluates the action

    (encodeFile "/tmp/tmp-list" calcMD5Result)

, it doesn't execute it. What you want is

    calcMD5 "/tmp/tmp" >>= encodeFile "/tmp/tmp-list"
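
The difference between the two forms can be reproduced without calcMD5. In this sketch `compute` is a hypothetical stand-in for it, and `nested`, `chained`, `joined` and `demo` are illustrative names; only `liftM`, `join`, `encodeFile` and the System.Directory calls are real library functions.

```haskell
import Control.Monad (join, liftM, when)
import Data.Binary (encodeFile)
import System.Directory (doesFileExist, removeFile)

-- Hypothetical stand-in for calcMD5: any IO action yielding a Binary value.
compute :: IO [Int]
compute = return [1, 2, 3]

-- liftM maps encodeFile over the result, producing an action that returns
-- another action. Running the outer one never runs the inner write.
nested :: FilePath -> IO (IO ())
nested path = liftM (encodeFile path) compute

-- (>>=) runs compute and then runs encodeFile on its result.
chained :: FilePath -> IO ()
chained path = compute >>= encodeFile path

-- join flattens IO (IO ()) to IO (), which behaves like chained.
joined :: FilePath -> IO ()
joined path = join (nested path)

-- Reports whether the file exists after each variant, then cleans up.
-- Assumes the path does not exist beforehand.
demo :: FilePath -> IO (Bool, Bool)
demo path = do
    _ <- nested path
    afterNested <- doesFileExist path
    chained path
    afterChained <- doesFileExist path
    when afterChained (removeFile path)
    return (afterNested, afterChained)
```

Running `demo` on a fresh path should give `(False, True)`: the nested form leaves no file behind, while the chained form writes it.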
participants (5)
- Anand Mitra
- Anand Mitra
- Daniel Fischer
- Drew Haven
- Stephen Tetley