[ANNOUNCE] (and request for review): directory-tree v0.9.0

Greetings Haskellers!
directory-tree is a module providing a directory-tree-like datatype with Foldable and Traversable instances, plus a simple, high-level IO interface. You can see the package, along with some examples, here (apologies if the haddock docs haven't been generated yet):
http://hackage.haskell.org/package/directory-tree
The primary change in this release is the addition of two experimental "lazy" functions: `readDirectoryWithL` and `buildL`. These functions use `unsafePerformIO` behind the scenes to traverse the filesystem as required by pure computations consuming the returned DirTree data structure. I believe I am doing this safely and sanely, but would love it if some more experienced folks could comment on the code.
These changes (and the whole revamping of this originally very simple module) were inspired by the fact that a few people seemed to really like this API, and by this recent reddit post lamenting the perceived difficulty of writing a `du`-like function in Haskell:
http://www.reddit.com/r/haskell/comments/cs54i/how_would_you_write_du_in_has...
One could write such a function using directory-tree as follows (sorry if the monadic compositional style is foreign):
    import System.Directory.Tree
    import qualified Data.Foldable as F
    import System.IO
    import Control.Monad

    du :: FileName -> IO ()
    du = print . F.sum . free <=< readDirectoryWithL (hFileSize <=< readHs)
        where readHs = flip openFile ReadMode
Thanks for reading and for any input, especially performance suggestions or opinions on my unsafe function usage. I hope this is useful to someone.
Sincerely,
Brandon Simmons
http://coder.bsimmons.name/
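
As a rough illustration of what "traversing the filesystem as required" can mean, here is a minimal sketch that defers each directory listing until it is demanded. It is not directory-tree's implementation: it uses `unsafeInterleaveIO` rather than the `unsafePerformIO` mentioned in the announcement, and the `Tree` type and the names `buildLazy` and `payloads` are hypothetical, made up for this example.

```haskell
import System.Directory (doesDirectoryExist, listDirectory)
import System.FilePath ((</>))
import System.IO.Unsafe (unsafeInterleaveIO)

-- | Hypothetical stand-in for a directory tree; NOT the library's DirTree.
data Tree a
  = File FilePath a
  | Dir FilePath [Tree a]

-- | Build a tree whose contents are read from disk only when demanded.
buildLazy :: (FilePath -> IO a) -> FilePath -> IO (Tree a)
buildLazy f path = do
  isDir <- doesDirectoryExist path
  if isDir
    then do
      -- Deferred: the listing and the recursion run only when a
      -- consumer forces the list of children.
      children <- unsafeInterleaveIO $ do
        names <- listDirectory path
        mapM (buildLazy f . (path </>)) names
      return (Dir path children)
    else do
      -- Deferred: the per-file action runs when the payload is forced.
      x <- unsafeInterleaveIO (f path)
      return (File path x)

-- | Collect every file payload, forcing the tree as it goes.
payloads :: Tree a -> [a]
payloads (File _ x) = [x]
payloads (Dir _ ts) = concatMap payloads ts
```

With this shape, `buildLazy` returns as soon as it has checked the root, and something like `sum . payloads` drives the actual traversal. Any IO error in the deferred parts would surface wherever the pure consumer happens to force them.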

On Mon, Aug 9, 2010 at 10:48 PM, Brandon Simmons <brandon.m.simmons@gmail.com> wrote:
> Greetings Haskellers!
> directory-tree is a module providing a directory-tree-like datatype with Foldable and Traversable instances, plus a simple, high-level IO interface. You can see the package, along with some examples, here (apologies if the haddock docs haven't been generated yet):
If I understand what you're saying, then your library is very similar to an abstraction that darcs had for years, known as "Slurpy". The experience in the darcs project was that it led to performance issues and correctness issues that were hard to find and fix.
> The primary change in this release is the addition of two experimental "lazy" functions: `readDirectoryWithL` and `buildL`. These functions use `unsafePerformIO` behind the scenes to traverse the filesystem as required by pure computations consuming the returned DirTree data structure. I believe I am doing this safely and sanely, but would love it if some more experienced folks could comment on the code.
unsafePerformIO or unsafeInterleaveIO? Either way, to me it seems a bit dangerous to be doing this sort of lazy IO. If the directory structure is large, will I run out of file handles? How will IO errors be handled? Will I receive the exceptions in pure code or inside my IO actions? Will I run into space leaks if something holds on to 1 file and then references it "after" the directory traversal? I might have my history wrong, but as I recall darcs started with lazy slurpies and moved to doing things strictly due to space leaks, running out of file descriptors, file descriptor leaks (not running out, but having the file be locked long after darcs should have been 'done' with it), and exception delivery.
It's a seductive path, but one that does not seem to have a good ending. I'm not sure what darcs uses these days. Perhaps that's what hashed-storage provides, although I haven't been able to find any documentation on hashed-storage other than the haddocks (which only document the API, with no overview or explanation of the problem hashed-storage solves).
Jason
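
For comparison, the strict approach described above can be sketched as follows. This is a minimal example using only the `directory` and `filepath` packages, not anything from darcs or directory-tree, and the name `duStrict` is made up for illustration. It assumes the tree contains no symlink cycles.

```haskell
import Control.Monad (forM)
import System.Directory (doesDirectoryExist, listDirectory)
import System.FilePath ((</>))
import System.IO (IOMode (ReadMode), hFileSize, withFile)

-- | Total size in bytes of all files under a path, computed strictly.
-- withFile closes each handle before the next file is opened, so at
-- most one file handle is open at any time.
duStrict :: FilePath -> IO Integer
duStrict path = do
  isDir <- doesDirectoryExist path
  if isDir
    then do
      names <- listDirectory path
      sizes <- forM names (duStrict . (path </>))
      return (sum sizes)
    else withFile path ReadMode hFileSize
```

Here any IO error is raised as an ordinary exception inside the IO action, at the point where the offending file or directory is touched, rather than later in pure code.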

On Tue, Aug 10, 2010 at 4:34 PM, Jason Dagit wrote:
> On Mon, Aug 9, 2010 at 10:48 PM, Brandon Simmons wrote:
>> Greetings Haskellers!
>> directory-tree is a module providing a directory-tree-like datatype with Foldable and Traversable instances, plus a simple, high-level IO interface. You can see the package, along with some examples, here (apologies if the haddock docs haven't been generated yet):
> If I understand what you're saying, then your library is very similar to an abstraction that darcs had for years, known as "Slurpy". The experience in the darcs project was that it led to performance issues and correctness issues that were hard to find and fix.
>> The primary change in this release is the addition of two experimental "lazy" functions: `readDirectoryWithL` and `buildL`. These functions use `unsafePerformIO` behind the scenes to traverse the filesystem as required by pure computations consuming the returned DirTree data structure. I believe I am doing this safely and sanely, but would love it if some more experienced folks could comment on the code.
> unsafePerformIO or unsafeInterleaveIO? Either way, to me it seems a bit dangerous to be doing this sort of lazy IO. If the directory structure is large, will I run out of file handles? How will IO errors be handled? Will I receive the exceptions in pure code or inside my IO actions? Will I run into space leaks if something holds on to 1 file and then references it "after" the directory traversal? I might have my history wrong, but as I recall darcs started with lazy slurpies and moved to doing things strictly due to space leaks, running out of file descriptors, file descriptor leaks (not running out, but having the file be locked long after darcs should have been 'done' with it), and exception delivery.
IO errors are caught and stored in a pure constructor called "Failed". In practice I think my unsafe version is better in many of those respects than the original, for example with regard to running out of file handles. Are you referring to lazy IO in general, to which the problems you mention seem to apply, or to the use of unsafePerformIO?
I certainly want this module to be as useful and problem-free as possible, but I will be content if it is no more problematic than lazy IO.
Could you elaborate on
> "Will I run into space leaks if something holds on to 1 file and then references it "after" the directory traversal"?
> It's a seductive path, but one that does not seem to have a good ending. I'm not sure what darcs uses these days. Perhaps that's what hashed-storage provides, although I haven't been able to find any documentation on hashed-storage other than the haddocks (which only document the API, with no overview or explanation of the problem hashed-storage solves).
> Jason
Eric Kow just pointed out the existence of hashed-storage to me (I believe you are right that it is what darcs does/will use) and it will be interesting to see the approach in there, if I can grok it.
Thanks a lot for the input.
Brandon Simmons
http://coder.bsimmons.name
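
The idea of capturing IO errors in a constructor can be sketched like this. Only the constructor name "Failed" comes from the thread; the simplified `Node` type and the helpers `readNode` and `failures` are hypothetical and are not directory-tree's actual definitions.

```haskell
import Control.Exception (IOException, try)

-- | Hypothetical, simplified node type for illustration only.
data Node a
  = FileNode FilePath a
  | Failed FilePath IOException
  deriving Show

-- | Run a per-file action, turning an IOException into a Failed value
-- instead of letting it propagate to the caller.
readNode :: (FilePath -> IO a) -> FilePath -> IO (Node a)
readNode f path = do
  r <- try (f path)
  return $ case r of
    Left e  -> Failed path e
    Right x -> FileNode path x

-- | Pick out the failures so that a caller can inspect or report them.
failures :: [Node a] -> [(FilePath, IOException)]
failures ns = [(p, e) | Failed p e <- ns]
```

One caveat with this pattern: `try` only covers exceptions thrown while the action itself runs. Work that is deferred past the `try`, as with lazily read contents, is not covered by it.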

On Tue, Aug 10, 2010 at 5:54 PM, Brandon Simmons <brandon.m.simmons@gmail.com> wrote:
> On Tue, Aug 10, 2010 at 4:34 PM, Jason Dagit wrote:
>> On Mon, Aug 9, 2010 at 10:48 PM, Brandon Simmons wrote:
>>> Greetings Haskellers!
>>> directory-tree is a module providing a directory-tree-like datatype with Foldable and Traversable instances, plus a simple, high-level IO interface. You can see the package, along with some examples, here (apologies if the haddock docs haven't been generated yet):
>> If I understand what you're saying, then your library is very similar to an abstraction that darcs had for years, known as "Slurpy". The experience in the darcs project was that it led to performance issues and correctness issues that were hard to find and fix.
>>> The primary change in this release is the addition of two experimental "lazy" functions: `readDirectoryWithL` and `buildL`. These functions use `unsafePerformIO` behind the scenes to traverse the filesystem as required by pure computations consuming the returned DirTree data structure. I believe I am doing this safely and sanely, but would love it if some more experienced folks could comment on the code.
>> unsafePerformIO or unsafeInterleaveIO? Either way, to me it seems a bit dangerous to be doing this sort of lazy IO. If the directory structure is large, will I run out of file handles? How will IO errors be handled? Will I receive the exceptions in pure code or inside my IO actions? Will I run into space leaks if something holds on to 1 file and then references it "after" the directory traversal? I might have my history wrong, but as I recall darcs started with lazy slurpies and moved to doing things strictly due to space leaks, running out of file descriptors, file descriptor leaks (not running out, but having the file be locked long after darcs should have been 'done' with it), and exception delivery.
> IO errors are caught and stored in a pure constructor called "Failed". In practice I think my unsafe version is better in many of those respects than the original, for example with regard to running out of file handles. Are you referring to lazy IO in general, to which the problems you mention seem to apply, or to the use of unsafePerformIO?
It boils down to the same thing, right?
> I certainly want this module to be as useful and problem-free as possible, but I will be content if it is no more problematic than lazy IO.
> Could you elaborate on
>> "Will I run into space leaks if something holds on to 1 file and then references it "after" the directory traversal"?
Let me give you an example. Prelude's readFile is lazy. That is, it returns immediately and then only fetches from the file as you demand the contents of the file. This makes it possible to stream the file. If you process it in chunks, say one line at a time, then you can do so in constant space. If you then let the contents of the file escape, meaning something else in the processing still references them, then you'll stop streaming and start holding on to the whole thing at once. Something like this, untested:
    notleaky1 = do
      xs <- readFile "foo"
      mapM_ print (lines xs)

    notleaky2 = do
      xs <- readFile "foo"
      print (length xs)

    leaky = do
      xs <- readFile "foo"
      mapM_ print (lines xs)
      print (length xs)

    handleleak = do
      xs <- readFile "foo"
      return (take 10 xs)
Now, in leaky, the whole of xs is retained while the lines are printed because `length xs` still needs it; if you calculated the length and printed the lines in the same iteration, the leak would go away. In the handleleak example, the file stays open even after handleleak produces all 10 elements.
Now imagine those examples in terms of directory traversals instead of reads from a file. This would still be a problem even if you replaced readFile with readFile':
    readFile' f = unsafePerformIO (readFile f)
I hope that helps,
Jason
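
The two fixes hinted at above can be sketched as follows. This is an untested-in-anger illustration, not code from the thread: the names `printAndCount` and `firstTen` are made up. The first function prints the lines and counts characters in one pass; the second forces the wanted prefix before the handle is closed.

```haskell
import Control.Exception (evaluate)
import Control.Monad (foldM)
import System.IO (IOMode (ReadMode), hGetContents, withFile)

-- | Print each line and count characters in a single pass, so the
-- consumed part of the file can be garbage collected as we go.
-- The count adds one newline per line, which matches `length xs`
-- when the file ends with a newline.
printAndCount :: FilePath -> IO Int
printAndCount path = do
  xs <- readFile path
  foldM step 0 (lines xs)
  where
    step n l = do
      print l
      -- Force the running total so no chain of thunks builds up.
      return $! n + length l + 1

-- | Return the first ten characters without leaving the file open:
-- the prefix is forced before withFile closes the handle.
firstTen :: FilePath -> IO String
firstTen path = withFile path ReadMode $ \h -> do
  xs <- hGetContents h
  let ys = take 10 xs
  _ <- evaluate (length ys)
  return ys
```

In `printAndCount` nothing refers to `xs` after the fold has started, which is what lets the contents stream. In `firstTen` the handle's lifetime is tied to `withFile` rather than to whenever the lazy string happens to be collected.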
participants (2)
- Brandon Simmons
- Jason Dagit