apfelmus' interface degustation - System.FilePath.Find

Chers amis of functionaλ cooking, aujourd'hui, le maître de cuisine is going to criticize System.FilePath.Find http://darcs.serpentine.com/filemanip/dist/doc/html/ /FileManip/System-FilePath-Find.html Is it seasoned too much with Curry and Monads? Pure or diluée with side effects? Apfelmus will tell you! And like every serious cooking critic, le maître thinks no end of himself so he can judge the haddock without eating the compiled meal. Who did not yet have to scan the file buffet for data of his gusto? This is what System.FilePath.Find helps you to do. It basically offers a function find :: RecursionPredicate -> FilterPredicate -> FilePath -> IO [FilePath] to lazily traverse a directory tree filtering out files that do not match the FilterPredicate. The RecursionPredicate decides whether to recurse into or a sub-directory or not (it could be mentioned explicitly that the predicate is only invoked on directories). A variant of find includes user-defined error handling. How to define predicates? Mon dieu! They are to be obtained from a monad FindClause a type FilterPredicate = FindClause Bool which offers operations like extension :: FindClause FilePath (==?) :: Eq a => FindClause a -> a -> FindClause Bool (||?) = liftM2 (||) One example predicate would be (extension ==? ".hs") ||? (extension ==? ".lhs") Cette soupe is much too monadic! In particular, we have the isomorphism (FileInfo -> a) ≅ FindClause a witnessed by the function pair (f,g) = ((`liftM` fileInfo), evalClause) Of course, the encapsulation was made to allow a formulation like extension ==? ".hs" which does not mention the FileInfo parameter. But alas, this can also be achieved more naturally by appealing to the instance Control.Applicative ((->) a) Abstracting the concrete representation away into FindClause hinders reuse of already existing functionality from System.FilePath, although the name "extension" is arguably more succinct than "takeExtension". The monad could make sense if the predicate might involve not only the file status but also looking inside the file. Returning all files with a certain magic number would be such a use case but System.FilePath.Find currently does not offer zis possibilité. Un autre point is that the library offers a function fold that is almost a foldl' over all files in a directory tree, but not quite: here, the fold goes over FileInfos but the list is a simple list of FilePaths. One has to foldM over the latter and get the file infos again to achieve the same effect. Le maître de cuisine thinks that these quirks stem from the fact that the file system lacks a purely functional design. The observation is that given persistence, reading does not need to be monadic. In other words, directory tree traversal might well return a list of files data File = File { contents :: String, status :: FileStatus } with contents and status. More generally, the traversal could be a side-effect-free traversal of a pure data structure. Of course, a purely functional file system needs further research, also due to performance requirements. Another direction of généralisation is that XML data is in principle similar to a file system. Shouldn't both have a common traversal interface? Last but not least, every modern cooking critic includes a Rating: λλλ Le maître awards three λ since the library uses a monad where none should be, but acknowledges that this was for the ability to formulate queries without plumbing the FileInfo-parameter everywhere. Bon appetit, apfelmus

Bryan O'Sullivan wrote:
apfelmus wrote:
Thank you for the review and constructive criticism. Although highly amusing, some of what you write is elliptical enough in style that I'm having trouble following a few of your points. Doubly so since I've thus far gone years without paying attention to Control.Applicative
Oui, you're right. I'm going to elaborate, although critique incompréhensible is a must, otherwise it would quickly become obvious that the critic has absolutely no clue ;)
The RecursionPredicate decides whether to recurse into or a sub-directory or not (it could be mentioned explicitly that the predicate is only invoked on directories).
Good Point!
Cette soupe is much too monadic! In particular, we have the isomorphism
(FileInfo -> a) ≅ FindClause a
witnessed by the function pair
(f,g) = ((`liftM` fileInfo), evalClause)
Of course, the encapsulation was made to allow a formulation like
extension ==? ".hs"
which does not mention the FileInfo parameter. But alas, this can also be achieved more naturally by appealing to the
instance Control.Applicative ((->) a)
The intention is to present the predicates (FileInfo -> a) in a way that does not mention the parameter over and over. I mean, one could write the example as \info -> (extension info == ".hs") || (extension info == ".lhs") but threading the parameter around will quickly become a nuisance with more complex queries. One solution is to use the Reader monad, and in essence that's what System.FilePath.Find currently does. But I think that a) The Reader is better seen as applicative functor than as monad b) Making (FindClause a) a mere type synonym of (FileInfo -> a) has the benefit that the user can choose whether he wants to use monads or applicative functors via the respective instances or whether he does not. Programming with applicative functors is like programming with monads but you're only allowed to use return and `ap`. The example can be written for monads as return (||) `ap` (return (== ".hs") `ap` extension) `ap` (return (== ".lhs") `ap` extension) or for general applicative functors as (||) <$> ((== ".hs") <$> extension) <*> ((== ".lhs") <$> extension) In the end, one will probably use the custom combinators (==?) and (||?) anyway, so that it doesn't matter whether it's monad or applicative or whatever. But not making (FindClause a) opaque gives more freedom to the user. Note that providing functionality by making things instances of very powerful classes should be documented explicitly. A recent question on haskell-cafe was about Data.Tree and one can argue that the innocent looking instances for Foldable, Functor and Traversable are maybe 75% of the functionality that Data.Tree offers :)
Abstracting the concrete representation away into FindClause hinders reuse of already existing functionality from System.FilePath,
Yes, that's unfortunate.
The monad could make sense if the predicate might involve not only the file status but also looking inside the file.
How would you suggest doing so? Just a simple unsafeInterleaveIO readFile?
Returning all files with a certain magic number would be such a use case but System.FilePath.Find currently does not offer zis possibilité.
I'm certainly all for improving it, so this kind of suggestion is most welcome.
I guess that using (FileInfo -> IO Bool) and providing a specialized (FileInfo -> Bool) version is probably a safe way. But with IO, the traversal could even change files in place, which is probably not a good idea. Maybe unsafePerformIO is the best solution, because you may safely close the file after having evaluated predicate (unsafePerformIO $ hGetContents handle) (fileinfo) to weak head normal form, i.e. True or False. I think it fits perfectly.
Un autre point is that the library offers a function fold that is almost a foldl' over all files in a directory tree, but not quite: here, the fold goes over FileInfos but the list is a simple list of FilePaths. One has to foldM over the latter and get the file infos again to achieve the same effect.
But the FilePath is embedded in a FileInfo, so a double traversal is unneeded.
Yes. I've not been clear, I mean the other way round. Using System.FilePath.Find.fold gives you both file status and file path but the ought-to-be equivalent approach of using foldl' on the list returned by find only gives the file path but no file status. So, the suggestion is to make find return a list of FileInfo find :: ... -> IO [FileInfo] If you only want a list of file paths, you can always do liftM (map filepath) . find ... Of course, this leads to the question whether find should be factored even more into generate & prune find r p dir = map filepath . filter p . directoryWalk r dir with the intention that System.FilePath.Find only exports directoryWalk. Which leads to the question whether directoryWalk can be factored as well which in turn leads to the next question:
Le maître de cuisine thinks that these quirks stem from the fact that the file system lacks a purely functional design. The observation is that given persistence, reading does not need to be monadic. In other words, directory tree traversal might well return a list of files
data File = File { contents :: String, status :: FileStatus }
with contents and status. More generally, the traversal could be a side-effect-free traversal of a pure data structure.
The darcs authors already tried this, and gave up on the idea. Once you have a pure data structure, you start developing notions that it makes sense to manipulate it, and then all is lost once you turn your mind to applying those manipulations to the real world.
Yes, the data structure ought to be read-only. But by making it an abstract data type with proper access functions, maybe this goal can be achieved. In any case, with such a data structure, directoryWalk can be factored as well. In particular, I have something in mind, based on the following Question: "Given a directory tree, what are you going to do with it?". Answers : "Well, I rename every file" (=> map) "Calculate total size" (=> fold) The assumption is that everything you'll ever do with a huge directory tree is to map or fold it. So, here comes my crazy speculation: make the directory tree a phantom type data DirTree a type DirectoryTree = DirTree FileInfo and implement Functor, Foldable and Traversable for it. So, printing out all files would become Data.Foldable.mapM_ (putStrLn . filepath) :: DirTree FileInfo -> IO () In conclusion, there are several ways to generalize an interface. One way is to add more options and parameters to a function. But the other way is to shatter a monolithic function into tiny pieces that can be reassembled and composed at will. I think that the latter is the spirit and the source of power of functional programming. Regards, apfelmus

b) Making (FindClause a) a mere type synonym of (FileInfo -> a) has the benefit that the user can choose whether he wants to use monads or applicative functors via the respective instances or whether he does not.
That's where I had started out, as a matter of fact.
or for general applicative functors as
(||) <$> ((== ".hs") <$> extension) <*> ((== ".lhs") <$> extension)
I don't find that very readable, I must confess.
Maybe unsafePerformIO is the best solution, because you may safely close the file after having evaluated
predicate (unsafePerformIO $ hGetContents handle) (fileinfo)
to weak head normal form, i.e. True or False. I think it fits perfectly.
In principle unsafeInterleaveIO $ readFile fileName ought to be better, because it will not try to open the file unless the predicate actually inspects it, and opening files is expensive. But it will also not close the file until a finalizer kicks in. A tidier approach might be: maybeH <- newIORef Nothing contents <- unsafeInterleaveIO $ do h <- openFile fileName ReadMode writeIORef maybeH (Just h) hGetContents h let result = predicate contents result `seq` readIORef maybeH >>= maybe (return ()) hClose That's a little cumbersome, but very appealing from the perspective of a user of the library. Unfortunately, it looks to me like it's not very safe; see below.
Using System.FilePath.Find.fold gives you both file status and file path but the ought-to-be equivalent approach of using foldl' on the list returned by find only gives the file path but no file status. So, the suggestion is to make find return a list of FileInfo
Let me pass an idea by you. There's a problem with augmenting FileInfo to potentially cause IO to occur as above (both with your original suggestion and my changes), no? We lose the ability to control when a file might be opened, and when it ought to be closed. If that were the case, the fold interface would have the same problem, if it decided for some reason to sock away FileInfo values in an accumulator and work on them after the fold had completed.
Of course, this leads to the question whether find should be factored even more into generate & prune
find r p dir = map filepath . filter p . directoryWalk r dir
with the intention that System.FilePath.Find only exports directoryWalk.
That's a nice idea, but subject to the same problems as the fold interface would be if we added file contents to the interface. Your other observations regarding making a directory tree abstract, and instances of some of our favourite typeclasses, are very appealing. Now if only I could figure out a clean way to avoid bad things happening in the presence of that user-friendly IO code...

Bryan O'Sullivan wrote:
b) Making (FindClause a) a mere type synonym of (FileInfo -> a) has the benefit that the user can choose whether he wants to use monads or applicative functors via the respective instances or whether he does not.
That's where I had started out, as a matter of fact.
Yes. What I want to say is that you're still there, even if not intended. Although FindClause is implemented as a state monad, the functions evalClause :: FindClause a -> (FileInfo -> a) (`liftM` fileInfo) :: (FileInfo -> a) -> FindClause a are isomorphisms. In other words, nothing is gained or lost compared to (FileInfo -> a). Interestingly, these functions can be defined for any state monad, but in general, their composition is not the identity. Here it is because put is not available. Whether an abstract data type is isomorphic to some other type depends on what functions are defined but also on what functions are not defined.
or for general applicative functors as
(||) <$> ((== ".hs") <$> extension) <*> ((== ".lhs") <$> extension)
I don't find that very readable, I must confess.
Neither do I. It's probably due to (||) and (==) being infix operators, it's quite readable for normal functions. I think it's indeed best to have custom infix operators like (==?) and (||?). Btw, liftOp works for any functor liftOp :: Functor f => (a -> b -> c) -> f a -> b -> f c liftOp = flip . (fmap .) . flip
Maybe unsafePerformIO is the best solution, because you may safely close the file after having evaluated
predicate (unsafePerformIO $ hGetContents handle) (fileinfo)
to weak head normal form, i.e. True or False. I think it fits perfectly.
In principle
unsafeInterleaveIO $ readFile fileName
ought to be better, because it will not try to open the file unless the predicate actually inspects it, and opening files is expensive. But it will also not close the file until a finalizer kicks in. A tidier approach might be:
maybeH <- newIORef Nothing contents <- unsafeInterleaveIO $ do h <- openFile fileName ReadMode writeIORef maybeH (Just h) hGetContents h let result = predicate contents result `seq` readIORef maybeH >>= maybe (return ()) hClose
That's a little cumbersome, but very appealing from the perspective of a user of the library. Unfortunately, it looks to me like it's not very safe; see below.
Yes, this code looks best. And I think it's safe if the predicate may only return unary constructors like True or False. I mean, evaluating this to WHNF is like evaluating it to NF and every unused part of the file can safely be discarded, since there are no unevaluated thunks depending on the file contents anymore (and which are not garbage).
Using System.FilePath.Find.fold gives you both file status and file path but the ought-to-be equivalent approach of using foldl' on the list returned by find only gives the file path but no file status. So, the suggestion is to make find return a list of FileInfo
Let me pass an idea by you. There's a problem with augmenting FileInfo to potentially cause IO to occur as above (both with your original suggestion and my changes), no? We lose the ability to control when a file might be opened, and when it ought to be closed.
I agree, augmenting FileInfo with actual file contents is probably not advisable, although it would be very functional in style. Better ditch the idea for now. However, we probably could safely salvage two lazy reading operations: 1) reading the FileStatus record of a file 2) reading file contents if to be used in a True/False predicate The common pattern is that the information extracted is small enough to be deepSeq'ed, so that the file can be closed early once the lazy read is triggered.
If that were the case, the fold interface would have the same problem, if it decided for some reason to sock away FileInfo values in an accumulator and work on them after the fold had completed.
(The rank-2 trick can prevent FileInfos to leak out into the result of the fold fold :: (forall s . a -> FileInfo s -> a) -> FilePath -> IO a but that doesn't help very much as long as one would want a to be able to depend on the file contents via one of contents :: forall s . FileInfo s -> String withContents :: forall s . FileInfo s -> (String -> a) -> a so this is not an option.) Concerning fold, I think that it is the very core of the directory traversal algorithm: traversing a tree is a catamorphism. In other words, every other traversal can be formulated with a fold. For instance, we currently almost have find r p dir = fold r (flip $ (:) . infoPath) [] dir except that find returns a lazy list whereas fold traverses the directory tree before returning the list. But fold is currently not the general catamorphism, which is fold :: (FileInfo -> [a] -> a) -> (FileInfo -> a) -> FilePath -> IO a (I have been tricked by the name 'Foldable', this class does not give the catamorphism I had in mind.) This fold has to be lazy to be useful since only this allows to skip hole subdirectories. With the catamorphism, we can express find and the old fold: toList r p = fold branch leaf where branch dir xs = if r dir then concat xs else [] leaf file = if p file then [file] else [] find r p = map infoPath . toList r p oldfold r f x root = listSeq `fmap` foldl' f x `fmap` toList r (const True) root where listSeq xs = length xs `seq` xs (Btw, does the old fold only fold the value of type a over regualar files or also over directories?) Being able to return FileInfos captures reading operation 1) but misses the opportunity 2). However, 2) can simply be supplied as an extra function contentsSatisfies :: (String -> Bool) -> FilePath -> Bool contentsSatisfies f file = unsafePerformIO $ do h <- openFile file ReadMode b <- f `fmap` hGetContents h b `seq` hClose h return b to be used at the users discretion.
Of course, this leads to the question whether find should be factored even more into generate & prune
find r p dir = map filepath . filter p . directoryWalk r dir
with the intention that System.FilePath.Find only exports directoryWalk.
That's a nice idea, but subject to the same problems as the fold interface would be if we added file contents to the interface.
Your other observations regarding making a directory tree abstract, and instances of some of our favourite typeclasses, are very appealing.
As soon as we supply a (lazy) catamorphism like above, the user can make a concrete directory tree data File = Directory FileInfo [File] | File FileInfo (fold Directory File) :: FilePath -> IO File and the only way to keep the type abstract is to only supply weaker stuff than fold. Which is probably too weak then. Regards, apfelmus
participants (2)
-
apfelmus
-
Bryan O'Sullivan