
When using readFile to process a large number of files, I am exceeding the resource limits for the maximum number of open file descriptors on my system. How can I enhance my program to deal with this situation without making significant changes?

My program takes one or more directories that contain email messages, stored one per file, and prints a list of all the email threads. Here is a snippet of output:

    New addition to the Kazmier family
        Casey Kazmier
    Memoization in Erlang?
        Thomas Johnsson
        Ulf Wiger (AL/EAB)

As a newcomer to Haskell, I am intrigued by lazy evaluation and how it can influence one's designs. With that said, I wrote the program as a sequence of list manipulations, which seemed quite natural to do in Haskell, starting with reading the contents of each file. Here is the algorithm at a high level:

1. Read the contents of all files, returning a list of Strings
2. Map over the list and parse each String as an Email
3. Sort the list of Emails
4. Group the list of Emails by Subject
5. Map over the grouped list to create a list of Threads
6. Finally, print the list of Threads

It is my understanding that, as a result of lazy IO, the entire file does not need to be read into memory because parseEmail only inspects the topmost portion of the email (its headers), which is a key part of my design as some of the files can be quite large. Unfortunately, as soon as I run this program on a directory with more than 1024 files, GHC craps out on me due to resource limits. It seems that the handles opened by readFile remain open. Would this be common across all Haskell implementations?

How do I go about fixing this without making a significant number of changes to my program? Did I make a mistake in steps 1 and 2 above? Should I have read and parsed a single file at a time, and then moved on to the next?

I'd appreciate any other comments on the program as well. I feel this is the best example of Haskell code that I have written. Compared to the first version of this program I wrote a few months ago, this is a hundred times better. Here is the program:
module Main where

import Control.Monad (filterM, liftM)
import Data.List
import Data.Maybe
import System.Directory
import System.Environment

type From    = String
type Subject = String
data Email   = Email {from :: From, subject :: Subject} deriving Show
data Thread  = Thread [Email]

instance Show Thread where
    show (Thread emails@(e:es)) = title ++ senders
        where title   = newline . bolder . subject $ e
              sender  = newline . indent . from
              senders = concatMap sender emails
              newline = (++ "\n")
              indent  = (" " ++)
              bolder  = ("\27[0;32;40m" ++) . (++ "\27[0m")

main = getArgs >>=
       mapM fileContentsOfDirectory >>=
       mapM_ print . threadEmails . map parseEmail . concat

fileContentsOfDirectory :: FilePath -> IO [String]
fileContentsOfDirectory dir =
    setCurrentDirectory dir >>
    getDirectoryContents dir >>=
    filterM doesFileExist >>=   -- ignore directories
    mapM readFile

parseEmail :: String -> Email
parseEmail text = Email (getHeader "From") (getHeader "Subject")
    where getHeader = fromMaybe "N/A" . flip lookup headers
          headers   = concatMap mkassoc . takeWhile (/= "") $ lines text
          mkassoc s = case findIndex (== ':') s of
                        Just n  -> [(take n s, drop (n+2) s)]
                        Nothing -> []

threadEmails :: [Email] -> [Thread]
threadEmails = map Thread . groupBy (fuzzy (==)) . sortBy (fuzzy compare)
    where fuzzy fn e e'  = stripReFwd (subject e) `fn` stripReFwd (subject e')
          stripReFwd     = stripSpaces . reverse . stripToColon . reverse
          stripSpaces    = dropWhile (== ' ')
          stripToColon   = takeWhile (/= ':')

pete-expires-20070513:
When using readFile to process a large number of files, I am exceeding the resource limits for the maximum number of open file descriptors on my system. How can I enhance my program to deal with this situation without making significant changes?
Read in data strictly, and there are two obvious ways to do that:

-- Via strings:
readFileStrict f = do
    s <- readFile f
    length s `seq` return s

-- Via ByteStrings
readFileStrict       = Data.ByteString.readFile
readFileStrictString = liftM Data.ByteString.unpack . Data.ByteString.readFile

If you're reading more than, say, 100k of data, I'd use strict ByteStrings without hesitation. More than 10M, and I'd use lazy ByteStrings.

-- Don

dons@cse.unsw.edu.au (Donald Bruce Stewart) writes:
pete-expires-20070513:
When using readFile to process a large number of files, I am exceeding the resource limits for the maximum number of open file descriptors on my system. How can I enhance my program to deal with this situation without making significant changes?
Read in data strictly, and there are two obvious ways to do that:
-- Via strings:
readFileStrict f = do
    s <- readFile f
    length s `seq` return s
-- Via ByteStrings
readFileStrict       = Data.ByteString.readFile
readFileStrictString = liftM Data.ByteString.unpack . Data.ByteString.readFile
If you're reading more than say, 100k of data, I'd use strict ByteStrings without hesitation. More than 10M, and I'd use lazy bytestrings.
Correct me if I'm wrong, but isn't this exactly what I wanted to avoid? Reading the entire file into memory? In my previous email, I was trying to state that I wanted to lazily read the file because some of the files are quite large and there is no reason to read beyond the small set of headers. If I read the entire file into memory, this design goal is no longer met.

Nevertheless, I was benchmarking with ByteStrings (both lazy and strict), and in both cases, the ByteString versions of readFile yield the same error regarding max open files. Incidentally, the lazy bytestring version of my program was by far the fastest and used the least amount of memory, but it still crapped out regarding max open files.

So I'm back to square one. Any other ideas?

Thanks,
Pete

pete-expires-20070513:
dons@cse.unsw.edu.au (Donald Bruce Stewart) writes:
pete-expires-20070513:
When using readFile to process a large number of files, I am exceeding the resource limits for the maximum number of open file descriptors on my system. How can I enhance my program to deal with this situation without making significant changes?
Read in data strictly, and there are two obvious ways to do that:
-- Via strings:
readFileStrict f = do
    s <- readFile f
    length s `seq` return s
-- Via ByteStrings
readFileStrict       = Data.ByteString.readFile
readFileStrictString = liftM Data.ByteString.unpack . Data.ByteString.readFile
If you're reading more than say, 100k of data, I'd use strict ByteStrings without hesitation. More than 10M, and I'd use lazy bytestrings.
Correct me if I'm wrong, but isn't this exactly what I wanted to avoid? Reading the entire file into memory? In my previous email, I was trying to state that I wanted to lazily read the file because some of the files are quite large and there is no reason to read beyond the small set of headers. If I read the entire file into memory, this design goal is no longer met.
Nevertheless, I was benchmarking with ByteStrings (both lazy and strict), and in both cases, the ByteString versions of readFile yield the same error regarding max open files. Incidentally, the lazy bytestring version of my program was by far the fastest and used the least amount of memory, but it still crapped out regarding max open files.
So I'm back to square one. Any other ideas?
Hmm. Ok. So we need to have more hClose's happen somehow. Can you process files one at a time?

-- Don

Quoth Pete Kazmier, nevermore,
the same error regarding max open files. Incidentally, the lazy bytestring version of my program was by far the fastest and used the least amount of memory, but it still crapped out regarding max open files.
I've tried the approach you appear to be using and it can be tricky to predict how the laziness will interact with the list of actions.

For example, I tried to download a temporary file, read a bit of data out of it and then download another one. I thought I would save thinking and use the same file name for each download: /tmp/feed.xml. What happened was that it downloaded them all in rapid succession, over-writing each one with the next and not actually reading the data until the end. So I ended up parsing N identical copies of the final file, instead of one of each.

You need to refactor how you map the functions so that fewer whole lists are passed around. I'd guess that (1) is being executed in its entirety before being passed to (2), but it's not until (2) that the file data is actually used.
main = getArgs >>=
       mapM fileContentsOfDirectory >>=                       -- (1)
       mapM_ print . threadEmails . map parseEmail . concat   -- (2)
This means there are a lot of files sitting open doing nothing. I've had a lot of success by recreating this as:
main = getArgs >>= mapM_ readAndPrint
    where readAndPrint dir = fileContentsOfDirectory dir >>= print -- etc.
It may seem semantically identical, but it sometimes makes a difference to when things actually happen.

-- Dougal Stanton
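Filled out for the program in question, that restructuring might look something like the sketch below. This is my own illustration rather than code from the thread, and it comes with two caveats: within a single directory, mapM readFile still opens handles that are only reclaimed lazily, and it threads each directory separately instead of grouping emails across all directories as the original main did.

    -- Sketch only; reuses fileContentsOfDirectory, parseEmail and
    -- threadEmails from the original program.
    main :: IO ()
    main = getArgs >>= mapM_ readAndPrint
      where
        readAndPrint dir =
            fileContentsOfDirectory dir >>=
            mapM_ print . threadEmails . map parseEmail

The point of the restructuring is that each directory is read, threaded and printed before the next directory is touched, rather than everything being concatenated first.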

pete-expires-20070513:

When using readFile to process a large number of files, I am exceeding the resource limits for the maximum number of open file descriptors

This is very annoying - I can't see any good reason why file descriptors should "run out" (before memory is exhausted). I guess the Linux kernel is intended for imperative use :-/

Donald Bruce Stewart wrote:

Read in data strictly, and there are two obvious ways to do that:
-- Via strings [..]
-- Via ByteStrings [..]

Perhaps this is an esoteric way, but I think the nicest approach is to parse into a strict structure. If you fully evaluate each Email (or whatever structure you parse into), there will be no unevaluated thunks linking to the file, and it will be closed.

If the files are small (e.g. maildir or similar with one email in each?), you can use strict ByteString, but I generally use lazy ByteStrings for just about anything. Be aware that extracting a substring from a ByteString is performed by "slicing", so it keeps a pointer to the original string (along with offset and length). For strict ByteStrings, this would keep everything in memory; for lazy ByteStrings, you'd keep only the relevant chunks (so that would allow the body to be GC'ed, if you aren't interested in keeping it).

(I wonder if the garbage collector could somehow discover strings that have been sliced down a lot, and copy only the relevant parts?)

-k
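To make the slicing point concrete, here is a small sketch of my own (the helper name and the use of breakSubstring are assumptions, not code from this thread): copying the sliced header region breaks the sharing, so the rest of the file's buffer can be garbage collected.

    import qualified Data.ByteString.Char8 as B

    -- Hypothetical helper: keep only the header block of a message.
    -- B.copy allocates a fresh buffer for the slice, so the original
    -- (possibly large) file contents are no longer retained by it.
    extractHeaders :: B.ByteString -> B.ByteString
    extractHeaders = B.copy . fst . B.breakSubstring (B.pack "\n\n")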

Ketil Malde:
Perhaps this is an esoteric way, but I think the nicest approach is to parse into a strict structure. If you fully evaluate each Email (or whatever structure you parse into), there will be no unevaluated thunks linking to the file, and it will be closed.
Not necessarily so, since you are making assumptions about the timeliness of garbage collection. I was similarly sceptical of Claus' suggestion:

Claus Reinke:
in order to keep the overall structure, one could move readFile backwards and parseEmail forwards in the pipeline, until the two meet. then make sure that parseEmail completely constructs the internal representation of each email, thereby keeping no implicit references to the external representation.
So here's a test. I don't have any big maildirs handy, so this is based on the simple exercise of printing the first line of each of a large number of files. First, the preamble.
import Control.Exception (bracket)
import System.Environment
import System.IO
main = do
    t:n:fs <- getArgs
    ([test0,test1,test2,test3] !! read t) (take (read n) $ cycle fs)
The following example corresponds to Pete's original program. As expected, when called with a sufficiently large number of files, it always results in file handle exhaustion without producing any output:
test0 files = mapM readFile files >>= mapM_ (putStrLn.head.lines)
The next example corresponds (I think) to Claus' suggestion, in which the readFile and putStrLn are performed at the same point in the pipeline. I found that sometimes this runs without error, but other times it fails with file handle exhaustion. This seems to depend on the mood of the garbage collector, or at least the external conditions in which the garbage collector operates. It also appears to fail more frequently for small files.

Without any knowledge of garbage collector internals, I'm guessing that this is because readFile reads in 8K chunks. For files significantly smaller than 8K, garbage collection cycles are likely to be much less frequent, and therefore there is greater likelihood of file handle exhaustion between GC cycles.
test1 files = mapM_ doStuff files
  where doStuff f = readFile f >>= putStrLn . head . lines
The third is similar to the second, except it adds strictness annotations to force the file to be read to the end. As expected, this saves me from file handle exhaustion, but it is grossly inefficient for large files.
test2 files = mapM_ doStuff files
  where
    doStuff f = do
        contents <- readFile f
        putStrLn $ head $ lines contents
        return $! force contents
    force (x:xs) = force xs
    force []     = ()
In the fourth example, I explicitly close the file handle. This also saves me from file handle exhaustion, but I must be careful to force everything I need to be read before returning. Returning a lazy computation would be no good, as discovered in [1]. In this case, putStrLn does all the forcing I need.
test3 files = mapM_ bracketStuff files
  where
    bracketStuff f = bracket (openFile f ReadMode) hClose doStuff
    doStuff h      = hGetContents h >>= putStrLn . head . lines
As Oleg points out in [2], all of the above have the problem that it is impossible to tell the difference between a read error and end-of-file. I had intended to write an example using explicitly sequenced I/O, but Oleg has saved me the trouble with the post he made just now [3].

[1] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023189.html
[2] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023073.html
[3] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023523.html

Matthew Brecknell wrote:
Ketil Malde:
Perhaps this is an esoteric way, but I think the nicest approach is to parse into a strict structure. If you fully evaluate each Email (or whatever structure you parse into), there will be no unevaluated thunks linking to the file, and it will be closed.
Not necessarily so, since you are making assumptions about the timeliness of garbage collection.

Good point. I'd have hoped that the RTS would GC on file handle exhaustion, but perhaps this is hard to do?
I think parsing input strictly is good practice anyway, since parsing is often not very compute intensive, and tends to reduce the memory required. At least, that is my experience. -k

Not necessarily so, since you are making assumptions about the timeliness of garbage collection. I was similarly sceptical of Claus' suggestion:
Claus Reinke:
in order to keep the overall structure, one could move readFile backwards and parseEmail forwards in the pipeline, until the two meet. then make sure that parseEmail completely constructs the internal representation of each email, thereby keeping no implicit references to the external representation.
you are quite right to be skeptical!-) indeed, in the latest Handle documentation, we still find the following excuse for GHC:

http://www.haskell.org/ghc/docs/latest/html/libraries/base/System-IO.html#t%...

GHC note: a Handle will be automatically closed when the garbage collector detects that it has become unreferenced by the program. However, relying on this behaviour is not generally recommended: the garbage collector is unpredictable. If possible, use an explicit hClose to close Handles when they are no longer required. GHC does not currently attempt to free up file descriptors when they have run out, it is your responsibility to ensure that this doesn't happen.

this issue has been discussed in the past, and i consider it a bug if the memory manager tells me to handle memory myself;-) so i do hope that this infelicity will be removed in the future (run out of file descriptors -> run a garbage collection and try again, before giving up entirely).

in fact, my local version had two variants of processFile - the one i posted and one with explicit file handle handling (the code was restructured this way exactly to hide this implementation decision in a single function). i did test both variants on a directory with lots of copies of a few emails (>2000 files), and both worked on my system, so i hoped -rather than checked- that the handle collection issue had finally been fixed, and made the mistake of removing the more complex variant before posting. thanks for pointing out that error - as the documentation above demonstrates, it isn't good to rely on assumptions, nor on tests.

so here is the alternate variant of processFile (for which i imported System.IO):
processFile path = do
    f    <- openFile path ReadMode
    text <- hGetContents f
    let email = parseEmail text
    email `seq` hClose f
    return email
all this hassle to expose a file handle to call hClose on, just so that the GC does not have to..

thanks,
claus
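One caveat worth spelling out (my reading of the code above, not part of the original message): email `seq` hClose f only evaluates the Email constructor to weak head normal form, so with the lazy parseEmail from the original program the header fields may still be thunks pointing at the file contents when the handle is closed. A variant that forces the fields themselves before closing might look like this sketch (processFile' is a name I made up; it reuses parseEmail, from and subject from the original program):

    processFile' path = do
        f    <- openFile path ReadMode
        text <- hGetContents f
        let email = parseEmail text
        -- force both header strings completely, not just the constructor,
        -- so no thunk can still demand the file contents after hClose
        length (from email) `seq` length (subject email) `seq` hClose f
        return email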

claus.reinke:
Not necessarily so, since you are making assumptions about the timeliness of garbage collection. I was similarly sceptical of Claus' suggestion:
Claus Reinke:
in order to keep the overall structure, one could move readFile backwards and parseEmail forwards in the pipeline, until the two meet. then make sure that parseEmail completely constructs the internal representation of each email, thereby keeping no implicit references to the external representation.
you are quite right to be skeptical!-) indeed, in the latest Handle documentation, we still find the following excuse for GHC:
http://www.haskell.org/ghc/docs/latest/html/libraries/base/System-IO.html#t%...
GHC note: a Handle will be automatically closed when the garbage collector detects that it has become unreferenced by the program. However, relying on this behaviour is not generally recommended: the garbage collector is unpredictable. If possible, use an explicit hClose to close Handles when they are no longer required. GHC does not currently attempt to free up file descriptors when they have run out, it is your responsibility to ensure that this doesn't happen.

this issue has been discussed in the past, and i consider it a bug if the memory manager tells me to handle memory myself;-) so i do hope that this infelicity will be removed in the future (run out of file descriptors -> run a garbage collection and try again, before giving up entirely).
in fact, my local version had two variants of processFile - the one i posted and one with explicit file handle handling (the code was restructured this way exactly to hide this implementation decision in a single function). i did test both variants on a directory with lots of copies of a few emails (>2000 files), and both worked on my system, so i hoped -rather than checked- that the handle collection issue had finally been fixed, and made the mistake of removing the more complex variant before posting. thanks for pointing out that error - as the documentation above demonstrates, it isn't good to rely on assumptions, nor on tests.
so here is the alternate variant of processFile (for which i imported System.IO):
processFile path = do
    f    <- openFile path ReadMode
    text <- hGetContents f
    let email = parseEmail text
    email `seq` hClose f
    return email
all this hassle to expose a file handle to call hClose on, just so that the GC does not have to..
Are we at the point that we should consider adding some documentation on how to deal with this issue? And are the recommendations either to use strict IO (should we have a package for System.IO.Strict??), or to rely on strictness in the consumer of the data?

-- Don
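For what a System.IO.Strict-style readFile could look like, here is a minimal sketch (mine, assuming nothing beyond the standard System.IO functions; the name readFileNow is just illustrative): read everything, force it, close the handle, then return.

    import System.IO

    readFileNow :: FilePath -> IO String
    readFileNow path = do
        h <- openFile path ReadMode
        s <- hGetContents h
        length s `seq` hClose h   -- force the whole string before closing
        return s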

http://www.haskell.org/ghc/docs/latest/html/libraries/base/System-IO.html#t%...
GHC note: a Handle will be automatically closed when the garbage collector detects that it has become unreferenced by the program. However, relying on this behaviour is not generally recommended: the garbage collector is unpredictable. If possible, use an explicit hClose to close Handles when they are no longer required. GHC does not currently attempt to free up file descriptors when they have run out, it is your responsibility to ensure that this doesn't happen.
this issue has been discussed in the past, and i consider it a bug if the memory manager tells me to handle memory myself;-) so i do hope that this infelicity will be removed in the future (run out of file descriptors -> run a garbage collection and try again, before giving up entirely).
Are we at the point that we should consider adding some documentation on how to deal with this issue? And are the recommendations either to use strict IO (should we have a package for System.IO.Strict??), or to rely on strictness in the consumer of the data?
i'm all for having a readFileNow, right next to readFile. apart from that, it might be sufficient to mention explicitly, in the lazy i/o docs, that

- lazy i/o and strict i/o are separate approaches to i/o
- lazy i/o is more abstract, strict i/o gives better control of resources
- mixing lazy and strict i/o is to be approached with special attention, because the strict i/o exposes features that are assumed to be hidden when using lazy i/o

(should it be asynchronous vs synchronous i/o, instead of lazy vs strict?)

as for the specific issue at hand: i've seen software with thick folders of well-written manuals explaining all the intricacies of using said software. and i've seen software which was so obvious to use that it needed hardly any printed manuals. guess which one i prefer?-)

in good old Hugs, for instance, we find in function newHandle in src/iomonad.c

http://cvs.haskell.org/cgi-bin/cvsweb.cgi/hugs98/src/iomonad.c?rev=1.104;con...

    /* return a free Handle or throw an IOError */
    /* Search for unused handle                 */
    /* If at first we don't                     */
    /*   succeed, garbage collect               */
    /*   and try again ...                      */
    /* ... before we give up                    */

so, instead of documenting limitations and workarounds, this issue should be fixed in GHC as well.

in the meantime, the existing documentation of the GHC issue with handles is not easy to notice because readFile does not even mention handles, and their docs are in System.IO, not in Prelude. readFile refers to getContents, which refers to hGetContents stdin, which explains when handles are semi-closed and closed, but doesn't mention the implications discussed in the Handle docs.

my suggestion would be that all operations that might leak handles simply ought to have their docs include a direct link to the Handle docs, as in "see notes on possible file handle leakage". perhaps the Handle docs are also the right place for the notes on lazy vs strict i/o, with appropriate links to that section ("see notes on lazy vs strict i/o")?

claus

"Matthew Brecknell"
So here's a test. I don't have any big maildirs handy, so this is based on the simple exercise of printing the first line of each of a large number of files. First, the preamble.
import Control.Exception (bracket)
import System.Environment
import System.IO
main = do
    t:n:fs <- getArgs
    ([test0,test1,test2,test3] !! read t) (take (read n) $ cycle fs)
[snip]
Thank you for summarizing the approaches presented by others. As a Haskell newbie, there seem to be quite a few esoteric concepts to conquer. Your concrete examples were helpful in my understanding of the ramifications associated with the various approaches.

After reading the various threads you cited, I decided to avoid lazy IO altogether. By using 'readFile' without forcing the strict evaluation of my parser, I inadvertently relinquished control of the resource management--closing of the file handles was left to the GC. And although I could have used 'seq' to address the issue, why bother fixing a problem that could have been avoided altogether by using strict IO. With that said, I added the following function to my program and then replaced the invocation of 'readFile' with it:

    readEmailHeaders :: FilePath -> IO String
    readEmailHeaders file =
        bracket (openFile file ReadMode) hClose (headers [])
        where
          headers acc h = do
              line <- hGetLine h
              case line of
                -- Stop reading file once we hit the empty separator
                -- line, no need to read the rest of the file (body).
                "" -> return . concat . reverse $ acc
                _  -> headers ("\n":line:acc) h

I'm not sure if this is the best implementation, but the speed is comparable to the lazy IO version without the annoying defect of running out of file handles. I also tried an implementation using 'hGetChar' but that was much slower.

I attempted to read Oleg's fold-stream implementation [1] as this sounds quite appealing to me, but I was completely overwhelmed, especially with all of the various type signatures used. It would be great if one of the regular Haskell bloggers (Tom Moertel are you reading this?) might write a blog entry or two interpreting his implementation for those of us starting out in Haskell, perhaps by starting out with a non-polymorphic version so as to emphasize the approach.

Thanks,
Pete

[1] http://okmij.org/ftp/Haskell/fold-stream.lhs

Pete Kazmier:
I attempted to read Oleg's fold-stream implementation [1] as this sounds quite appealing to me, but I was completely overwhelmed, especially with all of the various type signatures used. It would be great if one of the regular Haskell bloggers (Tom Moertel are you reading this?) might write a blog entry or two interpreting his implementation for those of us starting out in Haskell perhaps by starting out with a non-polymorphic version so as to emphasize the approach.
The basic idea of the paper is the use of a left-fold operator as the primary interface for enumerating collections. The recursive version (less general than the non-recursive version) of a left-fold operator for enumerating the lines of a text file might look something like this:
import Control.Monad.Fix
import Control.Exception
import Data.List
import qualified Data.Set as S
import qualified Data.Map as M
import System.IO
enumLines :: (a -> String -> Either a a) -> a -> FilePath -> IO a
enumLines iter accum filename = do
    h <- openFile filename ReadMode
    flip fix accum $ \iterate accum -> do
        try_line <- try (hGetLine h)
        case try_line of
            Left e -> hClose h >> return accum
            Right line -> do
                case iter accum line of
                    Left accum  -> hClose h >> return accum
                    Right accum -> iterate accum
It needs better exception handling, including a mechanism to pass a non-EOF exception back to the caller without losing the value accumulated up to the point of the exception. Perhaps it might also be useful to generalise the first parameter to an IO action. To use this, you provide an "iteratee", a function which takes an accumulator and a line from the file, and returns a new accumulator embedded in an Either. Using the Left branch causes immediate termination of the enumeration. For example, to search for the first occurrence of each of a set of email headers:
getHeaders :: S.Set String -> FilePath -> IO (S.Set String, M.Map String String)
getHeaders hdrs file = enumLines findHdrs (hdrs,M.empty) file
  where
    findHdrs accum@(wanted,found) line =
        if null line then Left accum else
        case headerLine line of
            Nothing  -> Right accum
            Just hdr ->
                case findDelete hdr wanted of
                    Nothing     -> Right accum
                    Just wanted ->
                        let accum = (wanted, M.insert hdr line found)
                        in if S.null wanted then Left accum else Right accum
headerLine :: String -> Maybe String
headerLine (':':xs) = Just []
headerLine (x:xs)   = fmap (x:) (headerLine xs)
headerLine []       = Nothing
findDelete :: Ord a => a -> S.Set a -> Maybe (S.Set a)
findDelete e s = if S.member e s then Just (S.delete e s) else Nothing
It's a bit of a case-analysis nightmare, but when comparing this to previous approaches, note that file traversal and processing are cleanly separated, file handle closure is guaranteed to be timely, file traversal stops as soon as all the required headers have been found, and memory usage is minimised. I hope that helps.
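As a usage illustration (my own sketch; the file name and header set are made up), the enumeration terminates as soon as both headers have been seen:

    demo :: IO ()
    demo = do
        (missing, found) <- getHeaders (S.fromList ["From", "Subject"]) "mail/message-1"
        mapM_ putStrLn (M.elems found)  -- the header lines that were found
        print (S.toList missing)        -- any wanted headers never seen before the body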

"Matthew Brecknell"
Pete Kazmier:
I attempted to read Oleg's fold-stream implementation [1] as this sounds quite appealing to me, but I was completely overwhelmed, especially with all of the various type signatures used. It would be great if one of the regular Haskell bloggers (Tom Moertel are you reading this?) might write a blog entry or two interpreting his implementation for those of us starting out in Haskell perhaps by starting out with a non-polymorphic version so as to emphasize the approach.
The basic idea of the paper is the use of a left-fold operator as the primary interface for enumerating collections. The recursive version (less general than the non-recursive version) of a left-fold operator for enumerating the lines of a text file might look something like this:
import Control.Monad.Fix
import Control.Exception
import Data.List
import qualified Data.Set as S
import qualified Data.Map as M
import System.IO
enumLines :: (a -> String -> Either a a) -> a -> FilePath -> IO a
enumLines iter accum filename = do
    h <- openFile filename ReadMode
    flip fix accum $ \iterate accum -> do
        try_line <- try (hGetLine h)
        case try_line of
            Left e -> hClose h >> return accum
            Right line -> do
                case iter accum line of
                    Left accum  -> hClose h >> return accum
                    Right accum -> iterate accum
I understand the intent of this code, but I am having a hard time understanding the implementation, specifically the combination of 'fix', 'flip', and 'iterate'. I looked up 'fix' and I'm unsure how one can call 'flip' on a function that takes one argument.
To use this, you provide an "iteratee", a function which takes an accumulator and a line from the file, and returns a new accumulator embedded in an Either. Using the Left branch causes immediate termination of the enumeration. For example, to search for the first occurrence of each of a set of email headers:
getHeaders :: S.Set String -> FilePath -> IO (S.Set String, M.Map String String)
getHeaders hdrs file = enumLines findHdrs (hdrs,M.empty) file
  where
    findHdrs accum@(wanted,found) line =
        if null line then Left accum else
        case headerLine line of
            Nothing  -> Right accum
            Just hdr ->
                case findDelete hdr wanted of
                    Nothing     -> Right accum
                    Just wanted ->
                        let accum = (wanted, M.insert hdr line found)
                        in if S.null wanted then Left accum else Right accum
headerLine :: String -> Maybe String
headerLine (':':xs) = Just []
headerLine (x:xs)   = fmap (x:) (headerLine xs)
headerLine []       = Nothing
findDelete :: Ord a => a -> S.Set a -> Maybe (S.Set a)
findDelete e s = if S.member e s then Just (S.delete e s) else Nothing
It's a bit of a case-analysis nightmare, but when comparing this to previous approaches, note that file traversal and processing are cleanly separated, file handle closure is guaranteed to be timely, file traversal stops as soon as all the required headers have been found, memory usage is minimised.
Very nice. I like the clean separation, but as you say, it's one ugly bit of code compared to my original code, although much more elegant, no doubt.
I hope that helps.
Very much so. Thank you for your help.

Pete Kazmier wrote:
I understand the intent of this code, but I am having a hard time understanding the implementation, specifically the combination of 'fix', 'flip', and 'iterate'. I looked up 'fix' and I'm unsure how one can call 'flip' on a function that takes one argument.
If you look at the code, that's not really what's happening. See the embedded anonymous function below?
flip fix accum $ \iterate accum -> do ...
It's a function of two arguments. All "flip" is doing is switching the order of the arguments to "fix", in this case for readability. If you were to get rid of the "flip", you'd need to remove the "accum" after "fix" and move it after the lambda expression, which would make the expression much uglier to write and read. So all the "flip" is doing here is tidying up the code.

(If you're still confused, look at the difference between forM and mapM. The only reason forM exists is readability when you have - in terms of the amount of screen space they consume - a big function and a small piece of data, just as here.)

As to why it's okay to call "flip" on "fix" at all, look at the types involved.

    fix  :: (a -> a) -> a
    flip :: (a -> b -> c) -> b -> a -> c

By substitution:

    flip fix :: a -> ((a -> b) -> a -> b) -> b

In the case above, accum has type a, and the lambda has type (a -> IO a) -> a -> IO a, and these fit nicely into the type expected by "flip fix".
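A small self-contained illustration of that specialisation (my example, not Bryan's; it only needs fix from Control.Monad.Fix): the accumulator goes first, the anonymous recursive body second, exactly as in the enumLines loop.

    import Control.Monad.Fix (fix)

    sumTo :: Int -> Int
    sumTo n0 = flip fix (0, n0) $ \loop (acc, n) ->
        if n == 0 then acc else loop (acc + n, n - 1)

    -- sumTo 5 == 15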

Bryan O'Sullivan
Pete Kazmier wrote:
I understand the intent of this code, but I am having a hard time understanding the implementation, specifically the combination of 'fix', 'flip', and 'iterate'. I looked up 'fix' and I'm unsure how one can call 'flip' on a function that takes one argument.
As to why it's okay to call "flip" on "fix" at all, look at the types involved.
fix  :: (a -> a) -> a
flip :: (a -> b -> c) -> b -> a -> c
By substitution:
flip fix :: a -> ((a -> b) -> a -> b) -> b
Sadly, I'm still confused. I understand how 'flip' works in the case where its argument is a function that takes two arguments. I've started to use this in my own code lately. But my brain refuses to understand how 'flip' is applied to 'fix', a function that takes one argument only, which happens to be a function itself. What is 'flip' flipping when the function passed to it only takes one argument? Thanks, Pete

Here's what happens: fix has type

    (x -> x) -> x

and that has to match the first argument to flip, namely 'a -> b -> c'. The only chance of that is if x is actually a function type. Pick x = b -> c; now we have fix has type

    ((b -> c) -> b -> c) -> b -> c

and it matches a -> b -> c if

    a = (b -> c) -> b -> c

Flip returns b -> a -> c, and if we substitute we get

    b -> ((b -> c) -> b -> c) -> c

If you rename the variables you get the suggested type.

-- Lennart

On Mar 19, 2007, at 20:35, Pete Kazmier wrote:
Bryan O'Sullivan writes:

Pete Kazmier wrote:
I understand the intent of this code, but I am having a hard time understanding the implementation, specifically the combination of 'fix', 'flip', and 'iterate'. I looked up 'fix' and I'm unsure how one can call 'flip' on a function that takes one argument.
As to why it's okay to call "flip" on "fix" at all, look at the types involved.
fix  :: (a -> a) -> a
flip :: (a -> b -> c) -> b -> a -> c
By substitution:
flip fix :: a -> ((a -> b) -> a -> b) -> b
Sadly, I'm still confused. I understand how 'flip' works in the case where its argument is a function that takes two arguments. I've started to use this in my own code lately. But my brain refuses to understand how 'flip' is applied to 'fix', a function that takes one argument only, which happens to be a function itself. What is 'flip' flipping when the function passed to it only takes one argument?
Thanks, Pete

Pete Kazmier wrote:
Bryan O'Sullivan writes:

Pete Kazmier wrote:
I understand the intent of this code, but I am having a hard time understanding the implementation, specifically the combination of 'fix', 'flip', and 'iterate'. I looked up 'fix' and I'm unsure how one can call 'flip' on a function that takes one argument.
As to why it's okay to call "flip" on "fix" at all, look at the types involved.
fix  :: (a -> a) -> a
flip :: (a -> b -> c) -> b -> a -> c
By substitution:
flip fix :: a -> ((a -> b) -> a -> b) -> b
Sadly, I'm still confused. I understand how 'flip' works in the case where its argument is a function that takes two arguments. I've started to use this in my own code lately. But my brain refuses to understand how 'flip' is applied to 'fix', a function that takes one argument only, which happens to be a function itself. What is 'flip' flipping when the function passed to it only takes one argument?
    fix :: (a -> a) -> a

In this case, we know something about 'a': it is a function (b -> c). Substitute:

    fix :: ((b -> c) -> (b -> c)) -> (b -> c)

Take advantage of the right-associativity of (->):

    fix :: ((b -> c) -> b -> c) -> b -> c

Now it looks like a function of two arguments, because the return value (normally ordinary data) can in fact, in this case, take arguments. Here's another example of that:

    data Box a = Box a
    get (Box a) = a

    -- get (Box 1)                :: Int
    -- get (Box (\a -> a))        :: Int -> Int
    -- (get (Box (\a -> a))) 1    :: Int
    -- function application is left-associative:
    -- get (Box (\a -> a)) 1      :: Int
    -- flip get 1 (Box (\a -> a)) :: Int

Yes, it sometimes confuses me too.

Isaac

Pete Kazmier:
I understand the intent of this code, but I am having a hard time understanding the implementation, specifically the combination of 'fix', 'flip', and 'iterate'. I looked up 'fix' and I'm unsure how one can call 'flip' on a function that takes one argument.
I threw that in there because I figured you were up for another challenge. :-)

It took me ages to get some clue about how to use fix, quite apart from combining it with flip. The concept of passing the output of a function as one of its parameters ("tying the knot") can be difficult to accept, particularly if you haven't studied lambda calculus.

Note that I could have just written this:
let iterate a = do
      ...
      iterate a'
      ...
iterate accum
Or this:
fix iterate accum
  where iterate a = do
          ...
          iterate a'
          ...
Though with the latter, I presume you would still be confused about how I can pass two arguments to a function that only takes one. Actually, that's not that difficult. Say I have a function "f" that takes two arguments. Then I could write:
(id f) a b
No problem. But function application associates to the left (at least in value-land), so I could just as easily write:
id f a b
You could say I was passing three arguments to id, which only takes one argument. But id returns its first argument, so I'm really just passing the last two arguments to the function returned by id.

So with my use of "flip fix", I'm really just calling fix on the anonymous function (\iterate accum -> ...), and then the parameter ("accum") is passed to the function returned by fix. So now you just need a couple of weeks (or months if you're as slow as me) to understand what fix is all about... :-)

There is the question of whether it's preferable to use the "let" form or the "fix" form for embedding a recursive function in the middle of a do-block. I don't know if there's any consensus on this question, but it seems to me to be about whether one prefers to read a function top-down or bottom-up. I think I'm about 80/20 top-down/bottom-up. When I read a "let", I know (due to laziness) that it doesn't have any effect until the bindings are used, so I usually find myself scanning forward to find those uses. When I read "fix \f -> ...", I see exactly how the (anonymous) function is used, just before I get to its definition. So fix helps me to see things in a mostly top-down fashion.

A couple of times I have wished that the libraries contained "pre-flipped" versions of fix, for example:
fix1 a f     = fix f a
fix2 a b f   = fix f a b
fix3 a b c f = fix f a b c
Any opinions on whether this would be a worthwhile addition? Or would it just legitimise an obscure idiom?
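For what it's worth, a hypothetical use of the proposed fix1 might read like this (the countdown body is my own example, not from the thread):

    countdown :: Int -> IO ()
    countdown n0 = fix1 n0 $ \loop n ->
        if n <= 0 then putStrLn "done"
                  else print n >> loop (n - 1)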

On the topic of 'fix', is there any good tutorial for fix? I searched google, but mostly came up with pages including things like 'bug fix'. It's hard for me to get an intuition about it when 'fix' always stack overflows on me because I don't really know how to use it.

Bryan Burgers:
On the topic of 'fix', is there any good tutorial for fix? I searched google, but mostly came up with pages including things like 'bug fix'. It's hard for me to get an intuition about it when 'fix' always stack overflows on me because I don't really know how to use it.
I don't know of any tutorials that teach how to use fix, but these are some of the pages that helped me on the way towards understanding what it is:

http://en.wikipedia.org/wiki/Fixed_point_combinator
http://en.wikipedia.org/wiki/Anonymous_recursion

In Haskell, it might help to note the equivalence between a and a', b and b', etc, in the following (given appropriate definitions for p, q, r, s, t, etc):
a  = fix (\f -> if t then f else r)
a' = let f = (if t then f else r) in f

b  = fix (\f x -> if t' x then f (s' x) else r' x) p
b' = let f x = (if t' x then f (s' x) else r' x) in f p

c  = fix (\f x y -> if t'' x y then uncurry f (s'' x y) else r'' x y) p q
c' = let f x y = (if t'' x y then uncurry f (s'' x y) else r'' x y) in f p q
etc. The first case is not of much practical use, since each iteration of f must produce the same result, so it must either return immediately (if it doesn't recurse into f) or never (if it does). The other cases can be useful, since the additional arguments can be used by the implementation of f to decide whether or not to recurse.

When writing an anonymous function inside an invocation of fix, the main thing to realise is that the first argument of that anonymous function is the anonymous function itself. You can use that argument to make a recursive call to the anonymous function. So you could say it's not really anonymous after all. Of course, you eventually have to return without recursing if you want to avoid an infinite loop.
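A concrete first example may help here (this one is mine, not from the message): the classic factorial written with fix, where the lambda's first argument is the function itself.

    import Data.Function (fix)

    factorial :: Integer -> Integer
    factorial = fix (\rec n -> if n <= 1 then 1 else n * rec (n - 1))

    -- factorial 5 == 120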

Bryan Burgers
On the topic of 'fix', is there any good tutorial for fix? I searched google, but mostly came up with pages including things like 'bug fix'. It's hard for me to get an intuition about it when 'fix' always stack overflows on me because I don't really know how to use it.
Perhaps try:

Bruce J. McAdam. 1997. That about wraps it up: Using FIX to handle errors without exceptions, and other programming tricks. Tech. Rep. ECS-LFCS-97-375, Laboratory for Foundations of Computer Science, Department of Computer Science, University of Edinburgh.
http://www.lfcs.informatics.ed.ac.uk/reports/97/ECS-LFCS-97-375/

On 20/03/07, Bryan Burgers
On the topic of 'fix', is there any good tutorial for fix?
I quite liked this one: http://dreamsongs.org/Files/WhyOfY.pdf Alistair

Matthew Brecknell wrote:
There is the question of whether it's preferable to use the "let" form or the "fix" form for embedding a recursive function in the middle of a do-block. I don't know if there's any consensus on this question, but it seems to me to be about whether one prefers to read a function top-down or bottom-up. I think I'm about 80/20 top-down/bottom-up. When I read a "let", I know (due to laziness) that it doesn't have any effect until the bindings are used, so I usually find myself scanning forward to find those uses. When I read "fix \f -> ...", I see exactly how the (anonymous) function is used, just before I get to its definition. So fix helps me to see things in a mostly top-down fashion.
If you're merely talking about top-down or bottom-up then there is also 'where' rather than 'let'. So the question becomes:

1. recursive definition first (let)
2. recursive definition last (where)
3. recursion inline (explicit use of fix)

1+2 have the virtue of separating the local recursive function from the 'main' function. On the other hand, 3 has the virtue of keeping them together :) Which you prefer is quite subjective, and context dependent. They potentially have slightly different scoping implications, too.

A lot of discussion about haskell code revolves around whether or not a given construction is 'clear'; this has something to do with haskell's almost unparalleled ability to abstract, I suppose. Ultimately, something is clear once you are used to the abstraction, but obfuscated if you're not, so it becomes rather subjective.

Jules

Jules Bean:
If you're merely talking about top-down or bottom-up then there is also 'where' rather than 'let'.
Yes, I admit I tend to prefer "where" over "let", all else being equal. But my main concern was embedding recursive functions in do-blocks, particularly monadic loops that aren't amenable to the usual combinators. Bindings made in a do block aren't visible in a "where" clause, so pushing the definition to the where clause would typically require additional arguments to be passed. I'd rather not do that, so I'm left choosing between "let" and "fix".
A lot of discussion about haskell code revolves around whether or not a given construction is 'clear'; this has something to do with haskell's almost unparallelled ability to abstract, I suppose. Ultimately, something is clear once you are used to the abstraction, but obfuscated if you're not, so it becomes rather subjective.
I'd agree with that, so perhaps I should have instead asked: Are there other ways, besides "fix" and "let", to write monadic loops that won't easily submit to forM and friends?

I already see one suggestion from Claus [1], who seems determined that lazy I/O should have the last laugh :-), that is, extract the loop structure to a new combinator, leaving the loop body inline.

[1] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023686.html
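One possible shape for such a combinator, sketched here as my own reading of that suggestion rather than Claus's actual code: the Either-driven loop structure becomes reusable, and only the body stays inline at the call site.

    loopM :: Monad m => (a -> m (Either b a)) -> a -> m b
    loopM body = go
      where
        -- Left ends the loop with a result; Right feeds the next iteration.
        go a = body a >>= either return go

The recursive do-block loops earlier in the thread could then be expressed by returning Left to stop and Right to continue, with no explicit fix or let at the call site.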

"Matthew Brecknell"
Pete Kazmier:
I understand the intent of this code, but I am having a hard time understanding the implementation, specifically the combination of 'fix', 'flip', and 'iterate'. I looked up 'fix' and I'm unsure how one can call 'flip' on a function that takes one argument.
I threw that in there because I figured you were up for another challenge. :-)
Haskell has a way of making one feel dumb. This is by far the most challenging programming language I've ever used.
It took me ages to get some clue about how to use fix, quite apart from combining it with flip. The concept of passing the output of a function as one of its parameters ("tying the knot") can be difficult to accept, particularly if you haven't studied lambda calculus.
This is a bit mind boggling. Do you have any trivial examples that may help my understanding?
Note that I could have just written this:
let iterate a = do
      ...
      iterate a'
      ...
iterate accum
In the meantime, I'm more than happy to claim ignorance and stick with the above version which is very accessible to us mere mortals.
So with my use of "flip fix", I'm really just calling fix on the anonymous function (\iterate accum -> ...), and then the parameter ("accum") is passed to the function returned by fix. So now you just need a couple of weeks (or months if you're as slow as me) to understand what fix is all about... :-)
I won't try to understand fix just yet, but I'm still confused by the type of fix:

    fix :: (a -> a) -> a

It appears to me that it takes a function as an argument, and that function takes a single argument. So how are you passing fix an anonymous function taking 2 arguments? Sorry if I have beaten this horse to death, but my pea-sized brain is working overtime here.

Thanks for all of the help.

Pete

Pete Kazmier:
Haskell has a way of making one feel dumb. This is by far the most challenging programming language I've ever used.
It (or perhaps the community around it) does have a way of making you realise that the rabbit-hole really is very deep. But that's no reason to feel dumb.
I won't try to understand fix just yet, but I'm still confused by the type of fix:
fix :: (a -> a) -> a
It appears to me that it takes a function as an argument, and that function takes a single argument. So how are you passing fix an anonymous function taking 2 arguments? Sorry if I have beaten this horse to death, but my pea-sized brain is working overtime here.
As others have pointed out, fix is polymorphic, so "a" can stand for any type, including "(b -> c)". Removing redundant parentheses, this means fix can directly specialise to:
fix :: ((b -> c) -> b -> c) -> b -> c
It may also help to think some more about the concepts of "currying" and "partial application". Between the Haskell wiki and Wikipedia, you should find some resources for that. See also my "id" example in a previous post in this thread. That does the same thing, but without fix there to confuse the issue.

"Matthew Brecknell"
As others have pointed out, fix is polymorphic, so "a" can stand for any type, including "(b -> c)". Removing redundant parentheses, this means fix can directly specialise to:
fix :: ((b -> c) -> b -> c) -> b -> c
I understand now. I think part of my problem was that I was trying to grasp one too many new things all at once. This makes perfect sense. Thanks, Pete

But in fact it seems to me that the type variable "a" not only can, but must unify with "b->c".

Is there any use of fix for which this is not true? If this is true, is the type "a" instead of "b->c" because it is not possible in general for the type checker to verify this fact, making it some kind of underivable true statement?

If it is not true, I would dearly love to see a use of fix with a type for which functional application is not defined.

For me, it is this aspect (the type of fix) that has made it so much harder to understand fix than it should have been.

Dan

Pete Kazmier wrote:
"Matthew Brecknell"
writes: As others have pointed out, fix is polymorphic, so "a" can stand for any type, including "(b -> c)". Removing redundant parentheses, this means fix can directly specialise to:
fix :: ((b -> c) -> b -> c) -> b -> c
I understand now. I think part of my problem was that I was trying to grasp one too many new things all at once. This makes perfect sense.
Thanks, Pete

Assuming 1 :: Int, then:

    ones = 1 : ones

is equivalent to:

    ones = fix (\ones -> 1:ones)

where fix has type ([Int] -> [Int]) -> [Int].

It's also the case that:

    inf = 1+inf

is equivalent to:

    inf = fix (\inf -> 1+inf)

where fix has type (Int -> Int) -> Int. Unfortunately (perhaps), the fixed point returned is _|_, since it is the LEAST solution to the recursive equation.

-Paul

Dan Weston wrote:
But in fact it seems to me that the type variable "a" not only can, but must unify with "b->c".
Is there any use of fix for which this is not true? If this is true, is the type "a" instead of "b->c" because it is not possible in general for the type checker to verify this fact, making it some kind of underivable true statement?
If it is not true, I would dearly love to see a use of fix with a type for which functional application is not defined.
For me, it is this aspect (the type of fix) that has made it so much harder to understand fix than it should have been.
Dan

In effect, this is a demonstration that Haskell supports recursive values and not just recursive functions. If the a in

    fix :: (a -> a) -> a

were to be unified always with a function type, then that would imply that the language only supported recursive definitions for functions, which would be a bit unfortunate for the co-coders in the community.

It's good to note that simple languages with a strict evaluation scheme are limited to

    fix :: ((a -> b) -> (a -> b)) -> (a -> b)

because functions are the only things that can delay evaluation.
On 3/20/07, Paul Hudak
Assuming 1 :: Int, then: ones = 1 : ones is equivalent to: ones = fix (\ones -> 1:ones) where fix has type ([Int] -> [Int]) -> [Int].
It's also the case that: inf = 1+inf is equivalent to: inf = fix (\inf -> 1+inf) where fix has type (Int -> Int) -> Int. Unfortunately (perhaps), the fixed point returned is _|_, since it is the LEAST solution to the recursive equation.
-Paul
Dan Weston wrote:
But in fact it seems to me that the type variable "a" not only can, but must unify with "b->c".
Is there any use of fix for which this is not true? If this is true, is the type "a" instead of "b->c" because it is not possible in general for the type checker to verify this fact, making it some kind of underivable true statement?
If it is not true, I would dearly love to see a use of fix with a type for which functional application is not defined.
For me, it is this aspect (the type of fix) that has made it so much harder to understand fix than it should have been.
Dan

Haskell has a way of making one feel dumb. This is by far the most challenging programming language I've ever used.
i wouldn't see either as a particularly positive attribute. a language should enable, explain, encourage, and perhaps inspire. if it also leaves you feeling that you could do more and go further should you ever want or need to, that's okay, too. but (unless designed for such purpose;) a language should not be a puzzle, nor an obstacle, nor should its users feel judged or needlessly challenged.

http://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.html

of course, all of that also depends not only on the language, but on how it is used, presented, and perceived. haskell allows for a wide variety of approaches to programming, allowing programmers to be productive in plain old functional programming, or perhaps seemingly less so in advanced pointless type-level meta-programming. that doesn't mean that either approach is better or worse than the other, or that one approach fits all problems, or programmers. even one's measures of what constitutes simple code can evolve over time.

so, by picking those techniques that work for you, and keeping an eye open for those that you're not (yet) comfortable with, you seem to be on the right track. perhaps other techniques will make more sense to you later, or you'll find the need to overcome limitations in the techniques you use now (which is how most of the more complicated-looking themes have developed from easier-looking ones in the past), perhaps not. understanding more advanced haskell tricks can be fun, but for most programmers, the point is not (just:) to engage in intellectual games and challenges for their own sake, but in exploring what their current understanding of haskell does or does not allow them to do.
I won't try to understand fix just yet, but I'm still confused by the type of fix:
fix :: (a -> a) -> a
It appears to me that it takes a function as an argument, and that function takes a single argument. So how are you passing fix an anonymous function taking 2 arguments? Sorry if I have beaten this horse to death, but my pea-sized brain is working overtime here.
fix takes a function as an argument, and that function takes a single argument. that function also returns something of the same type as its single argument. which is where things get more interesting/complicated.

but first: lets give the argument of fix a name and its type:

    f :: a -> a

so fix takes f and produces an a, but where can that a come from? that a could be the result of f, but then we'd need something of type a to put into f. so, once again, we need an a. where could this a come from (i'm not a language, so i feel free to pose challenging puzzles;-)?

hint: the relevant equation for fix is

    fix f = f (fix f)    -- (1)

read from right to left, we can see that f applied to (fix f) returns (fix f), which is why (fix f) is called a fixed point of f (and fix the fixed-point combinator).

read from left to right, we can see how fix accomplishes the feat of computing a fixed point for any function f :: a -> a we might pass in as a parameter - it applies f to the fixed point yet to be computed, which unfolds to:

    fix f = f (f (..... f (fix f)..)..)

so, f gets applied to a copy of itself, applied to a copy of itself, applied to .. in other words, fix f recursively creates a supply of copies of f, which f can make use of in its definition. lets say f = \self-> .. self .., then

    fix (\self-> .. self ..)
      = (\self-> .. self ..) (fix (\self-> .. self ..))          -- by (1)
      = .. (fix (\self-> .. self ..)) ..                         -- reduce and substitute for self
      = .. ((\self-> .. self ..) (fix (\self-> .. self ..))) ..  -- by (1)

so, fix f supplies f with copies to use for recursive calls within its body, and that a type is really the type of the body of f, as well as that of the self-reference.

which brings us to that extra challenge of extra parameters. so far, f is a function only so that its body can take in a self-reference, and fix f corresponds to a recursive variable definition. if we want to define a recursive function instead, we need more parameters. lets say f = \self-> \x-> .. (self x) .., then

    fix (\self->\x-> .. (self x) ..)
      = (\self->\x-> .. (self x) ..) (fix (\self->\x-> .. (self x) ..))
      = \x->.. ((fix (\self->\x-> .. (self x)..)) x) ..
      = \x->.. (((\self->\x-> .. (self x)..) (fix (\self->\x-> .. (self x) ..))) x)..

(it really helps to perform the reductions yourself, rather than read them, btw)

and what has happened to the type of f? f takes a self-reference, and returns a function of one argument, so

    f :: ?? -> (b->c)

and the type of the self-reference is the type of f's body, so

    f :: (b->c) -> (b->c)    or    f :: (b->c) -> b -> c

from which we can see that the type of fix gets instantiated to

    fix :: ((b -> c) -> (b -> c)) -> (b -> c)    or    fix :: ((b -> c) -> (b -> c)) -> b -> c

and suddenly, fix does have two parameters, which flip can flip!-)

no magic, just technology sufficiently advanced to be indistinguishable from it: a function of one parameter, which returns a function of one parameter, is a function of more than one parameter.

at which point this particular fixed-point combinator puts its recursive unfoldings to rest for tonight.

hth,
claus

"Claus Reinke"
I won't try to understand fix just yet, but I'm still confused by the type of fix: fix :: (a -> a) -> a It appears to me that it takes a function as an argument, and that function takes a single argument. So how are you passing fix an anonymous function taking 2 arguments? Sorry if I have beaten this horse to death, but my pea-sized brain is working overtime here.
fix takes a function as an argument, and that function takes a single argument. that function also returns something of the same type as its single argument.
[snip]
and suddenly, fix does have two parameters, which flip can flip!-)
no magic, just technology sufficiently advanced to be indistinguishable from it: a function of one parameter, which returns a function of one parameter, is a function of more than one parameter.
at which point this particular fixed-point combinator puts its recursive unfoldings to rest for tonight.
Claus, Thank you for the detailed explanation. I think I understand now! To be sure, I'll reread your post several times over the next few days. Thanks again, this was very helpful.

[left-fold operator for enumerating the lines of a text file] ..
enumLines :: (a -> String -> Either a a) -> a -> FilePath -> IO a
enumLines iter accum filename = do
    h <- openFile filename ReadMode
    flip fix accum $ \iterate accum -> do
      try_line <- try (hGetLine h)
      case try_line of
        Left e -> hClose h >> return accum
        Right line -> do
          case iter accum line of
            Left accum -> hClose h >> return accum
            Right accum -> iterate accum

..

getHeaders :: S.Set String -> FilePath -> IO (S.Set String, M.Map String String)
getHeaders hdrs file = enumLines findHdrs (hdrs,M.empty) file
  where
we can keep the Left/Right implicit, using either:

getHeaders1 hdrs file = enumLines findHdrs (hdrs,M.empty) file
  where
  enumLines iter accum filename = do
    h <- openFile filename ReadMode
    flip fix accum $ \iterate accum ->
      join $ (fmap (either (const $ hClose h >> return accum)
                           (either ((hClose h >>) . return) iterate . iter accum)))
             (try (hGetLine h))

or extract the reusable loop-with-exit-by-either functionality:

loopME m stop continue acc =
  m >>= either (stop acc) (continue (loopME m stop continue) acc)

getHeaders2 hdrs file = enumLines findHdrs (hdrs,M.empty) file
  where
  enumLines iter accum f = do
    h <- openFile f ReadMode
    loopME (try (hGetLine h))
           (\acc left -> hClose h >> return acc)
           (\loop acc right -> either ((hClose h >>) . return) loop (iter acc right))
           accum

or sneak some lazy i/o back in, using a fold-with-exit-by-either, similar to loopME:

withFile path m       = bracket (openFile path ReadMode) hClose m
withContentsOf path f = withFile path ((((return $!) . f) =<<) . hGetContents)
withLinesOf path f    = withContentsOf path (f . lines)

foldE f a []     = a
foldE f a (x:xs) = either id (\a' -> foldE f a' xs) (f a x)

getHeaders3 hdrs file = withLinesOf file (foldE findHdrs (hdrs,M.empty))
To use this, you provide an "iteratee", a function which takes an accumulator and a line from the file, and returns a new accumulator embedded in an Either. Using the Left branch causes immediate termination of the enumeration. For example, to search for the first occurrence of each of a set of email headers:
findHdrs accum@(wanted,found) line =
  if null line
    then Left accum
    else case headerLine line of
      Nothing -> Right accum
      Just hdr -> case findDelete hdr wanted of
        Nothing -> Right accum
        Just wanted ->
          let accum = (wanted, M.insert hdr line found)
          in if S.null wanted then Left accum else Right accum

headerLine :: String -> Maybe String
headerLine (':':xs) = Just []
headerLine (x:xs) = fmap (x:) (headerLine xs)
headerLine [] = Nothing

findDelete :: Ord a => a -> S.Set a -> Maybe (S.Set a)
findDelete e s = if S.member e s then Just (S.delete e s) else Nothing
It's a bit of a case-analysis nightmare
indeed, this part can be cleaned up considerably, using Monad Maybe:

findHdrs accum@(wanted,found) line =
  if null line || S.null wanted
    then Left accum
    else maybe (Right accum) id $ do
      (field,value) <- headerLine line
      wanted'       <- findDelete field wanted
      let found' = M.insert field value found
      return $! (Right $! ((,) $! wanted') $! found')

headerLine :: String -> Maybe (String,String)
headerLine xs = do
  (field,':':value) <- return (span (/=':') xs)
  let value' = dropWhile isSpace value
  return $! ((,) $! field) $! strictly value'

findDelete :: Ord a => a -> S.Set a -> Maybe (S.Set a)
findDelete e s = guard (S.member e s) >> return (S.delete e s)

strictly l = length l `seq` l

running the three variants over a moderately sized directory (>3k emails, one including a hugs-tarball;-), 1/2 are roughly equivalent, but Hugs claims that 3 allocates less and needs fewer garbage collections than 1/2, while GHC claims that it is the other way round..

claus
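In case it helps to see the entry point spelled out, a hypothetical call to the getHeaders function above might look like this (the file name and the set of wanted headers are invented for illustration; S and M abbreviate Data.Set and Data.Map as in the snippets above):

demo :: IO ()
demo = do
  -- look for just the From and Subject headers of one message file
  (_, found) <- getHeaders (S.fromList ["From","Subject"]) "0001.email"
  print (M.toList found)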

"Matthew Brecknell"
enumLines :: (a -> String -> Either a a) -> a -> FilePath -> IO a
enumLines iter accum filename = do
    h <- openFile filename ReadMode
    flip fix accum $ \iterate accum -> do
      try_line <- try (hGetLine h)
      case try_line of
        Left e -> hClose h >> return accum
        Right line -> do
          case iter accum line of
            Left accum -> hClose h >> return accum
            Right accum -> iterate accum
Another variation, enabling multiple iteratees (like a state machine), exception propagation, and no flip fix :)
newtype Iterator a = Iterator (a -> String -> (a,Maybe (Iterator a)))
enumLines :: Iterator a -> a -> FilePath -> IO (a, Maybe Exception)
enumLines iterator start filename = do
    h <- openFile filename ReadMode
    let f (Iterator iter) accum = do
          try_line <- try (hGetLine h)
          case try_line of
            Left e -> hClose h >> return (accum, Just e)
            Right line -> do
              case iter accum line of
                (acc', Nothing)   -> hClose h >> return (acc', Nothing)
                (acc', Just cont) -> f cont acc'
    f iterator start

-- Feri.
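A small, hypothetical iteratee for this Iterator type (a sketch added here, not from Feri's message): accumulate lines until the first blank line, i.e. an email header block, then stop early.

collectHeaders :: Iterator [String]
collectHeaders = Iterator step
  where
    step acc ""   = (acc, Nothing)                       -- blank line: stop enumerating
    step acc line = (acc ++ [line], Just collectHeaders) -- otherwise keep the same iteratee

-- usage sketch (file name invented for illustration):
--   (hdrs, merr) <- enumLines collectHeaders [] "0001.email"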

Pete Kazmier
I attempted to read Oleg's fold-stream implementation [1], as this approach sounds quite appealing to me, but I was completely overwhelmed, especially by all of the various type signatures used. It would be great if one of the regular Haskell bloggers (Tom Moertel, are you reading this?) would write a blog entry or two interpreting his implementation for those of us starting out in Haskell, perhaps beginning with a non-polymorphic version to emphasize the approach.
In the event any other Haskell newbie comes along someday and is just as overwhelmed as I was, I've found this post by Oleg to be much easier to understand than the above paper, because it is not as generic and thus the type signatures are a bit easier on the eyes: http://www.haskell.org/pipermail/haskell/2003-September/012741.html That said, I have a question regarding Hal's response to the above email, in which he states:
Just thought I'd mention that this is, in fact, my preferred method of iterating over a file. It alleviates the pain associated with lazy file IO, and simultaneously provides a useful abstraction. I actually have 3*2 functions that I use which look like:
type Iteratee iter seed = seed -> iter -> Either seed seed

hFoldChars :: FilePath -> Iteratee Char     seed -> seed -> IO seed
hFoldLines :: FilePath -> Iteratee String   seed -> seed -> IO seed
hFoldWords :: FilePath -> Iteratee [String] seed -> seed -> IO seed

type IterateeM iter seed = seed -> iter -> IO (Either seed seed)

hFoldCharsM :: FilePath -> IterateeM Char     seed -> seed -> IO seed
hFoldLinesM :: FilePath -> IterateeM String   seed -> seed -> IO seed
hFoldWordsM :: FilePath -> IterateeM [String] seed -> seed -> IO seed
Which perform as expected (hFoldWords(M) can be written in terms of hFoldLinesM, but I find I use it sufficiently frequently to warrant having it stand out). Also, of course, the only ones actually implemented are the (M) variants; the non-M variants just throw a return into the Iteratee.
What does he mean by the very last sentence? Oleg's version seems more like the non-M versions. What is his implication? Thanks, Pete
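Presumably that last sentence just means the pure variants are thin wrappers around the monadic ones, something like the following (a guess, reusing the type synonyms quoted above, and only shown for the line-based variant):

hFoldLines :: FilePath -> Iteratee String seed -> seed -> IO seed
hFoldLines path iter seed = hFoldLinesM path iterM seed
  where
    -- lift the pure iteratee into IO by wrapping its result in return
    iterM s line = return (iter s line)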

Pete,
mapM fileContentsOfDirectory >>= mapM_ print . threadEmails . map parseEmail . concat
By using the IO monad you've /scheduled/ your first 'print' to occur after
your last 'readFile', so every file is opened before the first file is read.
I've come across the same problem and would also be interested in a more
elegant solution. The problem is that your program assumes a readonly
filesystem, but the compiler doesn't know that. For all it knows, the
'print' after your first parse may be overwriting the contents of your
second file.
What we need is a library for a readonly filesystem. That is, all the same
functions but pure. I believe you could make this readonly library by
wrapping each readonly I/O function with 'unsafeInterleaveIO' or
'unsafePerformIO', but I don't really understand the consequences of using
'unsafe' calls, so I avoided it myself.
Until that's figured out, one solution is, as Don suggested, to read the
whole file at once when you read it, so that the runtime can close the file
handle. This lets you leave the rest of your code alone, but if your files
are too big, as they were for me, this will blow your heap!
My solution (code below) was to rearrange my I/O calls so that each print is
executed immediately after each readFile. This takes away the modularity
you're looking for and assumes you can process one file at a time, but without
a safe readonly filesystem library and files too big to be read all at once,
I think this is about the best we can do for now. Hopefully others will
disagree!
main = do
[dir] <- getArgs
paths <- getFilePaths dir
mapM_ (either print prettyPrint `oM` parse) paths
getFilePaths = return . flatten `oM` getFileTree
-- Given a path, this returns a tree
getFileTree :: FilePath -> IO (Tree FilePath)
getFileTree = unfoldTreeM childPaths
where
childPaths dir = do
fs <- if' (getDirectoryContents dir) (return []) =<<
doesDirectoryExist dir
return (dir, [dir ++ "/" ++ p | p <- fs, head p /= '.'])
infixl 8 `oM`
(a `oM` b) x = b x >>= a
if' t f b = if b then t else f
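Greg's snippet leaves out its imports; presumably something like the following is needed (parse and prettyPrint are his own email-parsing and pretty-printing functions, which are not shown):

import Data.Tree (Tree, flatten, unfoldTreeM)
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.Environment (getArgs)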
Thanks,
Greg
On 3/14/07, Pete Kazmier
When using readFile to process a large number of files, I am exceeding the resource limits for the maximum number of open file descriptors on my system. How can I enhance my program to deal with this situation without making significant changes?
My program takes one or more directories that contain email messages, stored one per file, and prints a list of all the email threads. Here is a snippet of output:
New addition to the Kazmier family Casey Kazmier
Memoization in Erlang? Thomas Johnsson Ulf Wiger \(AL/EAB\)
As a newcomer to Haskell, I am intrigued by lazy evaluation and how it can influence one's designs. With that said, I wrote the program as a sequence of list manipulations which seemed quite natural to do in Haskell starting with reading the contents of the each file. Here is the algorithm at the high level:
1. Read contents of all files returning a list of Strings 2. Map over the list and parse each String as an Email 3. Sort the list of Emails 4. Group the list of Emails by Subject 5. Map over the grouped list to create a list of Threads 6. Finally, print the list of Threads
It is my understanding that, as a result of lazy IO, the entire file does not need to be read into memory because parseEmail only inspects the topmost portion of the email (its headers), which is a key part of my design as some of the files can be quite large. Unfortunately, as soon as I run this program on a directory with more than 1024 files, GHC craps out on me due to resource limits. It seems that the handles opened by readFile remain open. Would this be common across all Haskell implementations?
How do I go about fixing this without making a significant number of changes to my program? Did I make a mistake in steps 1 and 2 above? Should I have read and parsed a single file at a time, and then move on to the next?
I'd appreciate any other comments on the program as well. I feel this is the best example of Haskell code that I have written. Compared to the first version of this program I wrote a few months ago, this is a hundred times better.
Here is the program:
module Main where
import Control.Monad (filterM, liftM) import Data.List import Data.Maybe import System.Directory import System.Environment
type From = String type Subject = String data Email = Email {from :: From, subject :: Subject} deriving Show data Thread = Thread [Email]
instance Show Thread where show (Thread emails@(e:es)) = title ++ senders where title = newline . bolder . subject $ e sender = newline . indent . from senders = concatMap sender emails newline = (++ "\n") indent = (" " ++) bolder = ("\27[0;32;40m" ++) . (++ "\27[0m")
main = getArgs >>= mapM fileContentsOfDirectory >>= mapM_ print . threadEmails . map parseEmail . concat
fileContentsOfDirectory :: FilePath -> IO [String] fileContentsOfDirectory dir = setCurrentDirectory dir >> getDirectoryContents dir >>= filterM doesFileExist >>= -- ignore directories mapM readFile
parseEmail :: String -> Email parseEmail text = Email (getHeader "From") (getHeader "Subject") where getHeader = fromMaybe "N/A" . flip lookup headers headers = concatMap mkassoc . takeWhile (/="") $ lines text mkassoc s = case findIndex (==':') s of Just n -> [(take n s, drop (n+2) s)] Nothing -> []
threadEmails :: [Email] -> [Thread] threadEmails = map Thread . groupBy (fuzzy (==)) . sortBy (fuzzy compare) where fuzzy fn e e' = stripReFwd (subject e) `fn` stripReFwd (subject e') stripReFwd = stripSpaces . reverse . stripToColon . reverse stripSpaces = dropWhile (==' ') stripToColon = takeWhile (/=':')

Greg Fitzgerald wrote:
What we need is a library for a readonly filesystem. That is, all the same functions but pure. I believe you could make this readonly library by wrapping each readonly I/O function with 'unsafeInterleaveIO' or 'unsafePerformIO', but I don't really understand the consequences of using 'unsafe' calls, so I avoided it myself.
SlurpDirectory from darcs tries to do this. It is likely just specific to darcs needs, but perhaps it would be a useful thing. I've thought about making something like this for the purposes of harchive, but concluded that I need too much control over when things happen. Having a memory leak that tries to read an entire filesystem into memory isn't going to work well :-) Dave

On Wed, Mar 14, 2007 at 06:05:31PM -0700, David Brown wrote:
Greg Fitzgerald wrote:
What we need is a library for a readonly filesystem. That is, all the same functions but pure. I believe you could make this readonly library by wrapping each readonly I/O function with 'unsafeInterleaveIO' or 'unsafePerformIO', but I don't really understand the consequences of using 'unsafe' calls, so I avoided it myself.
SlurpDirectory from darcs tries to do this. It is likely just specific to darcs needs, but perhaps it would be a useful thing.
I've thought about making something like this for the purposes of harchive, but concluded that I need too much control over when things happen. Having a memory leak that tries to read an entire filesystem into memory isn't going to work well :-)
Dave
I haven't looked at the darcs function, but for this problem something like

--tested
readFiles :: [FilePath] -> IO [String]
readFiles files = return (map (unsafePerformIO . readFile) files)

or

--untested
readFiles2 files = mapM (unsafeInterleaveIO . readFile) files

probably works, as long as you read each file completely before going to the next. If the thunk returned by readFile was so lazy as to not even open the file before being forced you might not need this sort of thing at all (but it couldn't report problems at the readFile site).

If you don't process each file completely you could make accessing any list element force earlier files with something like this:

serialize [] = []
serialize (a:as) = a : serialize' a as
  where serialize' a (b:bs) = a `seq` b : serialize' b bs
        serialize' _ [] = []

Brandon
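A hedged usage sketch (added here, not Brandon's): combine readFiles2 and serialize so that walking the result list forces the earlier files in order, e.g. to print the first line of every file named on the command line.

-- assumes readFiles2 and serialize as defined above,
-- plus: import System.Environment (getArgs)
main :: IO ()
main = do
  files <- getArgs
  contents <- readFiles2 files
  mapM_ (putStrLn . takeWhile (/= '\n')) (serialize contents)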

Pete Kazmier:
When using readFile to process a large number of files, I am exceeding the resource limits for the maximum number of open file descriptors on my system. How can I enhance my program to deal with this situation without making significant changes?
AFAIU, file handles opened by readFile are closed in the following circumstances:

1) When lazy evaluation of the returned contents reaches the end of the file.
2) When the garbage collector runs the finaliser for the file structure. Obviously, for this to happen, the file structure must be unreachable.

Unfortunately, the unreachability of the file structure doesn't guarantee anything about the timeliness of the garbage collection. While the garbage collector does respond to memory utilisation pressure, it doesn't respond to file handle utilisation pressure. Consequently, any program which uses readFile to read small portions of many files is likely to exhibit the problem you are experiencing.

I'm not aware of an easy fix. You could use openFile, hGetContents and hClose, but then you have to be careful to avoid another problem, as described in [1]. In [2], Oleg describes the deeper problems with getContents and friends (including readFile), and advocates explicitly sequenced I/O. I have a feeling there have been even more discussions around this topic recently, but they elude me at the moment.

Of course, we'll be most curious to hear which solution you choose.

[1] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023189.html
[2] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023073.html
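To make the kind of care required with the openFile/hGetContents/hClose route concrete, here is one possible pattern (a sketch added here, not taken from [1] or [2]): force everything you need from the contents before the handle is closed, so the rest of the lazily-read file can be discarded safely.

import Control.Exception (bracket)
import System.IO

-- read only the header block of one message, fully forced before hClose
readHeaderBlock :: FilePath -> IO [String]
readHeaderBlock path =
  bracket (openFile path ReadMode) hClose $ \h -> do
    s <- hGetContents h
    let hdrs = takeWhile (/= "") (lines s)
    length (concat hdrs) `seq` return hdrs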

When using readFile to process a large number of files, I am exceeding the resource limits for the maximum number of open file descriptors on my system. How can I enhance my program to deal with this situation without making significant changes? .. 1. Read contents of all files returning a list of Strings 2. Map over the list and parse each String as an Email 3. Sort the list of Emails 4. Group the list of Emails by Subject 5. Map over the grouped list to create a list of Threads 6. Finally, print the list of Threads
this logical pipeline is actually executed as driven by demand from strict i/o operations, such as step 6, which opens many files early and processes them late; this leaves many files open and partly read, grabbing file resources like mad. as others have pointed out, readFile's input resources can be freed when either the file has been read to the end, or all references to readFile's output have been dropped.
How do I go about fixing this without making a significant number of changes to my program? Did I make a mistake in steps 1 and 2 above? Should I have read and parsed a single file at a time, and then move on to the next?
in order to keep the overall structure, one could move readFile backwards and parseEmail forwards in the pipeline, until the two meet. then make sure that parseEmail completely constructs the internal representation of each email, thereby keeping no implicit references to the external representation. hth, claus
module Main where
import Control.Monad (filterM, liftM)
import Data.List
import Data.Maybe
import System.Directory
import System.Environment
import System.IO

type From = String
type Subject = String

data Email = Email {from :: From, subject :: Subject} deriving Show
data Thread = Thread [Email]

instance Show Thread where
  show (Thread emails@(e:es)) = title ++ senders
    where title = newline . bolder . subject $ e
          sender = newline . indent . from
          senders = concatMap sender emails
          newline = (++ "\n")
          indent = (" " ++)
          bolder = ("\27[0;32;40m" ++) . (++ "\27[0m")

main = getArgs
   >>= mapM filesOfDirectory
   >>= mapM (mapM processFile)
   >>= mapM_ print . threadEmails . concat

filesOfDirectory :: FilePath -> IO [String]
filesOfDirectory dir = fmap (map ((dir++"/")++)) (getDirectoryContents dir)
                   >>= filterM doesFileExist -- ignore directories

processFile path = do
  text <- readFile path
  return $! (parseEmail text)

parseEmail :: String -> Email
parseEmail text = (Email $! (getHeader "From")) $! (getHeader "Subject")
  where strictly s = length s `seq` s
        getHeader = strictly . fromMaybe "N/A" . flip lookup headers
        headers = concatMap mkassoc . takeWhile (/="") $ lines text
        mkassoc s = case findIndex (==':') s of
                      Just n -> [(take n s, drop (n+2) s)]
                      Nothing -> []

threadEmails :: [Email] -> [Thread]
threadEmails = map Thread . groupBy (fuzzy (==)) . sortBy (fuzzy compare)
  where fuzzy fn e e' = stripReFwd (subject e) `fn` stripReFwd (subject e')
        stripReFwd = stripSpaces . reverse . stripToColon . reverse
        stripSpaces = dropWhile (==' ')
        stripToColon = takeWhile (/=':')

When using readFile to process a large number of files, I am exceeding the resource limits for the maximum number of open file descriptors on my system. How can I enhance my program to deal with this situation without making significant changes?
I note that if you use mmap(2) to map a disk file into virtual memory, you may close the file descriptor afterwards and still access the data. That might also be relatively economical in other respects. Pardon me if this has already been suggested.

I should know how to make ByteStrings from offsets into a mapped region, but all I can say right now is, I'm pretty sure that would be no problem. Don't give them any finalizer.

If you need to eventually unmap the files, that would be a problem, but I think if they're mapped right, you won't need that. They obviously have their own backing store, and if you just pretend they're unmapped, the host virtual memory management should ideally let you get away with that. And file data will be contiguous and sequential in memory, which in principle ought to be optimal for memory resources.

Donn Cave, donn@drizzle.com

donn:
When using readFile to process a large number of files, I am exceeding the resource limits for the maximum number of open file descriptors on my system. How can I enhance my program to deal with this situation without making significant changes?
I note that if you use mmap(2) to map a disk file into virtual memory, you may close the file descriptor afterwards and still access the data. That might also be relatively economical in other respects. Pardon me if this has already been suggested.
I should know how to make ByteStrings from offsets into a mapped region, but all I can say right now is, I'm pretty sure that would be no problem. Don't give them any finalizer.
In fact, the (commented out code) for mmapFile :: FilePath -> IO ByteString is in Data.ByteString now. You can also provide munmap as the finaliser, and let the GC take care of that. -- Don

Quoth dons@cse.unsw.edu.au (Donald Bruce Stewart):
...
| In fact, the (commented out code) for mmapFile :: FilePath -> IO ByteString
| is in Data.ByteString now. You can also provide munmap as the finaliser,
| and let the GC take care of that.

It does make more sense to unmap the string when it's the whole file, of course, and I assume it would force the OS to free the memory, if it makes any difference.

Is there any cost to GC that can be avoided if there's no finalizer? I mean, in the present case you might generate many thousands of substrings during analysis of the file, so in theory there's a lot less book-keeping needed if there's nothing to do in the end anyway.

Donn Cave, donn@drizzle.com

On 3/14/07, Pete Kazmier
When using readFile to process a large number of files, I am exceeding the resource limits for the maximum number of open file descriptors on my system. How can I enhance my program to deal with this situation without making significant changes?
I made it work with 20k files with only minor modifications.
type Subject = String
data Email = Email {from :: From, subject :: Subject} deriving Show
It has been pointed out that parseEmail would work better if it were strict; the easiest way to accomplish this seems to be to replace the above line by

data Email = Email {from :: !From, subject :: !Subject} deriving Show

[snip]
fileContentsOfDirectory :: FilePath -> IO [String] fileContentsOfDirectory dir = setCurrentDirectory dir >> getDirectoryContents dir >>= filterM doesFileExist >>= -- ignore directories mapM readFile
And here's another culprit - readFile actually opens the file before any of its output is used. So I imported System.IO.Unsafe and replaced the last line above by

mapM (unsafeInterleaveIO . readFile)

With these two changes the program seems to work fine.

HTH,
Bertram
participants (22)

- Alistair Bayley
- Bertram Felgenhauer
- Brandon Michael Moore
- Bryan Burgers
- Bryan O'Sullivan
- Chung-chieh Shan
- Claus Reinke
- Dan Weston
- David Brown
- Donn Cave
- dons@cse.unsw.edu.au
- Dougal Stanton
- Ferenc Wagner
- Greg Fitzgerald
- Isaac Dupree
- Jules Bean
- Ketil Malde
- Lennart Augustsson
- Matthew Brecknell
- Nicolas Frisby
- Paul Hudak
- Pete Kazmier