
Ketil Malde:
Perhaps this is an esoteric way, but I think the nicest approach is to parse into a strict structure. If you fully evaluate each Email (or whatever structure you parse into), there will be no unevaluated thunks linking to the file, and it will be closed.
Not necessarily so, since you are making assumptions about the timeliness of garbage collection. I was similarly sceptical of Claus' suggestion:

Claus Reinke:
in order to keep the overall structure, one could move readFile backwards and parseEmail forwards in the pipeline, until the two meet. then make sure that parseEmail completely constructs the internal representation of each email, thereby keeping no implicit references to the external representation.
So here's a test. I don't have any big maildirs handy, so this is based on the simple exercise of printing the first line of each of a large number of files. First, the preamble.
import Control.Exception (bracket)
import System.Environment
import System.IO
main = do
  t:n:fs <- getArgs
  ([test0,test1,test2,test3] !! read t) (take (read n) $ cycle fs)
The following example corresponds to Pete's original program. As expected, when called with a sufficiently large number of files, it always results in file handle exhaustion without producing any output, since mapM readFile opens every file before the first line is printed:
test0 files = mapM readFile files >>= mapM_ (putStrLn.head.lines)
The next example corresponds (I think) to Claus' suggestion, in which the readFile and putStrLn are performed at the same point in the pipeline. I found that sometimes this runs without error, but other times it fails with file handle exhaustion. This seems to depend on the mood of the garbage collector, or at least the external conditions in which the garbage collector operates. It also appears to fail more frequently for small files. Without any knowledge of garbage collector internals, I'm guessing that this is because readFile reads in 8K chunks. Files significantly smaller than 8K allocate very little, so garbage collection cycles are likely to be much less frequent relative to the rate at which handles are opened, and therefore there is a greater likelihood of file handle exhaustion between GC cycles.
test1 files = mapM_ doStuff files
  where
    doStuff f = readFile f >>= putStrLn.head.lines
The third is similar to the second, except that it forces the entire contents of each file, so the file is read to the end and its handle can be closed. As expected, this saves me from file handle exhaustion, but it is grossly inefficient for large files, since only the first line is actually needed.
test2 files = mapM_ doStuff files
  where
    doStuff f = do
      contents <- readFile f
      putStrLn $ head $ lines contents
      return $! force contents
    force (x:xs) = force xs
    force []     = ()
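Incidentally, the hand-rolled force above only walks the spine of the string; demanding its length does the same job. Here is a variant along those lines (call it test2'; it is a sketch I haven't benchmarked, not one of the numbered tests):

test2' files = mapM_ doStuff files
  where
    doStuff f = do
      contents <- readFile f
      putStrLn $ head $ lines contents
      -- demanding the length walks the whole string, so the file is
      -- read to EOF and the lazy read closes the handle
      return $! length contents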
In the fourth example, I explicitly close the file handle. This also saves me from file handle exhaustion, but I must be careful to force everything I need to be read before the handle is closed. Returning a lazy computation would be no good, as discovered in [1]. In this case, putStrLn does all the forcing I need.
test3 files = mapM_ bracketStuff files
  where
    bracketStuff f = bracket (openFile f ReadMode) hClose doStuff
    doStuff h = hGetContents h >>= putStrLn.head.lines
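For contrast, here is the pitfall mentioned above in its simplest form (a hypothetical badBracket, not one of the numbered tests): the lazy string returned by hGetContents escapes the bracket, the handle is closed before anything has been demanded, and the contents come back truncated or empty.

badBracket :: FilePath -> IO String
-- looks innocent, but hClose runs before the lazy contents are ever
-- demanded, so the string is cut off at whatever was already buffered
badBracket f = bracket (openFile f ReadMode) hClose hGetContents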
As Oleg points out in [2], all of the above have the problem that it is impossible to tell the difference between a read error and end-of-file. I had intended to write an example using explicitly sequenced I/O, but Oleg has saved me the trouble with the post he made just now [3].

[1] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023189.html
[2] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023073.html
[3] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023523.html
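P.S. For what it's worth, here is a rough sketch of the kind of explicitly sequenced I/O I had in mind (my own illustration, not Oleg's code; test4 and printFirstLine are names I've made up): each file is read with strict, per-handle operations, so the handle is closed deterministically and read errors surface where they occur.

test4 files = mapM_ printFirstLine files
  where
    printFirstLine f =
      bracket (openFile f ReadMode) hClose $ \h -> do
        eof <- hIsEOF h
        if eof
          then return ()                 -- empty file: nothing to print
          else hGetLine h >>= putStrLn   -- strict: reads exactly one line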