
claus.reinke:
Not necessarily so, since you are making assumptions about the timeliness of garbage collection. I was similarly sceptical of Claus' suggestion:
Claus Reinke:
in order to keep the overall structure, one could move readFile backwards and parseEmail forwards in the pipeline, until the two meet. then make sure that parseEmail completely constructs the internal representation of each email, thereby keeping no implicit references to the external representation.
you are quite right to be skeptical!-) indeed, in the latest Handle documentation, we still find the following excuse for GHC:
http://www.haskell.org/ghc/docs/latest/html/libraries/base/System-IO.html#t%...
GHC note: a Handle will be automatically closed when the garbage collector detects that it has become unreferenced by the program. However, relying on this behaviour is not generally recommended: the garbage collector is unpredictable. If possible, use an explicit hClose to close Handles when they are no longer required. GHC does not currently attempt to free up file descriptors when they have run out, it is your responsibility to ensure that this doesn't happen.
this issue has been discussed in the past, and i consider it a bug if the memory manager tells me to handle memory myself;-) so i do hope that this infelicity will be removed in the future (run out of file descriptors -> run a garbage collection and try again, before giving up entirely).
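The "collect and try again" idea can be approximated in user code. The following is only a sketch under stated assumptions: openFileRetry is a hypothetical helper, not part of any library, and it assumes that running out of file descriptors surfaces as an IOError for which isFullError returns True. Handle finalizers may also run some time after performGC returns, so the single retry can still fail.

```haskell
import Control.Exception (catch, throwIO)
import System.IO
import System.IO.Error (isFullError)
import System.Mem (performGC)

-- Hypothetical helper (not a library function): open a file, and if the
-- attempt fails with a resource-exhaustion error, run a major garbage
-- collection once and try again before giving up.
openFileRetry :: FilePath -> IOMode -> IO Handle
openFileRetry path mode = openFile path mode `catch` handler
  where
    handler :: IOError -> IO Handle
    handler e
      -- Assumption: descriptor exhaustion is reported as a "full" error.
      | isFullError e = performGC >> openFile path mode
      -- Any other failure (missing file, permissions, ...) is re-thrown.
      | otherwise     = throwIO e
```

This only papers over the problem from the outside; it does not make relying on the collector for Handle cleanup any more predictable.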
in fact, my local version had two variants of processFile - the one i posted and one with explicit file handle handling (the code was restructured this way exactly to hide this implementation decision in a single function). i did test both variants on a directory with lots of copies of a few emails (>2000 files), and both worked on my system, so i hoped -rather than checked- that the handle collection issue had finally been fixed, and made the mistake of removing the more complex variant before posting. thanks for pointing out that error - as the documentation above demonstrates, it isn't good to rely on assumptions, nor on tests.
so here is the alternate variant of processFile (for which i imported System.IO):
processFile path = do
  f <- openFile path ReadMode
  text <- hGetContents f
  let email = parseEmail text
  email `seq` hClose f
  return email
all this hassle to expose a file handle to call hClose on, just so that the GC does not have to..
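One way to reduce the bookkeeping is to let withFile own the handle, so it is closed even if parsing throws. This is a minimal sketch, not the code from the thread: the Email type and the parseEmail shown here are hypothetical stand-ins (they keep only the header lines), and the forcing step is written by hand to stay within base. Note that seq alone evaluates only to the outermost constructor, so the sketch forces the whole structure before the handle is closed.

```haskell
import Control.Exception (evaluate)
import System.IO

-- Hypothetical stand-in for the thread's internal representation:
-- just the header lines of a message.
newtype Email = Email [String] deriving (Eq, Show)

-- Hypothetical stand-in parser: headers are the lines before the first
-- blank line.
parseEmail :: String -> Email
parseEmail = Email . takeWhile (not . null) . lines

processFile :: FilePath -> IO Email
processFile path =
  withFile path ReadMode $ \h -> do
    text <- hGetContents h
    let email@(Email headers) = parseEmail text
    -- Walk every header to its end, so the result no longer depends on
    -- the lazily read file contents once withFile closes the handle.
    _ <- evaluate (sum (map length headers))
    return email
```

With a richer Email type, the hand-written forcing would have to cover every field that is built from the file contents.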
Are we at the point that we should consider adding some documentation on how to deal with this issue? And are the recommendations either to use strict IO (should we have a package for System.IO.Strict??), or to rely on strictness in the consumer of the data? -- Don
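For the strict IO option, a minimal sketch of what such a function could look like, using only base. The name readFileStrict is hypothetical and is not taken from any existing package; it reads the whole file into memory before the handle is closed, which trades memory for predictable descriptor usage.

```haskell
import Control.Exception (evaluate)
import System.IO

-- Hypothetical strict counterpart to readFile: the whole file is read
-- before withFile closes the handle.
readFileStrict :: FilePath -> IO String
readFileStrict path =
  withFile path ReadMode $ \h -> do
    contents <- hGetContents h
    -- Forcing the length pulls the entire contents through the handle.
    _ <- evaluate (length contents)
    return contents
```

Because the handle is already closed when readFileStrict returns, the same path can be reopened for writing immediately afterwards.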