
On Sunday, 21 February 2010 02:23:48, Tom Tobin wrote:
On Sat, Feb 20, 2010 at 4:42 PM, Stephen Blackheath [to Haskell-Beginners]
wrote:
Tom,
The bad news is that 1. Haskell makes no guarantee about when the files are closed,
Hmm, Data.ByteString.Lazy.readFile's docstring says:
"Read an entire file lazily into a ByteString. The Handle will be held open until EOF is encountered."
There is no hard guarantee when the file will be closed, but looking at the relevant code,

    hGetContentsN :: Int -> Handle -> IO ByteString
    hGetContentsN k h = lazyRead -- TODO close on exceptions
      where
        lazyRead = unsafeInterleaveIO loop

        loop = do
            c <- S.hGetNonBlocking h k
            --TODO: I think this should distinguish EOF from no data available
            -- the underlying POSIX call makes this distinction, returning either
            -- 0 or EAGAIN
            if S.null c
              then do eof <- hIsEOF h
                      if eof then hClose h >> return Empty
                             else hWaitForInput h (-1)
                               >> loop
              else do cs <- lazyRead
                      return (Chunk c cs)

I'd say the file is closed as soon as EOF is encountered. If you don't open too many files before you've finished reading, it shouldn't be a problem.
It certainly seemed to change matters once I switched that $ to $!, though; I don't see why that would have helped me unless the handles were indeed being closed.
Right. The $! forced the file to be read until the end, so it was closed before too many others were opened.
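A minimal sketch of that effect, assuming the lazy readFile behaves as in the code quoted above. The name fileLength is made up for illustration and is not taken from Tom's program. Forcing the length makes the whole chunk list get traversed, so EOF is reached and the handle is closed before the action returns:

```haskell
import qualified Data.ByteString.Lazy as L
import Data.Int (Int64)

-- Hypothetical helper: length of a file, read through a lazy ByteString.
fileLength :: FilePath -> IO Int64
fileLength path = do
    contents <- L.readFile path
    -- With plain ($) the length would stay an unevaluated thunk and the
    -- handle would remain open until somebody demands the result.
    -- ($!) evaluates the length here, which reads the file up to EOF.
    return $! L.length contents
```

With this, something like mapM fileLength over a long list of paths should only ever have one file open at a time.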
2. file handles are a limited resource
Well, yes, that's why I ran into the original problem.
and 3. lazy I/O doesn't handle errors in a recoverable fashion.
I suppose this will be something I'll run into before too long.
Unfortunately this means that lazy I/O is fundamentally unsound.
The only safe way to do it is to read the file strictly in blocks using Data.ByteString.hGet.
But with the strict version of ByteString, how would I compute the SHA1 hash of an 8 GB file on a machine with quite a bit less memory? I can't imagine Haskell just has no way to handle a case that other languages handle easily.
Incrementally, just as the SHA1 hash is computed with a lazy ByteString. Read a chunk of the file (a multiple of 512 bits is a good idea), process it, read the next chunk, and so on until the end, then close the file. The difference is that this way you have exact control over what happens when; the unsafeInterleaveIO in the lazy ByteString code takes that control away. However, by forcing the results at the proper places, you gain enough control to avoid leaking file handles and several other unpleasant surprises. Normally, at least; there may be cases where you can't.
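The chunk-by-chunk loop could be sketched roughly as follows. The names foldFileChunks, checksumStep and fileChecksum are invented for this example, and the additive checksum is only a stand-in for a real incremental SHA1 update, which would come from a separate hashing library:

```haskell
import qualified Data.ByteString as S
import Data.Word (Word32)
import System.IO (IOMode (ReadMode), withBinaryFile)

-- Hypothetical helper: strictly fold over a file in chunks of at most
-- chunkSize bytes. Only one chunk is held in memory at a time, and
-- withBinaryFile closes the handle when the loop ends or throws.
foldFileChunks :: Int -> (a -> S.ByteString -> a) -> a -> FilePath -> IO a
foldFileChunks chunkSize step z path =
    withBinaryFile path ReadMode (go z)
  where
    go acc h = do
        chunk <- S.hGet h chunkSize
        if S.null chunk                 -- hGet returns empty at EOF
            then return acc
            else let acc' = step acc chunk
                 in acc' `seq` go acc' h  -- force, so no thunks pile up

-- Stand-in for a hash update: adds up all bytes modulo 2^32.
checksumStep :: Word32 -> S.ByteString -> Word32
checksumStep = S.foldl' (\acc w -> acc + fromIntegral w)

-- 64 KiB chunks, which is a multiple of 512 bits (64 bytes).
fileChecksum :: FilePath -> IO Word32
fileChecksum = foldFileChunks (64 * 1024) checksumStep 0
```

Memory use stays bounded by the chunk size however large the file is, and the point at which the file is closed is explicit rather than left to lazy evaluation.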