
Instead of a question, I thought I'd share a moment of lazy-evaluation enlightenment I had last night.

I have some code that recursively descends a directory, gets the SHA1 hashes for all the files, and builds a map of which file paths share the same SHA1 hash. The code that actually generates the hash looked like this:

    sha1file :: FilePath -> IO String
    sha1file fn = do
        bs <- expandPath fn >>= BSL.readFile
        return $ PureSHA.showDigest $ PureSHA.sha1 bs

Everything worked fine on paths without many files in them, but choked on paths with many files:

    "Exception: getCurrentDirectory: resource exhausted (Too many open files)"

This was driving me crazy; ByteString.Lazy.readFile is supposed to close the file once it's done. I kept going over my code, wondering what was at fault, until it finally clicked: *the hashes weren't being generated until I actually tried to view them*, and thus all the files were being held open until that point! I made a single change to my "sha1file" function:

    sha1file :: FilePath -> IO String
    sha1file fn = do
        bs <- expandPath fn >>= BSL.readFile
        return $ PureSHA.showDigest $! PureSHA.sha1 bs

(the "$!") ... and everything worked perfectly. The code now finished processing each file before opening the next one, and I was happy. :-)
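For context, here is a minimal sketch of the surrounding duplicate-finder described above, assuming the sha1file function from the post (and its BSL/PureSHA imports); the names listFilesRecursive and hashMap and the use of System.Directory and Data.Map are illustrative assumptions, not the poster's actual code:

    import qualified Data.Map as M
    import Control.Monad (forM, foldM)
    import System.Directory (getDirectoryContents, doesDirectoryExist)
    import System.FilePath ((</>))

    -- Recursively collect every file path under a directory.
    listFilesRecursive :: FilePath -> IO [FilePath]
    listFilesRecursive dir = do
        names <- getDirectoryContents dir
        let paths = [ dir </> n | n <- names, n `notElem` [".", ".."] ]
        nested <- forM paths $ \p -> do
            isDir <- doesDirectoryExist p
            if isDir then listFilesRecursive p else return [p]
        return (concat nested)

    -- Group file paths by their SHA1 hash, using sha1file from above.
    hashMap :: [FilePath] -> IO (M.Map String [FilePath])
    hashMap = foldM insertHash M.empty
      where
        insertHash m fp = do
            h <- sha1file fp
            return (M.insertWith (++) h [fp] m)

With the lazy version of sha1file, nothing forces hashMap to finish hashing one file before the next one is opened, which is exactly how the handle exhaustion arises.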

Tom,

The bad news is that

1. Haskell makes no guarantee about when the files are closed,
2. file handles are a limited resource, and
3. lazy I/O doesn't handle errors in a recoverable fashion.

Unfortunately this means that lazy I/O is fundamentally unsound. The only safe way to do it is to read the file strictly in blocks using Data.ByteString.hGet.

Steve

Tom Tobin wrote:
Instead of a question, I thought I'd share a moment of lazy-evaluation enlightenment I had last night.
I have some code that recursively descends a directory, gets the SHA1 hashes for all the files, and builds a map of which file paths share the same SHA1 hash. The code that actually generates the hash looked like this:
    sha1file :: FilePath -> IO String
    sha1file fn = do
        bs <- expandPath fn >>= BSL.readFile
        return $ PureSHA.showDigest $ PureSHA.sha1 bs
Everything worked fine on paths without many files in them, but choked on paths with many files:
"Exception: getCurrentDirectory: resource exhausted (Too many open files)"
This was driving me crazy; ByteString.Lazy.readFile is supposed to close the file once it's done. I kept going over my code, wondering what was at fault, until it finally clicked: *the hashes weren't being generated until I actually tried to view them*, and thus all the files were being held open until that point! I made a single change to my "sha1file" function:
    sha1file :: FilePath -> IO String
    sha1file fn = do
        bs <- expandPath fn >>= BSL.readFile
        return $ PureSHA.showDigest $! PureSHA.sha1 bs
(the "$!") ... and everything worked perfectly. The code now finished processing each file before opening the next one, and I was happy. :-) _______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners

On Sat, Feb 20, 2010 at 4:42 PM, Stephen Blackheath [to Haskell-Beginners] wrote:
Tom,
The bad news is that 1. Haskell makes no guarantee about when the files are closed,
Hmm, Data.ByteString.Lazy.readFile's docstring says: "Read an entire file lazily into a ByteString. The Handle will be held open until EOF is encountered." It certainly seemed to change matters once I switched that $ to $!, though; I don't see why that would have helped me unless the handles were indeed being closed.
2. file handles are a limited resource
Well, yes, that's why I ran into the original problem.
and 3. lazy I/O doesn't handle errors in a recoverable fashion.
I suppose this will be something I'll run into before too long.
Unfortunately this means that lazy I/O is fundamentally unsound.
The only safe way to do it is to read the file strictly in blocks using Data.ByteString.hGet.
But with the strict version of ByteString, how would I compute the SHA1 hash of an 8 GB file on a machine with quite a bit less memory? I can't imagine Haskell just has no way to handle a case that other languages handle easily.

On Sunday 21 February 2010 02:23:48, Tom Tobin wrote:
On Sat, Feb 20, 2010 at 4:42 PM, Stephen Blackheath [to Haskell-Beginners] wrote:
Tom,
The bad news is that 1. Haskell makes no guarantee about when the files are closed,
Hmm, Data.ByteString.Lazy.readFile's docstring says:
"Read an entire file lazily into a ByteString. The Handle will be held open until EOF is encountered."
There is no hard guarantee when the file will be closed, but looking at the relevant code,

    hGetContentsN :: Int -> Handle -> IO ByteString
    hGetContentsN k h = lazyRead -- TODO close on exceptions
      where
        lazyRead = unsafeInterleaveIO loop

        loop = do
            c <- S.hGetNonBlocking h k
            --TODO: I think this should distinguish EOF from no data available
            -- the underlying POSIX call makes this distinction, returning either
            -- 0 or EAGAIN
            if S.null c
              then do eof <- hIsEOF h
                      if eof then hClose h >> return Empty
                             else hWaitForInput h (-1) >> loop
              else do cs <- lazyRead
                      return (Chunk c cs)

I'd say the file is closed as soon as EOF is encountered. If you don't open too many files before you've finished reading, it shouldn't be a problem.
It certainly seemed to change matters once I switched that $ to $!, though; I don't see why that would have helped me unless the handles were indeed being closed.
Right. The $! forced the file to be read until the end, so it was closed before too many others were opened.
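For illustration, an equivalent way to get the same forcing is Control.Exception.evaluate. This is only a sketch: it reuses the BSL and PureSHA aliases from the original post (Data.Digest.Pure.SHA is assumed to be the module behind PureSHA) and omits the poster's expandPath helper, which wasn't shown:

    import Control.Exception (evaluate)
    import qualified Data.ByteString.Lazy as BSL
    import qualified Data.Digest.Pure.SHA as PureSHA

    -- Forcing the digest inside IO makes sha1 consume the whole lazy
    -- ByteString, so readFile reaches EOF and closes the handle before
    -- the next file is opened.
    sha1file' :: FilePath -> IO String
    sha1file' fn = do
        bs <- BSL.readFile fn
        digest <- evaluate (PureSHA.sha1 bs)
        return (PureSHA.showDigest digest)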
2. file handles are a limited resource
Well, yes, that's why I ran into the original problem.
and 3. lazy I/O doesn't handle errors in a recoverable fashion.
I suppose this will be something I'll run into before too long.
Unfortunately this means that lazy I/O is fundamentally unsound.
The only safe way to do it is to read the file strictly in blocks using Data.ByteString.hGet.
But with the strict version of ByteString, how would I compute the SHA1 hash of an 8 GB file on a machine with quite a bit less memory? I can't imagine Haskell just has no way to handle a case that other languages handle easily.
Incrementally, the same way the SHA1 hash is computed with a lazy ByteString: read a chunk of the file (a multiple of 512 bits is a good idea), process it, read the next chunk, ..., until the end, then close the file. The difference is that this way you have exact control over what happens when; the unsafeInterleaveIO in the lazy ByteString code takes that control away. However, by forcing the results at the proper places, you gain enough control to avoid leaking file handles and several other unpleasant surprises (normally, at least; there may be cases where you can't).
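A minimal sketch of that strict, chunk-at-a-time pattern, using only Data.ByteString.hGet and withFile. The byte-counting example stands in for a hash-update step, since a real incremental SHA1 needs a hashing API that exposes update/finalize functions; the names foldFileChunks and fileLength are assumptions for illustration:

    import qualified Data.ByteString as S
    import System.IO (Handle, IOMode(ReadMode), withFile)

    -- Strictly fold a step function over fixed-size chunks of a file.
    -- withFile closes the handle as soon as the fold finishes, so at most
    -- one file is open at a time and read errors surface right here.
    foldFileChunks :: Int -> (a -> S.ByteString -> a) -> a -> FilePath -> IO a
    foldFileChunks chunkSize step z path = withFile path ReadMode (go z)
      where
        go acc h = do
            chunk <- S.hGet h chunkSize
            if S.null chunk
                then return acc                   -- hGet returns empty at EOF
                else let acc' = step acc chunk
                     in acc' `seq` go acc' h      -- force, then read the next chunk

    -- Example use: total byte count of a file, read 64 KiB at a time.
    fileLength :: FilePath -> IO Int
    fileLength = foldFileChunks 65536 (\n c -> n + S.length c) 0

Memory stays bounded by the chunk size, so an 8 GB file is no problem: only the running accumulator and the current chunk are live at any moment.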
participants (3)

- Daniel Fischer
- Stephen Blackheath [to Haskell-Beginners]
- Tom Tobin