On hGetContents semi-closedness

From http://www.haskell.org/ghc/docs/latest/html/libraries/base/System-IO.html:

    Computation hGetContents hdl returns the list of characters corresponding
    to the unread portion of the channel or file managed by hdl, which is put
    into an intermediate state, semi-closed. In this state, hdl is effectively
    closed, but items are read from hdl on demand and accumulated in a special
    list returned by hGetContents hdl.

What state is that? It seems to be something related to Haskell; I couldn't find a definition for it in the unix documentation I have lying around. Reading on in the System.IO documentation, I find:

    Any operation that fails because a handle is closed, also fails if a
    handle is semi-closed. The only exception is hClose. A semi-closed handle
    becomes closed:
      * if hClose is applied to it;
      * if an I/O error occurs when reading an item from the handle;
      * or once the entire contents of the handle has been read.

So it looks like hGetContents sets some flag in the handle saying it is in that semi-closed state, so no other operations are valid; but I think the file descriptor to the file is actually kept open. It's only when the contents are entirely consumed that the descriptor gets closed. Is hGetContents responsible for closing the descriptor, or is it the garbage collector? Who closes the descriptor when the contents are read?

Looking at the hGetContents function definition, it uses lazyRead to read the contents, but it calls wantReadableHandle, which might or might not close the handle after lazyRead. From the documentation, it seems the only way for us to actively close the descriptor is either reading the whole thing or calling hClose. But one has to be very careful about when to close the handle, because it doesn't matter if the contents merely look consumed; they really have to be consumed.

The following code prints the contents of the file foo to the screen:

    openFile "foo" ReadMode >>=
        \handle -> (hGetContents handle >>= (\s -> putStr s >> hClose handle))    -- [1]

The following code does not:

    openFile "foo" ReadMode >>=
        \handle -> (hGetContents handle >>= (\s -> hClose handle >> putStr s))    -- [2]

It is common knowledge that Haskell is very lazy: it only does things when absolutely necessary, otherwise it prefers to write them off in the TODO list. It does that even if writing to the TODO list takes longer than the computation would; that's how lazy it is. That's the origin of the often used expression "he is quite a haskell". The question most people don't have a good answer to is: when does Haskell think it is necessary to do something?

In [2], the lazyRead inside of hGetContents (or perhaps hGetContents altogether) only gets executed after hClose handle. Why is that? How do I figure out the ordering of computation?

(I'm probably glossing over important stuff and getting some details wrong, as usual, but I hope it's good enough to give some idea of what's going on.)

On 2/15/11 11:57, Rafael Cunha de Almeida wrote:
> What state is that? It seems to be something related to Haskell; I couldn't
> find a definition for it in the unix documentation I have lying around.
Yes, it's specific to Haskell's runtime; if you have a handle being read lazily "in the background" (see unsafeInterleaveIO), trying to use it "in the foreground" is problematic. Specifically, which call(s) should get the data?
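For illustration, here is a minimal sketch of that "background read" idea, assuming GHC's System.IO.Unsafe; lazyLines is a made-up stand-in for hGetContents (the real one is more careful about buffering and the semi-closed bookkeeping), but the shape is roughly the same:

    import System.IO
    import System.IO.Unsafe (unsafeInterleaveIO)

    -- hypothetical stand-in for hGetContents, reading line by line
    lazyLines :: Handle -> IO [String]
    lazyLines h = unsafeInterleaveIO $ do
      eof <- hIsEOF h
      if eof
        then hClose h >> return []   -- closed once fully consumed
        else do
          line <- hGetLine h
          rest <- lazyLines h        -- deferred until demanded
          return (line : rest)

Nothing is read until some caller demands the list; each demand runs just enough I/O to produce the next cell, which is why two consumers sharing the handle would race for the data.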
> It's only when the contents are entirely consumed that the descriptor gets
> closed. Is hGetContents responsible for closing the descriptor, or is it the
> garbage collector? Who closes the descriptor when the contents are read?
The garbage collector closes the handle, as I understand it.
openFile "foo" ReadMode >>= \handle -> (hGetContents handle >>= (\s -> hClose handle >> putStr s)) [2]
This is a classic example of the dangers of hGetContents (and, more generally, of unsafeInterleaveIO). In general, you should use lazy I/O only for "quick and dirty" stuff and avoid it for serious programming. You can get many of the benefits of lazy I/O without the nondeterminacy by using iteratee-based I/O (http://hackage.haskell.org/package/iteratee).

The usual way to deal with this is to force the read in some way, usually by forcing evaluation of the length of the data (let s' = length s in evaluate $ s' `seq` s' -- or something like that).
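A minimal repaired version of [2] along those lines might look like this (a sketch, assuming the "foo" file from the example):

    import Control.Exception (evaluate)
    import System.IO

    main :: IO ()
    main = do
      handle <- openFile "foo" ReadMode
      s <- hGetContents handle
      _ <- evaluate (length s)   -- demands every character, so the
                                 -- whole file is read right here
      hClose handle              -- now safe: nothing is left unread
      putStr s

Forcing the length walks the whole list, so the entire file is in memory before hClose runs; this trades away the streaming behaviour that made hGetContents attractive in the first place.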
> The question most people don't have a good answer to is: when does Haskell
> think it is necessary to do something?
Haskell is actually what manufacturing folks call "just in time": things are evaluated when they are needed. Usually this means that when you output something, anything needed to compute that output will be done then. The exceptions are things like Control.Exception.evaluate (which you can treat as doing output, but without *actually* outputting anything), mentioned above. You can also indicate that some computation must be evaluated before another by means of Prelude.seq, you can declare a constructor field as strict by prefixing it with an exclamation mark (so the runtime will always evaluate a computation before binding it), and with the BangPatterns extension you can declare a pattern-match binding as strict the same way.

Be aware that in most cases, evaluating a computation takes it to "weak head normal form" (WHNF), which means that (as one would expect from a lazy language) only the minimum amount of evaluation is done. If nothing else forces evaluation, the computation is evaluated to the point of its top-level constructor and no further. You can think of it this way: all expressions in Haskell are represented by "thunks" (little chunks of code), and evaluation replaces the outermost thunk in an expression with the result of running it. So if we have an expression @(@[@a,@b], @(Foo @(Bar @d))) (where a @ precedes a sub-expression which is unevaluated, i.e. a thunk), reduction to WHNF removes the outermost (leftmost, here) @ by evaluating the tuple constructor while leaving the elements of the tuple unevaluated. If you need to force evaluation further, take a look at Control.DeepSeq (http://hackage.haskell.org/package/deepseq).

The upshot of the above is that you can determine the order of evaluation by working backwards from output computations. It may be a partial ordering, because when there are multiple independent computations required by another computation, the order in which they are evaluated is undefined. In practice this is usually unimportant, because in pure code there is by definition no observable difference between evaluation orders in those cases (this property is technically called "referential transparency"); but when unsafeInterleaveIO is used (as with hGetContents), it allows pure code to behave indeterminately (it violates referential transparency). This is why it is "unsafe" (and why hGetContents is thereby unsafe), and why mechanisms like Control.Exception.evaluate and seq are provided.

--
brandon s. allbery [linux,solaris,freebsd,perl]      allbery.b@gmail.com
system administrator [openafs,heimdal,too many hats]            kf8nh
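As a small, runnable illustration of WHNF (a minimal sketch, assuming GHC): evaluate forces only the outermost constructor, so the undefined parts below are never touched:

    import Control.Exception (evaluate)

    main :: IO ()
    main = do
      let pair = ([undefined, undefined] :: [Int], 42 :: Int)
      _ <- evaluate pair        -- forces only the (,) constructor
      _ <- evaluate (fst pair)  -- forces only the outer (:), not the
                                -- undefined elements inside
      print (snd pair)          -- prints 42; nothing blows up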

On Tue, 2011-02-15 at 14:06 -0500, Brandon S Allbery KF8NH wrote:
>> It's only when the contents are entirely consumed that the descriptor gets
>> closed. Is hGetContents responsible for closing the descriptor, or is it
>> the garbage collector? Who closes the descriptor when the contents are read?
> The garbage collector closes the handle, as I understand it.
The handle is actually closed as soon as you read all the way to the end of the file. However, because reading is done as a side effect of forcing a lazy, supposedly-pure value, it's hard to predict when that will happen, and apparently unrelated changes in a different part of the program can cause it to never happen, so you leak file handles.

In a way, it's analogous to the situation with garbage collection and closing file handles in finalizers; but the details are different, and the unpredictable file closing comes from lazy evaluation rather than garbage collection.

--
Chris Smith
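A hypothetical sketch of the leak being described (firstLine is made up for illustration): the consumer never demands the end of the file, so the read-to-EOF close never fires:

    import System.IO

    firstLine :: FilePath -> IO String
    firstLine path = do
      h <- openFile path ReadMode
      s <- hGetContents h
      return (takeWhile (/= '\n') s)
      -- for a multi-line file, EOF is never reached, so h stays
      -- semi-closed until its finalizer runs (if it ever does)

Call that in a loop over many files and the process runs out of descriptors, even though every returned string "looks" finished.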

> In a way, it's analogous to the situation with garbage collection and
> closing file handles in finalizers; but the details are different and the
> unpredictable file closing comes from lazy evaluation rather than garbage
> collection.
Except that lazy evaluation can affect *when* the data becomes garbage! ;-)

Brandon S Allbery KF8NH writes:
openFile "foo" ReadMode >>= \handle -> (hGetContents handle >>= (\s -> hClose handle >> putStr s)) [2]
> This is a classic example of the dangers of hGetContents (and, more
> generally, of unsafeInterleaveIO).
Which makes me wonder: why isn't it an error to hClose a semi-closed handle as well? The reason that springs to mind is that most systems (or at least Linux) have, for some unfathomable reason, an arbitrary, low, and fixed limit on the number of open files, and it is therefore sometimes necessary or desirable to close files before opening new ones.
> In general, you should use lazy I/O only for "quick and dirty" stuff and
> avoid it for serious programming.
I must admit that I use lazy I/O all the time - usually in the form of 'readFile' rather than 'hGetContents'.
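That idiom is about as small as lazy I/O gets; for instance (assuming a file foo as in the examples above):

    main :: IO ()
    main = readFile "foo" >>= putStr

readFile opens the handle itself, so there is no handle for the programmer to close too early, which removes the most common source of [2]-style surprises; the leak Chris describes is still possible if the contents are never fully demanded.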
> You can get many of the benefits of lazy I/O without the nondeterminacy by
> using iteratee-based I/O (http://hackage.haskell.org/package/iteratee).
I think it is fair to say that iteratees are a bit more involved.

-k
--
If I haven't seen further, it is by standing in the footprints of giants
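For comparison, here is not the iteratee package itself but a hand-rolled sketch of the same idea (foldLines is made up for illustration): the reader drives a strict consumer, so the handle's lifetime is explicit and no lazy I/O escapes withFile:

    {-# LANGUAGE BangPatterns #-}
    import System.IO

    foldLines :: (a -> String -> a) -> a -> FilePath -> IO a
    foldLines step z path = withFile path ReadMode (go z)
      where
        go !acc h = do
          eof <- hIsEOF h
          if eof
            then return acc
            else do
              l <- hGetLine h
              go (step acc l) h

    -- e.g. count the lines of "foo" without ever holding it all
    main :: IO ()
    main = foldLines (\n _ -> n + 1) (0 :: Int) "foo" >>= print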

Brandon S Allbery KF8NH wrote:
> [...] You can also declare a constructor field as strict by prefixing it
> with an exclamation mark (so the runtime will always evaluate a computation
> before binding it), and with the BangPatterns extension you can declare a
> pattern-match binding as strict the same way.
Note that pattern matches are strict by default. In fact, a pattern match is the preferred way to force evaluation. Bang patterns only make sure that variables and wildcards (which wouldn't be evaluated otherwise) in a pattern match are evaluated to WHNF:

    case expr of
      Just x  -> ...
      Nothing -> ...

This is strict in the Maybe constructors, but non-strict in the argument of Just. When using a bang pattern,

    Just (!x) -> ...

the match is also strict in the argument of Just. A bang pattern is really just a shortcut for using 'seq':

    Just x -> seq x $ ...

or, as some people prefer to write it:

    Just x | seq x True -> ...

Greets,
Ertugrul

--
nightmare = unsafePerformIO (getWrongWife >>= sex)
http://ertes.de/
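A runnable version of that difference (a sketch; lazyMatch and strictMatch are made-up names) shows the plain match forcing only the Maybe constructor, while the banged one also forces the argument of Just:

    {-# LANGUAGE BangPatterns #-}

    lazyMatch, strictMatch :: Maybe Int -> String
    lazyMatch   m = case m of { Just _   -> "just"; Nothing -> "nothing" }
    strictMatch m = case m of { Just !_x -> "just"; Nothing -> "nothing" }

    main :: IO ()
    main = do
      putStrLn (lazyMatch (Just undefined))    -- fine: Just's argument
                                               -- is never forced
      putStrLn (strictMatch (Just undefined))  -- throws: the bang
                                               -- forces undefined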
participants (6):
- Andrew Coppin
- Brandon S Allbery KF8NH
- Chris Smith
- Ertugrul Soeylemez
- Ketil Malde
- Rafael Cunha de Almeida