Unsafe hGetContents

Simon Marlow wrote:
Ah yes, if you have two lazy input streams both referring to the same underlying stream, that is enough to demonstrate a problem. As for whether Oleg's example is within the rules, it depends whether you consider fdToHandle as "unsafe"
I wasn't aware of the rules. Fortunately, UNIX (FreeBSD and Linux) gives plenty of opportunities to shoot oneself in the foot. Here is the code from the earlier message, without the offending fdToHandle:
{- Haskell98! -}
module Main where
import System.IO
-- f1 and f2 are both pure functions, with the pure type.
-- Both compute the result of the subtraction e1 - e2.
-- The only difference between them is the sequence of
-- evaluating their arguments, e1 `seq` e2 vs. e2 `seq` e1.
-- For really pure functions, that difference should not be observable.

f1, f2 :: Int -> Int -> Int
f1 e1 e2 = e1 `seq` e2 `seq` e1 - e2
f2 e1 e2 = e2 `seq` e1 `seq` e1 - e2

read_int s = read . head . words $ s

main = do
  let h1 = stdin
  h2 <- openFile "/dev/stdin" ReadMode
  s1 <- hGetContents h1
  s2 <- hGetContents h2
  -- print $ f1 (read_int s1) (read_int s2)
  print $ f2 (read_int s1) (read_int s2)
It exhibits the same behavior that was described in
http://www.haskell.org/pipermail/haskell/2009-March/021064.html
I think Windows may have something similar.
The reason it's hard is that to demonstrate a difference you have to get the lazy I/O to commute with some other I/O, and GHC will never do that.
The keyword here is GHC. I may well believe that GHC is able to divine the programmer's true intent and so it always does the right thing. But writing in the language standard ``do what version x.y.z of GHC does'' does not seem very appropriate, or helpful to other implementors.
Haskell's IO library is carefully designed to not run into this problem on its own. It's normally not possible to get two Handles with the same FD...
Is this behavior specified somewhere, or is this just an artifact of a particular GHC implementation?

On Sat, 2009-10-10 at 02:51 -0700, oleg@okmij.org wrote:
The reason it's hard is that to demonstrate a difference you have to get the lazy I/O to commute with some other I/O, and GHC will never do that.
The keyword here is GHC. I may well believe that GHC is able to divine the programmer's true intent and so it always does the right thing. But writing in the language standard ``do what version x.y.z of GHC does'' does not seem very appropriate, or helpful to other implementors.
With access to unsafeInterleaveIO it's fairly straightforward to show that it is non-deterministic. These programs that bypass the safety mechanisms on hGetContents just get us back to having access to the non-deterministic semantics of unsafeInterleaveIO.
Haskell's IO library is carefully designed to not run into this problem on its own. It's normally not possible to get two Handles with the same FD...
Is this behavior specified somewhere, or is this just an artifact of a particular GHC implementation?
It is in the Haskell 98 report, in the design of the IO library. It does not mention FDs, of course. The IO/Handle functions it provides give no (portable) way to obtain two read handles on the same OS file descriptor. The hGetContents behaviour of semi-closing is to stop you from getting two lazy lists of the same read Handle.

There's nothing semantically wrong with bypassing those restrictions (e.g. openFile "/dev/fd/0"); it just means you end up with a non-deterministic IO program, which is something we typically try to avoid.

I am a bit perplexed by this whole discussion. It seems to come down to saying that unsafeInterleaveIO is non-deterministic and that things implemented on top are also non-deterministic. The standard IO library puts up some barriers to restrict the non-determinism, but if you walk around the barrier then you can still find it. It's not clear to me what is supposed to be surprising or alarming here.

Duncan
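(To make Duncan's point about unsafeInterleaveIO concrete, here is a minimal sketch; it is my illustration rather than code from the thread. Which of "A" and "B" prints first depends on the order in which the implementation forces the arguments of (+), an order the language report does not fix:

    import System.IO.Unsafe (unsafeInterleaveIO)

    main :: IO ()
    main = do
      -- Each action's side effect runs only when its result is demanded.
      a <- unsafeInterleaveIO (putStrLn "A" >> return (1 :: Int))
      b <- unsafeInterleaveIO (putStrLn "B" >> return (2 :: Int))
      -- Forcing the sum demands a and b in whatever order the compiled
      -- code for (+) happens to use.
      print (a + b)

Under GHC this typically prints A, then B, then 3, but an implementation that forced the right argument first would print B before A.)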

Hello,
well, I think that the fact that we seem to have a program context
that can distinguish "f1" from "f2" is worth discussing because I
would have thought that in a pure language they are interchangeable.
The question is, does the context in Oleg's example really distinguish
between "f1" and "f2"? You seem to be saying that this is not the
case: in both cases you end up with the same non-deterministic
program that reads two numbers from the standard input and subtracts
them but you can't assume anything about the order in which the
numbers are extracted from the input---it is merely an artifact of the
GHC implementation that with "f1" the subtraction always happens the
one way, and with "f2" it happens the other way.
I can (sort of) buy this argument, after all, it is quite similar to
what happens with asynchronous exceptions (f1 (error "1") (error "2")
vs f2 (error "1") (error "2")). Still, the whole thing does not
"smell right": there is some impurity going on here, and trying to
offload the problem onto the IO monad only makes reasoning about IO
computations even harder (and it is pretty hard to start with). So,
discussion and alternative solutions should be strongly encouraged, I
think.
-Iavor

Iavor Diatchki wrote:
Hello,
well, I think that the fact that we seem to have a program context that can distinguish "f1" from "f2" is worth discussing because I would have thought that in a pure language they are interchangeable. The question is, does the context in Oleg's example really distinguish between "f1" and "f2"? You seem to be saying that this is not the case: in both cases you end up with the same non-deterministic program that reads two numbers from the standard input and subtracts them but you can't assume anything about the order in which the numbers are extracted from the input---it is merely an artifact of the GHC implementation that with "f1" the subtraction always happens the one way, and with "f2" it happens the other way.
I can (sort of) buy this argument, after all, it is quite similar to what happens with asynchronous exceptions (f1 (error "1") (error "2") vs f2 (error "1") (error "2")). Still, the whole thing does not "smell right": there is some impurity going on here, and trying to offload the problem onto the IO monad only makes reasoning about IO computations even harder (and it is pretty hard to start with). So, discussion and alternative solutions should be strongly encouraged, I think.
To put it in different words, here is an elaboration on what exactly the non-determinism argument is. Consider programs foo1 and foo2 defined as

    foo :: (a -> b -> c) -> IO String
    foo f = Control.Exception.catch
              (evaluate (f (error "1") (error "2")) >> return "3")
              (\(ErrorCall s) -> return s)

    foo1 = foo f1 where f1 x y = x `seq` y `seq` ()
    foo2 = foo f2 where f2 x y = y `seq` x `seq` ()

Knowing how exceptions and seq behave in GHC, it is straightforward to prove that

    foo1 = return "1"
    foo2 = return "2"

which clearly violates referential transparency. This is bad, so the idea is to disallow the proof. In particular, the idea is that referential transparency can be restored if we only allow proofs that work for all evaluation orders, which is equivalent to introducing non-determinism. In other words, we are only allowed to prove

    foo1 = return "1" or return "2"
    foo2 = return "1" or return "2"

Moreover, we can push the non-determinism into the IO type and pretend that pure functions A -> B are semantically lifted to Nondet A -> Nondet B with some kind of fmap.

The same goes for hGetContents: if you use it twice on the same handle, you're only allowed to prove non-deterministic behavior, which is not very useful if you want a deterministic program. But you are allowed to prove deterministic results if you use it with appropriate caution.

In other words, the language semantics guarantees less than GHC actually does. In particular, the semantics only allows reasoning that is independent of the evaluation order, and this means treating IO as non-deterministic in certain cases.

Regards,
apfelmus

--
http://apfelmus.nfshost.com
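(For readers who want to run apfelmus's example: a self-contained transcription, where the qualified import and main are my additions:

    import qualified Control.Exception as E
    import Control.Exception (ErrorCall (..), evaluate)

    foo :: (a -> b -> c) -> IO String
    foo f = (evaluate (f (error "1") (error "2")) >> return "3")
              `E.catch` \(ErrorCall s) -> return s

    foo1, foo2 :: IO String
    foo1 = foo (\x y -> x `seq` y `seq` ())
    foo2 = foo (\x y -> y `seq` x `seq` ())

    main :: IO ()
    main = do
      foo1 >>= putStrLn   -- under GHC: prints 1
      foo2 >>= putStrLn   -- under GHC: prints 2

The two programs differ only in the order of two seqs applied to the same arguments, yet under a fixed evaluation order they are distinguishable, which is the violation apfelmus spells out.)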

On Sat, 2009-10-10 at 10:59 -0700, Iavor Diatchki wrote:
Hello,
well, I think that the fact that we seem to have a program context that can distinguish "f1" from "f2" is worth discussing because I would have thought that in a pure language they are interchangeable.
Crucially they are contexts in an IO program.
The question is, does the context in Oleg's example really distinguish between "f1" and "f2"? You seem to be saying that this is not the case: in both cases you end up with the same non-deterministic program that reads two numbers from the standard input and subtracts them but you can't assume anything about the order in which the numbers are extracted from the input---it is merely an artifact of the GHC implementation that with "f1" the subtraction always happens the one way, and with "f2" it happens the other way.
Right.
I can (sort of) buy this argument, after all, it is quite similar to what happens with asynchronous exceptions (f1 (error "1") (error "2") vs f2 (error "1") (error "2")). Still, the whole thing does not "smell right": there is some impurity going on here,
No, there's no impurity.
and trying to offload the problem onto the IO monad only makes reasoning about IO computations even harder (and it is pretty hard to start with).
Sure, reasoning about non-deterministic IO programs is tricky. But then nobody here is advocating writing non-deterministic IO programs. Lazy IO is sensible and useful when the non-determinism doesn't make any difference to the results.

Let's look at a simplified case: instead of general IO and the whole OS API at our disposal, let's look at the case of a single thread of control and mutable variables, specifically the ST monad. We can construct a semantics for this based on a sequence of read/write events for the mutable variables. The ST monad bind gives guarantees about the ordering of the events. So ST programs are deterministic.

    do x <- readSTRef v
       writeSTRef v (x+1)
       writeSTRef v (x+2)

The semantics of this ST program is the trace

    read(v,x) write(v,x+1) write(v,x+2)

We could introduce non-determinism to this system by allowing read/write events to be arbitrarily interleaved with other subsequent events:

    do x <- readSTRef v
       unsafeInterleaveST $ writeSTRef v (x+1)
       writeSTRef v (x+2)

Now we can have two traces:

    read(v,x) write(v,x+1) write(v,x+2)
or
    read(v,x) write(v,x+2) write(v,x+1)

The semantics is the set of traces, in this case just the two. Of course with this modified ST system we cannot allow a pure runST, because we've got non-deterministic ST programs (or we could make it pure by returning the full set of traces). But it'd be OK for IO.

Now working with and reasoning about these non-deterministic ST programs is tricky. Depending on the implementation choice for the interleaving we'll get different results, and under some implementation choices we'll be able to influence the result by coding pure bits of the program differently. None of this changes the semantics, since the semantics just says any possible interleaving is OK.

Another interesting thing to note is that we can limit the interleaving somewhat by forcing deferred events to come before other subsequent events:

    do writeSTRef v 1
       x <- unsafeInterleaveST $ readSTRef v
       writeSTRef v 2
       evaluate x
       writeSTRef v 3

So in the traces for this program, x can have the value 1 or 2 but not 3, because of the partial order on events that we impose using evaluate.

We can also do something like Oleg's example (simplified to only a single getChar rather than reading the whole input stream):

    do let fileContent = "hello"
       seekPoint <- newSTRef 0
       let getChar = do
             s <- readSTRef seekPoint
             writeSTRef seekPoint (s+1)
             return (fileContent !! s)
       s1 <- unsafeInterleaveST getChar
       s2 <- unsafeInterleaveST getChar
       -- evaluate (f1 s1 s2)
       evaluate (f2 s1 s2)

Under some implementations of the interleaving we can expect to get different event interleavings for the f1 program vs the f2 program. So we apparently have a pure function influencing the event ordering. Of course the semantics says we have both event orderings anyway.

It is also possible to write ST programs that produce the same result irrespective of the event interleaving. These programs might actually be useful. For example:

    do writeSTRef v 1
       x <- unsafeInterleaveST $ readSTRef v
       ... -- no more writes to v

So here we allow the read from v to be performed at any time. We still have loads of different possible traces, but the value of x is the same in each, because the v variable is never written to again.

In IO, with the full OS API and other programs running concurrently, it is harder to reason about. But we can see similar possibilities for non-deterministic primitives where we can still get a deterministic result.
One of those is if we read from a mutable variable (a file) and can be sure that there are no other writes to that mutable variable. It's further complicated by the fact that we don't directly read from files but rather from a file descriptor, which is a mutable variable containing a seek point into a file. When we read from a file we mutate the seek point. So we also have to guarantee that we are not doing that more than once on the same buffer/seek point (OS FD).

If we can assure ourselves of all that, then we should be able to show that the result is the same irrespective of the event interleavings (because we've banned the "interfering" events that would influence the result). Of course if we do not maintain those side conditions then we just get back to the non-deterministic hard-to-reason-about IO programs.

Duncan
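(Duncan's ST sketches are pseudocode; the following is a compilable transcription of the last, Oleg-style example. It is my code, using unsafeInterleaveST from Control.Monad.ST.Unsafe, and it deliberately uses a pure runST so that the non-determinism becomes observable from pure code:

    import Control.Monad.ST (runST)
    import Control.Monad.ST.Unsafe (unsafeInterleaveST)
    import Data.STRef (newSTRef, readSTRef, writeSTRef)

    -- Oleg's f1/f2, specialised to return the pair of characters read.
    f1, f2 :: Char -> Char -> (Char, Char)
    f1 a b = a `seq` b `seq` (a, b)
    f2 a b = b `seq` a `seq` (a, b)

    -- Two lazily-deferred reads sharing one seek point.
    demo :: (Char -> Char -> (Char, Char)) -> (Char, Char)
    demo f = runST $ do
      seekPoint <- newSTRef 0
      let fileContent = "hello"
          getChar' = do
            s <- readSTRef seekPoint
            writeSTRef seekPoint (s + 1)
            return (fileContent !! s)
      s1 <- unsafeInterleaveST getChar'
      s2 <- unsafeInterleaveST getChar'
      return (f s1 s2)

    main :: IO ()
    main = do
      print (demo f1)   -- under GHC, typically ('h','e')
      print (demo f2)   -- under GHC, typically ('e','h')

The seq order inside a "pure" function selects which read event happens first: exactly the interleaving choice Duncan describes, and exactly why a pure runST is not allowed for the non-deterministic variant.)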

Hmm, don't you think forkIO deserves some of the same complaints as unsafeInterleaveIO? Things happen in a nondeterministic order!

I think what irritates us about unsafeInterleaveIO is that it's IO that tinkers with the internals of the Haskell evaluation system. The OS can't do it: in a C program it might, because there's libc and debuggers and all kinds of things that understand compiled C to some extent. But the Haskell runtime system is pretty much obfuscated to anyone except ourselves. This obfuscation maintains its conceptual purity to a greater extent than is really guaranteed by the standards. The obfuscation is supported in our minds by the fact that functions (->) cannot be compared for equality or deconstructed or serialized in any way, only applied.

forkIO causes IO to happen in a nondeterministic order. So does unsafeInterleaveIO. But for unsafeInterleaveIO, the nondeterminism depends in part on how pure functions are written: partly because there is a compiler that makes arbitrary choices, and partly because of the strictness properties of the functions. This feels disconcerting to us.

And worse: I am not sure whether forkIO has a formal guarantee that its IO will complete, but we tend to assume that it will, sooner or later; unsafeInterleaveIO might not happen at all, and frequently does not, depending on how the pure functions are written. It's disconcerting. It can affect how we choose to write our pure code, the same way that efficiency and memory concerns can.

But if 'catch' can catch a different exception depending, even conceptually, on the phase of the moon, it is a similarly strange stretch to imagine unsafeInterleaveIO doing so. It plays with chronology (like forkIO does) and with the ways Haskell functions are written (like 'catch' does) at the same time. A result is that it makes a lot of people confused when they do something they didn't intend with it.

Also, it's a powerful enough tool that when you want to replace its formal nondeterminism with something more precise, you may have quite a bit of work cut out for you, restructuring your code (like Darcs did, IIRC).

-Isaac
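(A minimal illustration of Isaac's forkIO point — my sketch, not code from the thread:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

    main :: IO ()
    main = do
      done <- newEmptyMVar
      _ <- forkIO $ do
             putStrLn "from the forked thread"
             putMVar done ()
      putStrLn "from the main thread"
      -- Which line prints first is up to the scheduler, not the program
      -- text; the MVar only ensures the forked thread finishes at all.
      takeMVar done

The ordering nondeterminism here comes from the runtime scheduler, whereas with unsafeInterleaveIO it additionally depends on how the pure parts of the program are written, which is Isaac's distinction.)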

On 10/10/2009 18:59, Iavor Diatchki wrote:
Hello,
well, I think that the fact that we seem to have a program context that can distinguish "f1" from "f2" is worth discussing because I would have thought that in a pure language they are interchangeable. The question is, does the context in Oleg's example really distinguish between "f1" and "f2"? You seem to be saying that this is not the case: in both cases you end up with the same non-deterministic program that reads two numbers from the standard input and subtracts them but you can't assume anything about the order in which the numbers are extracted from the input---it is merely an artifact of the GHC implementation that with "f1" the subtraction always happens the one way, and with "f2" it happens the other way.
I can (sort of) buy this argument, after all, it is quite similar to what happens with asynchronous exceptions (f1 (error "1") (error "2") vs f2 (error "1") (error "2")). Still, the whole thing does not "smell right": there is some impurity going on here, and trying to offload the problem onto the IO monad only makes reasoning about IO computations even harder (and it is pretty hard to start with). So, discussion and alternative solutions should be strongly encouraged, I think.
Duncan has found a definition of hGetContents that explains why it has surprising behaviour, and that's very nice because it lets us write the compilers that we want to write, and we get to tell the users to stop moaning because the strange behaviour they're experiencing is allowed according to the spec. :-)

Of course, the problem is that users don't want the hGetContents that has non-deterministic semantics, they want a deterministic one. And for that, they want to fix the evaluation order (or something). The obvious drawback with fixing the evaluation order is that it ties the hands of the compiler developers, and makes a fundamental change to the language definition.

Things will get a lot worse in the future as we experiment with more elaborate compiler optimisations and evaluation strategies. I predict that eventually we'll have to ditch hGetContents, at least in its current generality.

Cheers,
Simon

On Tue, 2009-10-20 at 13:58 +0100, Simon Marlow wrote:
Duncan has found a definition of hGetContents that explains why it has surprising behaviour, and that's very nice because it lets us write the compilers that we want to write, and we get to tell the users to stop moaning because the strange behaviour they're experiencing is allowed according to the spec. :-)
:-)
Of course, the problem is that users don't want the hGetContents that has non-deterministic semantics, they want a deterministic one. And for that, they want to fix the evaluation order (or something). The obvious drawback with fixing the evaluation order is that it ties the hands of the compiler developers, and makes a fundamental change to the language definition.
I've not yet seen anyone put forward any practical programs that have confusing behaviour but were not written deliberately to be as wacky as possible and to avoid all the safety mechanisms.

The standard use case for hGetContents is reading a read-only file, or stdin, where it really does not matter when the read actions occur with respect to other IO actions. You could do it in parallel rather than on-demand and it'd still be OK.

There's the beginner mistake where people don't notice that they're not actually demanding anything before closing the file; that's nothing new, of course.

Duncan
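(For readers who have not hit the beginner mistake Duncan mentions, a minimal sketch — my illustration; "input.txt" is a hypothetical file:

    import System.IO

    main :: IO ()
    main = do
      h <- openFile "input.txt" ReadMode
      s <- hGetContents h
      hClose h      -- the semi-closed handle is closed before s is demanded
      putStrLn s    -- with GHC this prints an empty line, not the file

Forcing s before the hClose, for example with evaluate (length s), makes the program behave as intended.)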

On 20/10/2009 15:24, Duncan Coutts wrote:
On Tue, 2009-10-20 at 13:58 +0100, Simon Marlow wrote:
Duncan has found a definition of hGetContents that explains why it has surprising behaviour, and that's very nice because it lets us write the compilers that we want to write, and we get to tell the users to stop moaning because the strange behaviour they're experiencing is allowed according to the spec. :-)
:-)
Of course, the problem is that users don't want the hGetContents that has non-deterministic semantics, they want a deterministic one. And for that, they want to fix the evaluation order (or something). The obvious drawback with fixing the evaluation order is that it ties the hands of the compiler developers, and makes a fundamental change to the language definition.
I've not yet seen anyone put forward any practical programs that have confusing behaviour but were not written deliberately to be as wacky as possible and to avoid all the safety mechanisms.
The standard use case for hGetContents is reading a read-only file, or stdin where it really does not matter when the read actions occur with respect to other IO actions. You could do it in parallel rather than on-demand and it'd still be ok.
There's the beginner mistake where people don't notice that they're not actually demanding anything before closing the file, that's nothing new of course.
If the parallel runtime reads files eagerly, that might hide a resource problem that would occur when the program is run on a sequential system, for example.

Cheers,
Simon

On Tue, 2009-10-20 at 15:45 +0100, Simon Marlow wrote:
I've not yet seen anyone put forward any practical programs that have confusing behaviour but were not written deliberately to be as wacky as possible and to avoid all the safety mechanisms.
The standard use case for hGetContents is reading a read-only file, or stdin where it really does not matter when the read actions occur with respect to other IO actions. You could do it in parallel rather than on-demand and it'd still be ok.
There's the beginner mistake where people don't notice that they're not actually demanding anything before closing the file, that's nothing new of course.
If the parallel runtime reads files eagerly, that might hide a resource problem that would occur when the program is run on a sequential system, for example.
That's true, but we have the same problem without doing any IO. There are many ways of generating large amounts of data.

Duncan
participants (6)

- Duncan Coutts
- Heinrich Apfelmus
- Iavor Diatchki
- Isaac Dupree
- oleg@okmij.org
- Simon Marlow