
On Thu, May 24, 2007 at 10:17:49PM +0100, Jules Bean wrote:
I've been having something of a discussion on #haskell about this but I had to go off-line and, in any case, it's a complicated issue, and I may be able to be more clear in an email.
The key point under discussion was what kind of interface the HTTP library should expose: synchronous, asynchronous? Lazy, strict?
As someone just pointed out, "You don't like lazy IO, do you?". Well, that's a fair characterisation. I think unsafe lazy IO is a very very cute hack, and I'm in awe of some of the performance results which have been achieved, but I think the disadvantages are underestimated.
Of course, there is a potential ambiguity in the phrase 'lazy IO'. You might quite reasonably interpret 'lazy IO' to refer to any programming style in which the IO is performed 'as needed' by the rest of the program. So, to be clear, I'm not raising a warning flag about that practice in general, which is a very important programming technique. I'm raising a bit of a warning flag over the particular practice of achieving this in a way which conceals IO inside thunks which have no IO in their types: i.e. using unsafeInterleaveIO or even unsafePerformIO.
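To make the flagged technique concrete, here is a toy sketch of how unsafeInterleaveIO hides a side effect inside an ordinary-looking list. The IORef counter is a stand-in for a real file or socket, and lazyReads is a made-up name, not part of any library:

```haskell
import System.IO.Unsafe (unsafeInterleaveIO)
import Data.IORef

-- A lazily-produced list whose elements are only "read" when forced.
-- Each cons cell conceals an IO action, yet the result type is just [Int].
lazyReads :: IORef Int -> Int -> IO [Int]
lazyReads _   0 = return []
lazyReads ref n = unsafeInterleaveIO $ do
  x <- readIORef ref            -- side effect hidden inside a thunk
  modifyIORef ref (+ 1)
  rest <- lazyReads ref (n - 1)
  return (x : rest)

main :: IO ()
main = do
  ref <- newIORef 0
  xs <- lazyReads ref 3
  -- No reads have happened yet; they occur as xs is demanded.
  print (take 2 xs)             -- forces only the first two reads
```

Note that the third "read" never runs at all unless someone forces the tail of the list; that is exactly the "as needed" behaviour, and exactly the concealment, under discussion.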
Why is this a bad idea? Normally evaluating a haskell expression can have no side-effects. This is important because, in a lazy language, you never quite know[*] when something's going to be evaluated. Or if it will. Side-effects, on the other hand, are important things (like launching nuclear missiles) and it's rather nice to be precise about when they happen. One particular kind of side effect which is slightly less cataclysmic (only slightly) is the throwing of an exception. If pure code, which is normally guaranteed at worst to fail to terminate, can suddenly throw an exception from somewhere deep in its midst, then it's extremely hard for your program to work out how far it has got, what it has done, what it hasn't done, and what it should do to recover. On the other hand, no failure may occur, but the data may never be fully processed, meaning that the IO is never 'finished' and valuable system resources are locked up forever. (Consider a naive program which reads only the first 1000 bytes of an XML document before hitting an unrecoverable parse failure. The socket will never be closed, and system resources will be consumed permanently.)
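The hazard can be sketched with a pure error standing in for an I/O failure: the exception lives inside a thunk and surfaces only when, and if, the value is forced, far from any visible IO. (brokenList and the error message are illustrative, not from any library.)

```haskell
import Control.Exception (ErrorCall, evaluate, try)

-- A pure value concealing a failure deep in its midst, much as
-- lazily-read data can conceal a read error.
brokenList :: [Int]
brokenList = [1, 2, error "socket died mid-read"]

main :: IO ()
main = do
  print (take 2 brokenList)      -- fine: the bad thunk is never forced
  -- The failure only appears when evaluation reaches it:
  r <- try (evaluate (sum brokenList)) :: IO (Either ErrorCall Int)
  case r of
    Left e  -> putStrLn ("caught during pure evaluation: " ++ show e)
    Right n -> print n
```

With unsafe lazy IO the situation is the same, except that the hidden failure is a genuine I/O error rather than a call to error, and the code that trips over it may have no idea any IO was ever involved.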
Yes, obviously lazy IO needs to be done with care, but pure functions always consume resources, and lazy IO is not unique in this regard. It does change the nature of the resources consumed, but that's all. No function can "at worst" fail to terminate: any function can fail with error, or run out of stack space. It seems that your real problem here is that sockets aren't freed when programs exit. I suppose that's a potential problem, but it doesn't seem like a critical one. I assume firefox has already permanently consumed gobs of system resources, and it hasn't bothered me yet... except for the memory, and that's fortunately not permanent. (Incidentally, couldn't atexit be used to clean up sockets in case of unclean exiting?) Obviously lazy IO can only be used with IO operations that are considered "safe" by the programmer (usually read operations), but for those operations, when the programmer declares that he doesn't care when the reading is actually done, lazy IO is a beautiful thing. In particular, it allows the writing of modular reusable functions. That's actually a Good Thing... and as long as write operations are the only ones that require cleanup, it's also perfectly safe.
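A sketch of the "modular reusable functions" claim: interact wires lazy stdin to lazy stdout, so a pure transformer, written once, processes arbitrarily large input with no explicit read loop. (numbered is an illustrative name, not a library function.)

```haskell
-- A pure stream transformer: prepend line numbers.
numbered :: String -> String
numbered = unlines . zipWith tag [1 :: Int ..] . lines
  where
    tag n l = show n ++ "\t" ++ l

-- Lazy IO makes the plumbing a one-liner: input is read only
-- as the transformer demands it, and output streams as it goes.
main :: IO ()
main = interact numbered
```

The same pure function works equally well over a string constant, a file read with readFile, or lazily-read socket data; that reuse is the point.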
Trivial programs may be perfectly content to simply bail out if an exception is thrown. That's very sensible behaviour for a small 'pluggable' application (most of the various unix command line utilities bail out silently or nearly silently on SIGPIPE, for example). However, this is clearly not acceptable behaviour in a complex program. There may be resources which need to be released, there may be data which needs saving, there may be reconstruction to be attempted on whatever it was that 'broke'.
Error handling and recovery is hard. Always has been. One of the things that greatly simplifies such issues is knowing "where" exceptions can occur. In haskell they can only occur in the IO monad, and they can only occur in rather specific ways: in most cases, thrown by particular IO primitives; they can also be thrown 'at' you by other threads, but as the programmer, that's your problem!
This is irrelevant to the question of lazy IO or not lazy IO. As you say, all errors happen in the IO monad, and that's true with or without lazy IO, since ultimately IO is the only consumer of lazy data. Proper use of bracket catches all errors (modulo bugs in bracket, and signals being thrown... but certainly all calls to error), and you can do that at the top level, if you like. The downside in error checking when using lazy IO is just that the part of your program where errors pop up becomes less deterministic. However, since errors can happen at any time even without lazy IO, this is only a question of probability of errors showing up at certain times (think out of memory conditions, signals thrown, etc). Well-designed programs will be written robustly. (Yes, that's a truism, but it's one you seem to be forgetting.)
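For what it's worth, "proper use of bracket" does take some care when the bracketed action hands back lazy data. A sketch of the pitfall and the usual fix (readBroken and readStrict are illustrative names; behaviour of the broken version is what GHC typically does, namely silent truncation):

```haskell
import Control.Exception (bracket)
import System.IO

-- Pitfall: hGetContents returns a thunk, so hClose runs before any
-- data is actually demanded; the string comes back truncated.
readBroken :: FilePath -> IO String
readBroken path = bracket (openFile path ReadMode) hClose hGetContents

-- Fix: force the contents while the handle is still open, so the
-- cleanup in bracket really does run after all the reading.
readStrict :: FilePath -> IO String
readStrict path =
  bracket (openFile path ReadMode) hClose $ \h -> do
    s <- hGetContents h
    length s `seq` return s
```

The same forcing trick applies to any bracketed lazy read; it trades away the laziness, which is precisely the trade-off this thread is about.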
Ok. Five paragraphs of advocacy is plenty. If anyone is still reading now, then they must be either really interested in this problem, or really bored. Either way, it's good to have you with me! These issues are explained rather more elegantly by Oleg in [1]. ... Given these three pairs of options, what need is there for an unsafe lazy GET? What niche does it fill that is not equally well filled by one of these?
Program conciseness, perhaps. The kind of haskell one-liner whose performance makes us so (justly) proud. In isolation, though, I don't find that a convincing argument; not with the disadvantages also taken into account. The strongest argument, then, is that you have a 'stream processing' function, written 'naively' on [Word8] or Lazy ByteString, which wants to run as data is available, yet without wasting space. I'm inclined to feel that, if you really want to be able to run over 650M files, and you want to run as data is available, then in practice you want to be able to give feedback to the rest of your application on your progress so far; i.e., L.ByteString -> a is actually too simple a type anyway.
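A sketch of what such progress feedback might look like over a lazy ByteString: the consumer walks the chunk list and reports as each chunk is demanded. The names, and 'body' standing in for a lazily-downloaded response, are illustrative:

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import System.IO (hPutStrLn, stderr)

-- Consume a lazy ByteString chunk by chunk, reporting progress as
-- each chunk arrives; returns the total byte count.
consumeWithProgress :: L.ByteString -> IO Int
consumeWithProgress body = go 0 (L.toChunks body)
  where
    go :: Int -> [S.ByteString] -> IO Int
    go !n []       = return n
    go !n (c : cs) = do
      let n' = n + S.length c
      hPutStrLn stderr ("received " ++ show n' ++ " bytes so far")
      -- ... hand the chunk to the real consumer here ...
      go n' cs
```

Because the chunks are only produced as go demands them, the progress messages track the actual IO, whichever side of this argument one is on.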
Yes, this is the argument for lazy IO, and it's a valid one. Any adequately powerful interface can be used to implement a lazy IO function, and people will do so, whether or not it makes you happy. It'd be nice to have it in the library itself. Program conciseness is a real issue. Simple effective APIs make for useful libraries, and the simplest API is likely to be the most commonly used. If the simplest API is strict, then that means that there'll most often be *no* feedback until the download is complete. A lazy download means that feedback can be provided instantly, as the data is consumed. True, you need to include some feedback logic in your consumer, but that's where you'll almost certainly want it anyhow. And in many cases the feedback could come for free, in the form of output.
--
David Roundy
Department of Physics
Oregon State University