
On Thu, May 24, 2007 at 10:17:49PM +0100, Jules Bean wrote:
I've been having something of a discussion on #haskell about this but I had to go off-line and, in any case, it's a complicated issue, and I may be able to be more clear in an email.
The key point under discussion was what kind of interface the HTTP library should expose: synchronous, asynchronous? Lazy, strict?
As someone just pointed out, "You don't like lazy IO, do you?". Well, that's a fair characterisation. I think unsafe lazy IO is a very very cute hack, and I'm in awe of some of the performance results which have been achieved, but I think the disadvantages are underestimated.
Of course, there is a potential ambiguity in the phrase 'lazy IO'. You might quite reasonably interpret 'lazy IO' to refer to any programming style in which the IO is performed 'as needed' by the rest of the program. So, to be clear, I'm not raising a warning flag about that practice in general, which is a very important programming technique. I'm raising a bit of a warning flag over the particular practice of achieving this in a way which conceals IO inside thunks which have no IO in their types: i.e. using unsafeInterleaveIO or even unsafePerformIO.
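To make the flagged technique concrete, here is a toy sketch of how unsafeInterleaveIO hides a side effect inside an ordinary-looking list. The IORef counter is a stand-in for a real file or socket, and lazyReads is a made-up name, not part of any library:

```haskell
import System.IO.Unsafe (unsafeInterleaveIO)
import Data.IORef

-- A lazily-produced list whose elements are only "read" when forced.
-- Each cons cell conceals an IO action, yet the result type is just [Int].
lazyReads :: IORef Int -> Int -> IO [Int]
lazyReads _   0 = return []
lazyReads ref n = unsafeInterleaveIO $ do
  x <- readIORef ref            -- side effect hidden inside a thunk
  modifyIORef ref (+ 1)
  rest <- lazyReads ref (n - 1)
  return (x : rest)

main :: IO ()
main = do
  ref <- newIORef 0
  xs <- lazyReads ref 3
  -- No reads have happened yet; they occur as xs is demanded.
  print (take 2 xs)             -- forces only the first two reads
```

Note that the third "read" never runs at all unless someone forces the tail of the list; that is exactly the "as needed" behaviour, and exactly the concealment, under discussion.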
Why is this a bad idea? Normally evaluating a haskell expression can have no side-effects. This is important because, in a lazy language, you never quite know[*] when something's going to be evaluated. Or if it will. Side-effects, on the other hand, are important things (like launching nuclear missiles) and it's rather nice to be precise about when they happen. One particular kind of side effect which is slightly less cataclysmic (only slightly) is the throwing of an exception. If pure code, which is normally guaranteed at worst to fail to terminate, can suddenly throw an exception from somewhere deep in its midst, then it's extremely hard for your program to work out how far it has got, what it has done, what it hasn't done, and what it should do to recover. On the other hand, no failure may occur, but the data may never be fully processed, meaning that the IO is never 'finished' and valuable system resources are locked up forever. (Consider a naive program which reads only the first 1000 bytes of an XML document before hitting an unrecoverable parse failure. The socket will never be closed, and system resources will be consumed permanently.)
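The hazard can be sketched with a pure error standing in for an I/O failure: the exception lives inside a thunk and surfaces only when, and if, the value is forced, far from any visible IO. (brokenList and the error message are illustrative, not from any library.)

```haskell
import Control.Exception (ErrorCall, evaluate, try)

-- A pure value concealing a failure deep in its midst, much as
-- lazily-read data can conceal a read error.
brokenList :: [Int]
brokenList = [1, 2, error "socket died mid-read"]

main :: IO ()
main = do
  print (take 2 brokenList)      -- fine: the bad thunk is never forced
  -- The failure only appears when evaluation reaches it:
  r <- try (evaluate (sum brokenList)) :: IO (Either ErrorCall Int)
  case r of
    Left e  -> putStrLn ("caught during pure evaluation: " ++ show e)
    Right n -> print n
```

With unsafe lazy IO the situation is the same, except that the hidden failure is a genuine I/O error rather than a call to error, and the code that trips over it may have no idea any IO was ever involved.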
Yes, obviously lazy IO needs to be done with care, but pure functions always consume resources, and lazy IO is not unique in this regard. It does change the nature of the resources consumed, but that's all. No function can "at worst" fail to terminate: any function can fail with error, or run out of stack space. It seems that your real problem here is that sockets aren't freed when programs exit. I suppose that's a potential problem, but it doesn't seem like a critical one. I assume firefox has already permanently consumed gobs of system resources, and it hasn't bothered me yet... except for the memory, and that's fortunately not permanent. (Incidentally, couldn't atexit be used to clean up sockets in case of unclean exiting?) Obviously lazy IO can only be used with IO operations that are considered "safe" by the programmer (usually read operations), but for those operations, when the programmer declares that he doesn't care when the reading is actually done, lazy IO is a beautiful thing. In particular, it allows the writing of modular reusable functions. That's actually a Good Thing... and as long as write operations are the only ones that require cleanup, it's also perfectly safe.
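A sketch of the "modular reusable functions" claim: interact wires lazy stdin to lazy stdout, so a pure transformer, written once, processes arbitrarily large input with no explicit read loop. (numbered is an illustrative name, not a library function.)

```haskell
-- A pure stream transformer: prepend line numbers.
numbered :: String -> String
numbered = unlines . zipWith tag [1 :: Int ..] . lines
  where
    tag n l = show n ++ "\t" ++ l

-- Lazy IO makes the plumbing a one-liner: input is read only
-- as the transformer demands it, and output streams as it goes.
main :: IO ()
main = interact numbered
```

The same pure function works equally well over a string constant, a file read with readFile, or lazily-read socket data; that reuse is the point.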
Trivial programs may be perfectly content to simply bail out if an exception is thrown. That's very sensible behaviour for a small 'pluggable' application (most of the various unix command line utilities bail out silently or nearly silently on SIGPIPE, for example). However, this is clearly not acceptable behaviour in a complex program. There may be resources which need to be released, there may be data which needs saving, there may be reconstruction to be attempted on whatever it was that 'broke'.
Error handling and recovery is hard. Always has been. One of the things that greatly simplifies such issues is knowing "where" exceptions can occur. In haskell they can only occur in the IO monad, and they can only occur in rather specific ways: in most cases, thrown by particular IO primitives; they can also be thrown 'at' you by other threads, but as the programmer, that's your problem!
This is irrelevant to the question of lazy IO or not lazy IO. As you say, all errors happen in the IO monad, and that's true with or without lazy IO, since ultimately IO is the only consumer of lazy data. Proper use of bracket catches all errors (modulo bugs in bracket, and signals being thrown... but certainly all calls to error), and you can do that at the top level, if you like. The downside in error checking when using lazy IO is just that the part of your program where errors pop up becomes less deterministic. However, since errors can happen at any time even without lazy IO, this is only a question of probability of errors showing up at certain times (think out of memory conditions, signals thrown, etc). Well-designed programs will be written robustly. (Yes, that's a truism, but it's one you seem to be forgetting.)
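For what it's worth, "proper use of bracket" does take some care when the bracketed action hands back lazy data. A sketch of the pitfall and the usual fix (readBroken and readStrict are illustrative names; behaviour of the broken version is what GHC typically does, namely silent truncation):

```haskell
import Control.Exception (bracket)
import System.IO

-- Pitfall: hGetContents returns a thunk, so hClose runs before any
-- data is actually demanded; the string comes back truncated.
readBroken :: FilePath -> IO String
readBroken path = bracket (openFile path ReadMode) hClose hGetContents

-- Fix: force the contents while the handle is still open, so the
-- cleanup in bracket really does run after all the reading.
readStrict :: FilePath -> IO String
readStrict path =
  bracket (openFile path ReadMode) hClose $ \h -> do
    s <- hGetContents h
    length s `seq` return s
```

The same forcing trick applies to any bracketed lazy read; it trades away the laziness, which is precisely the trade-off this thread is about.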
Ok. Five paragraphs of advocacy is plenty. If anyone is still reading now, then they must be either really interested in this problem, or really bored. Either way, it's good to have you with me! These issues are explained rather more elegantly by Oleg in [1]. ... Given these three pairs of options, what need is there for an unsafe lazy GET? What niche does it fill that is not equally well filled by one of these?
Program conciseness, perhaps. The kind of haskell one-liner whose performance makes us so (justly) proud. In isolation, though, I don't find that a convincing argument; not with the disadvantages also taken into account. The strongest argument, then, is that you have a 'stream processing' function, written 'naively' on [Word8] or Lazy ByteString, which wants to run as data is available, yet without wasting space. I'm inclined to feel that, if you really want to be able to run over 650M files, and you want to run as data is available, then in practice you want to be able to give feedback to the rest of your application on your progress so far; i.e., L.ByteString -> a is actually too simple a type anyway.
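A sketch of what such progress feedback might look like over a lazy ByteString: the consumer walks the chunk list and reports as each chunk is demanded. The names, and 'body' standing in for a lazily-downloaded response, are illustrative:

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import System.IO (hPutStrLn, stderr)

-- Consume a lazy ByteString chunk by chunk, reporting progress as
-- each chunk arrives; returns the total byte count.
consumeWithProgress :: L.ByteString -> IO Int
consumeWithProgress body = go 0 (L.toChunks body)
  where
    go :: Int -> [S.ByteString] -> IO Int
    go !n []       = return n
    go !n (c : cs) = do
      let n' = n + S.length c
      hPutStrLn stderr ("received " ++ show n' ++ " bytes so far")
      -- ... hand the chunk to the real consumer here ...
      go n' cs
```

Because the chunks are only produced as go demands them, the progress messages track the actual IO, whichever side of this argument one is on.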
Yes, this is the argument for lazy IO, and it's a valid one. Any adequately powerful interface can be used to implement a lazy IO function, and people will do so, whether or not it makes you happy. It'd be nice to have it in the library itself. Program conciseness is a real issue. Simple effective APIs make for useful libraries, and the simplest API is likely to be the most commonly used. If the simplest API is strict, then that means that there'll most often be *no* feedback until the download is complete. A lazy download means that feedback can be provided instantly, as the data is consumed. True, you need to include some feedback logic in your consumer, but that's where you'll almost certainly want it anyhow. And in many cases the feedback could come for free, in the form of output.
--
David Roundy
Department of Physics
Oregon State University