
I've been having something of a discussion on #haskell about this, but I had to go off-line and, in any case, it's a complicated issue, and I may be able to be clearer in an email. The key point under discussion was what kind of interface the HTTP library should expose: synchronous or asynchronous? Lazy or strict?

As someone just pointed out, "You don't like lazy IO, do you?". Well, that's a fair characterisation. I think unsafe lazy IO is a very, very cute hack, and I'm in awe of some of the performance results which have been achieved, but I think the disadvantages are underestimated.

Of course, there is a potential ambiguity in the phrase 'lazy IO'. You might quite reasonably interpret 'lazy IO' to refer to any programming style in which the IO is performed 'as needed' by the rest of the program. So, to be clear, I'm not raising a warning flag about that practice in general, which is a very important programming technique. I'm raising a bit of a warning flag over the particular practice of achieving this in a way which conceals IO inside thunks which have no IO in their types: i.e. using unsafeInterleaveIO or even unsafePerformIO.

Why is this a bad idea? Normally, evaluating a haskell expression can have no side-effects. This is important because, in a lazy language, you never quite know[*] when something's going to be evaluated. Or if it will. Side-effects, on the other hand, are important things (like launching nuclear missiles) and it's rather nice to be precise about when they happen.

One particular kind of side effect which is slightly less cataclysmic (only slightly) is the throwing of an exception. If pure code, which is normally guaranteed "at worst" to fail to terminate, can suddenly throw an exception from somewhere deep in its midst, then it's extremely hard for your program to work out how far it has got, what it has done, what it hasn't done, and what it should do to recover. (I'll give a tiny runnable illustration of this below.) Alternatively, no failure may occur, but the data may never be fully consumed, meaning that the IO is never 'finished' and valuable system resources are locked up forever. (Consider a naive program which reads only the first 1000 bytes of an XML document before hitting an unrecoverable parse failure. The socket will never be closed, and system resources will be consumed permanently.)

Trivial programs may be perfectly content to simply bail out if an exception is thrown. That's very sensible behaviour for a small 'pluggable' application (most of the unix command line utilities bail out silently or nearly silently on SIGPIPE, for example). However, this is clearly not acceptable behaviour in a complex program. There may be resources which need to be released, there may be data which needs saving, there may be reconstruction to be attempted on whatever it was that 'broke'.

Error handling and recovery is hard. Always has been. One of the things that greatly simplifies it is knowing "where" exceptions can occur. In haskell they can only occur in the IO monad, and they can only occur in rather specific ways: in most cases, thrown by particular IO primitives; they can also be thrown at you by other threads, but as the programmer, that's your problem!

Ok. Five paragraphs of advocacy is plenty. If anyone is still reading now, then they must be either really interested in this problem, or really bored. Either way, it's good to have you with me! These issues are explained rather more elegantly by Oleg in [1].
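Here is the illustration I promised above: a contrived but runnable sketch (my own example, with a made-up error message) of a value which looks pure but hides deferred IO that throws when forced.

    import System.IO.Unsafe (unsafeInterleaveIO)
    import Control.Exception (SomeException, evaluate, try)

    main :: IO ()
    main = do
        -- 's' escapes as an ordinary String; no IO has happened yet.
        s <- unsafeInterleaveIO (ioError (userError "connection dropped"))
        -- Forcing 's' runs the deferred IO, so the exception surfaces
        -- in the middle of what looks like pure evaluation.
        r <- try (evaluate (length s)) :: IO (Either SomeException Int)
        print r   -- Left user error (connection dropped)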
So, where does that leave the HTTP library? Well, here are the kinds of interface I can imagine. I'm deliberately ignoring all the stuff about request headers, request content, and so on, and imagining that this is all about URL -> ByteString. Here are the options that occur to me:

A. Strict, synchronous GET

    sSynGET :: URL -> IO ByteString

Quite simply blocks the thread until the whole data has arrived. Throws some kind of exception on failure, presumably. This is a simple primitive, appropriate for relatively small files (files which fit comfortably in your memory) and simple programs. It's also great for programs which want to take their own control over the degree of asynchrony; they can just fork as many threads as they choose to GET with.

B. Strict, asynchronous GET

    sAsynGET :: URL -> IO (MVar ByteString)

Download the entire data, but do it in a separate thread. Give me an MVar so I can synchronise on the data's arrival in whichever way suits my program best. Suitable for small files which fit conveniently in memory. Very easy to implement in terms of forkIO and sSynGET (see the first sketch after this list), so really it's a convenience function.

C. Strict, synchronous GET-to-file

    sSynFileGET :: URL -> FilePath -> IO ()

D. Strict, asynchronous GET-to-file

    sAsynFileGET :: URL -> FilePath -> IO (MVar ())

Download the entire data to a local file. This means that it doesn't matter if the data is far bigger than local memory; it can still be done efficiently. [Note that this doesn't mean it must use lazy getContents magic under the hood. It could easily use strict hGet in reasonable sized chunks and write them out straight away; the second sketch after this list shows the idea.] The only difference between the two variants is that one keeps control until completion, while the other gives you an MVar which you can block on if/when you choose. This method is appropriate for clients which need extremely large data and don't mind waiting for it to finish before they start processing. It is also appropriate for clients which want random access to large data requests (using underlying hSeek-based random file IO, once the file has downloaded).

E, F. Progressive GET

    pSynGET  :: URL -> ((Bool,ByteString) -> IO ()) -> IO ()
    pAsynGET :: URL -> ((Bool,ByteString) -> IO ()) -> IO (MVar ())

(This is a particularly simple case of Oleg's iteratees, I think.) Download the data at whatever speed is convenient. As data arrives, feed it to the 'callback' provided. The ByteString is the new chunk of data; the Bool is just supposed to indicate whether or not this is the final chunk. You can imagine slight variations. Note that the library promises not to retain any references to the ByteString chunks, so if the callback processes them and then discards them, they are eligible for garbage collection. If the callback wishes to accumulate them, it can quite easily 'append' the strict chunks into a lazy ByteString, which is an efficient operation (the third sketch after this list shows the pattern). This is suitable for applications which wish to do something like progressive display of a progressive JPEG, or compute a summary function, or perhaps even display an animation, although that last would normally need a little more structure to try to guarantee the data rate.

Incidentally, there are more complex options than (Bool,ByteString) -> IO (). A simple and obvious change is to add a return value. Another is a 'state monad by hand', as in (Bool,ByteString) -> s -> s, changing the final return type to IO s, which allows the callback to accumulate summary information and still be written as pure code. Other options allow the 'callback' to request early termination, by layering an 'Either' type in there.
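First sketch: how I'd expect B to fall out of A. This is minimal and assumes sSynGET from option A is given; the URL type is a placeholder of mine.

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar)
    import Data.ByteString (ByteString)

    type URL = String   -- placeholder for whatever the real library uses

    sSynGET :: URL -> IO ByteString
    sSynGET = undefined -- the strict synchronous primitive, assumed given

    -- Option B as a convenience wrapper: fork a worker thread which
    -- performs the strict GET and fills the MVar on completion.
    sAsynGET :: URL -> IO (MVar ByteString)
    sAsynGET url = do
        box <- newEmptyMVar
        _ <- forkIO (sSynGET url >>= putMVar box)
        return box

(A real version would have to decide what happens to the MVar if the worker thread throws; returning MVar (Either SomeException ByteString) is one obvious refinement.)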
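Second sketch: the sort of strict copy loop C and D could use internally, with no lazy magic. I'm assuming, purely for illustration, that 'sock' is a Handle on the already-open connection and that the response body runs to end-of-stream.

    import System.IO
    import qualified Data.ByteString as B

    -- Read strict 32k chunks and write them out immediately, so memory
    -- use stays constant no matter how large the download is.
    copyBody :: Handle -> FilePath -> IO ()
    copyBody sock path =
        withBinaryFile path WriteMode $ \out ->
            let loop = do
                    chunk <- B.hGet sock 32768
                    if B.null chunk
                        then return ()              -- end of stream
                        else B.hPut out chunk >> loop
            in loop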
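Third sketch: the accumulation trick mentioned under E and F, using the hypothetical pSynGET above. Note that fromChunks does no copying of the chunk data.

    import Data.IORef (modifyIORef, newIORef, readIORef)
    import qualified Data.ByteString.Lazy as L

    -- Collect the strict chunks as they arrive, then assemble them into
    -- a lazy ByteString at the end. Only the callback holds references
    -- to the chunks; the library itself retains nothing.
    fetchAll :: URL -> IO L.ByteString
    fetchAll url = do
        chunksRef <- newIORef []                  -- most recent chunk first
        pSynGET url $ \(_isFinal, chunk) ->
            modifyIORef chunksRef (chunk :)
        chunks <- readIORef chunksRef
        return (L.fromChunks (reverse chunks))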
Another, more sophisticated option, I think, is the higher rank type

    MonadTrans t => URL -> (forall m. Monad m => (Bool,ByteString) -> t m ()) -> t IO ()

...which, unless I've made a mistake, allows you to write the callback in 'any monad which can be expressed as a transformer', by transforming it over IO, but still contains the implicit promise that the 'callback' does no IO: it has to work over an arbitrary underlying monad m. For example, t = StateT s reduces to the earlier s -> s example, in effect, with a slightly different data layout.

Another couple of refinements to the above: in practice you normally want to 'guarantee' your callback only runs on a chunk size of at least X (in some cases 'exactly X'), or you want to guarantee it's called at least every Y seconds. Neither of these is hard to add.

Given these three pairs of options, what need is there for an unsafe lazy GET? What niche does it fill that is not equally well filled by one of these? Program conciseness, perhaps: the kind of haskell one-liner whose performance makes us so (justly) proud. In isolation, though, I don't find that a convincing argument; not with the disadvantages also taken into account.

The strongest argument, then, is that you have a 'stream processing' function which is written 'naively' on [Word8] or lazy ByteString, and which wants to run as data is available, yet without wasting space. I'm inclined to feel that, if you really want to be able to run over 650M files, and you want to run as data is available, then in practice you want to be able to give feedback to the rest of your application on your progress so far; i.e. L.ByteString -> a is actually too simple a type anyway.

I'm interested to know what opinions other people have on this, whether I've made any serious logic mistakes or just overlooked another approach which has advantages. Having spent quite a while this evening thinking this over, I just don't see the convincing case for the unsafe lazy approach, and I see plenty of problems with it...

Cheers,

Jules

* Well, OK. If you're smart, then you know. Because you're smart, and you thought about it carefully. But most of the time, you don't need to know. And if you produce data (a pure structure) which is then consumed by a library - or vice versa - then you don't know unless you read their code.

[1] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023073.html
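P.S. For concreteness, a sketch of how the t = StateT case works out. The name pMonadGET and its stub are mine (hypothetical, reusing the URL placeholder from the earlier sketches); the point is that countChunk type-checks for any m, so the type itself guarantees it performs no IO.

    {-# LANGUAGE RankNTypes #-}
    import Control.Monad.Trans.Class (MonadTrans)
    import Control.Monad.Trans.State (StateT, execStateT, modify)
    import qualified Data.ByteString as B

    pMonadGET :: MonadTrans t
              => URL
              -> (forall m. Monad m => (Bool, B.ByteString) -> t m ())
              -> t IO ()
    pMonadGET = undefined   -- hypothetical; the type is the point

    -- The callback runs in 'StateT Int m' for *any* monad m, so all it
    -- can do is manipulate the Int state: pure code, in effect.
    countChunk :: Monad m => (Bool, B.ByteString) -> StateT Int m ()
    countChunk (_, chunk) = modify (+ B.length chunk)

    -- Count the total bytes of a download without ever retaining them.
    totalBytes :: URL -> IO Int
    totalBytes url = execStateT (pMonadGET url countChunk) 0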