
I've been having something of a discussion on #haskell about this, but I had to go off-line and, in any case, it's a complicated issue, and I may be able to be clearer in an email. The key point under discussion was what kind of interface the HTTP library should expose: synchronous or asynchronous? Lazy or strict?

As someone just pointed out, "You don't like lazy IO, do you?". Well, that's a fair characterisation. I think unsafe lazy IO is a very, very cute hack, and I'm in awe of some of the performance results which have been achieved, but I think the disadvantages are underestimated.

Of course, there is a potential ambiguity in the phrase 'lazy IO'. You might quite reasonably interpret 'lazy IO' to refer to any programming style in which the IO is performed 'as needed' by the rest of the program. So, to be clear, I'm not raising a warning flag about that practice in general, which is a very important programming technique. I'm raising a bit of a warning flag over the particular practice of achieving this in a way which conceals IO inside thunks which have no IO in their types: i.e. using unsafeInterleaveIO or even unsafePerformIO.

Why is this a bad idea? Normally, evaluating a haskell expression can have no side-effects. This is important because, in a lazy language, you never quite know[*] when something's going to be evaluated. Or if it will. Side-effects, on the other hand, are important things (like launching nuclear missiles) and it's rather nice to be precise about when they happen.

One particular kind of side effect which is slightly less cataclysmic (only slightly) is the throwing of an exception. If pure code, which is normally guaranteed "at worst" to fail to terminate, can suddenly throw an exception from somewhere deep in its midst, then it's extremely hard for your program to work out how far it has got, what it has done, what it hasn't done, and what it should do to recover. (I'll give a tiny runnable illustration of this below.) Alternatively, no failure may occur, but the data may never be fully consumed, meaning that the IO is never 'finished' and valuable system resources are locked up forever. (Consider a naive program which reads only the first 1000 bytes of an XML document before hitting an unrecoverable parse failure. The socket will never be closed, and system resources will be consumed permanently.)

Trivial programs may be perfectly content to simply bail out if an exception is thrown. That's very sensible behaviour for a small 'pluggable' application (most of the unix command line utilities bail out silently or nearly silently on SIGPIPE, for example). However, this is clearly not acceptable behaviour in a complex program. There may be resources which need to be released, there may be data which needs saving, there may be reconstruction to be attempted on whatever it was that 'broke'.

Error handling and recovery is hard. Always has been. One of the things that greatly simplifies it is knowing "where" exceptions can occur. In haskell they can only occur in the IO monad, and they can only occur in rather specific ways: in most cases, thrown by particular IO primitives; they can also be thrown at you by other threads, but as the programmer, that's your problem!

Ok. Five paragraphs of advocacy is plenty. If anyone is still reading now, then they must be either really interested in this problem, or really bored. Either way, it's good to have you with me! These issues are explained rather more elegantly by Oleg in [1].
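Here is the illustration I promised above: a contrived but runnable sketch (my own example, with a made-up error message) of a value which looks pure but hides deferred IO that throws when forced.

    import System.IO.Unsafe (unsafeInterleaveIO)
    import Control.Exception (SomeException, evaluate, try)

    main :: IO ()
    main = do
        -- 's' escapes as an ordinary String; no IO has happened yet.
        s <- unsafeInterleaveIO (ioError (userError "connection dropped"))
        -- Forcing 's' runs the deferred IO, so the exception surfaces
        -- in the middle of what looks like pure evaluation.
        r <- try (evaluate (length s)) :: IO (Either SomeException Int)
        print r   -- Left user error (connection dropped)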
So, where does that leave the HTTP library? Well, here are the kinds of interface I can imagine. I'm deliberately ignoring all the stuff about request headers, request content, and so on, and imagining that this is all about URL -> ByteString. Here are the options that occur to me:

A. Strict, synchronous GET

    sSynGET :: URL -> IO ByteString

Quite simply blocks the thread until the whole data has arrived. Throws some kind of exception on failure, presumably. This is a simple primitive, appropriate for relatively small files (files which fit comfortably in your memory) and simple programs. It's also great for programs which want to take their own control over the degree of asynchrony; they can just fork as many threads as they choose to GET with.

B. Strict, asynchronous GET

    sAsynGET :: URL -> IO (MVar ByteString)

Download the entire data, but do it in a separate thread. Give me an MVar so I can synchronise on the data's arrival in whichever way suits my program best. Suitable for small files which fit conveniently in memory. Very easy to implement in terms of forkIO and sSynGET (see the first sketch after this list), so really it's a convenience function.

C. Strict, synchronous GET-to-file

    sSynFileGET :: URL -> FilePath -> IO ()

D. Strict, asynchronous GET-to-file

    sAsynFileGET :: URL -> FilePath -> IO (MVar ())

Download the entire data to a local file. This means that it doesn't matter if the data is far bigger than local memory; it can still be done efficiently. [Note that this doesn't mean it must use lazy getContents magic under the hood. It could easily use strict hGet in reasonable sized chunks and write them out straight away; the second sketch after this list shows the idea.] The only difference between the two variants is that one keeps control until completion, while the other gives you an MVar which you can block on if/when you choose. This method is appropriate for clients which need extremely large data and don't mind waiting for it to finish before they start processing. It is also appropriate for clients which want random access to large data requests (using underlying hSeek-based random file IO, once the file has downloaded).

E, F. Progressive GET

    pSynGET  :: URL -> ((Bool,ByteString) -> IO ()) -> IO ()
    pAsynGET :: URL -> ((Bool,ByteString) -> IO ()) -> IO (MVar ())

(This is a particularly simple case of Oleg's iteratees, I think.) Download the data at whatever speed is convenient. As data arrives, feed it to the 'callback' provided. The ByteString is the new chunk of data; the Bool is just supposed to indicate whether or not this is the final chunk. You can imagine slight variations. Note that the library promises not to retain any references to the ByteString chunks, so if the callback processes them and then discards them, they are eligible for garbage collection. If the callback wishes to accumulate them, it can quite easily 'append' the strict chunks into a lazy ByteString, which is an efficient operation (the third sketch after this list shows the pattern). This is suitable for applications which wish to do something like progressive display of a progressive JPEG, or compute a summary function, or perhaps even display an animation, although that last would normally need a little more structure to try to guarantee the data rate.

Incidentally, there are more complex options than (Bool,ByteString) -> IO (). A simple and obvious change is to add a return value. Another is a 'state monad by hand', as in (Bool,ByteString) -> s -> s, changing the final return type to IO s, which allows the callback to accumulate summary information and still be written as pure code. Other options allow the 'callback' to request early termination, by layering an 'Either' type in there.
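First sketch: how I'd expect B to fall out of A. This is minimal and assumes sSynGET from option A is given; the URL type is a placeholder of mine.

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar)
    import Data.ByteString (ByteString)

    type URL = String   -- placeholder for whatever the real library uses

    sSynGET :: URL -> IO ByteString
    sSynGET = undefined -- the strict synchronous primitive, assumed given

    -- Option B as a convenience wrapper: fork a worker thread which
    -- performs the strict GET and fills the MVar on completion.
    sAsynGET :: URL -> IO (MVar ByteString)
    sAsynGET url = do
        box <- newEmptyMVar
        _ <- forkIO (sSynGET url >>= putMVar box)
        return box

(A real version would have to decide what happens to the MVar if the worker thread throws; returning MVar (Either SomeException ByteString) is one obvious refinement.)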
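Second sketch: the sort of strict copy loop C and D could use internally, with no lazy magic. I'm assuming, purely for illustration, that 'sock' is a Handle on the already-open connection and that the response body runs to end-of-stream.

    import System.IO
    import qualified Data.ByteString as B

    -- Read strict 32k chunks and write them out immediately, so memory
    -- use stays constant no matter how large the download is.
    copyBody :: Handle -> FilePath -> IO ()
    copyBody sock path =
        withBinaryFile path WriteMode $ \out ->
            let loop = do
                    chunk <- B.hGet sock 32768
                    if B.null chunk
                        then return ()              -- end of stream
                        else B.hPut out chunk >> loop
            in loop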
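Third sketch: the accumulation trick mentioned under E and F, using the hypothetical pSynGET above. Note that fromChunks does no copying of the chunk data.

    import Data.IORef (modifyIORef, newIORef, readIORef)
    import qualified Data.ByteString.Lazy as L

    -- Collect the strict chunks as they arrive, then assemble them into
    -- a lazy ByteString at the end. Only the callback holds references
    -- to the chunks; the library itself retains nothing.
    fetchAll :: URL -> IO L.ByteString
    fetchAll url = do
        chunksRef <- newIORef []                  -- most recent chunk first
        pSynGET url $ \(_isFinal, chunk) ->
            modifyIORef chunksRef (chunk :)
        chunks <- readIORef chunksRef
        return (L.fromChunks (reverse chunks))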
Another, more sophisticated option, I think, is the higher rank type

    MonadTrans t => URL -> (forall m. Monad m => (Bool,ByteString) -> t m ()) -> t IO ()

...which, unless I've made a mistake, allows you to write the callback in 'any monad which can be expressed as a transformer', by transforming it over IO, but still contains the implicit promise that the 'callback' does no IO: it has to work over an arbitrary underlying monad m. For example, t = StateT s reduces to the earlier s -> s example, in effect, with a slightly different data layout.

Another couple of refinements to the above: in practice you normally want to 'guarantee' your callback only runs on a chunk size of at least X (in some cases 'exactly X'), or you want to guarantee it's called at least every Y seconds. Neither of these is hard to add.

Given these three pairs of options, what need is there for an unsafe lazy GET? What niche does it fill that is not equally well filled by one of these? Program conciseness, perhaps: the kind of haskell one-liner whose performance makes us so (justly) proud. In isolation, though, I don't find that a convincing argument; not with the disadvantages also taken into account.

The strongest argument, then, is that you have a 'stream processing' function which is written 'naively' on [Word8] or lazy ByteString, and which wants to run as data is available, yet without wasting space. I'm inclined to feel that, if you really want to be able to run over 650M files, and you want to run as data is available, then in practice you want to be able to give feedback to the rest of your application on your progress so far; i.e. L.ByteString -> a is actually too simple a type anyway.

I'm interested to know what opinions other people have on this, whether I've made any serious logic mistakes or just overlooked another approach which has advantages. Having spent quite a while this evening thinking this over, I just don't see the convincing case for the unsafe lazy approach, and I see plenty of problems with it...

Cheers,

Jules

* Well, OK. If you're smart, then you know. Because you're smart, and you thought about it carefully. But most of the time, you don't need to know. And if you produce data (a pure structure) which is then consumed by a library - or vice versa - then you don't know unless you read their code.

[1] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023073.html
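P.S. For concreteness, a sketch of how the t = StateT case works out. The name pMonadGET and its stub are mine (hypothetical, reusing the URL placeholder from the earlier sketches); the point is that countChunk type-checks for any m, so the type itself guarantees it performs no IO.

    {-# LANGUAGE RankNTypes #-}
    import Control.Monad.Trans.Class (MonadTrans)
    import Control.Monad.Trans.State (StateT, execStateT, modify)
    import qualified Data.ByteString as B

    pMonadGET :: MonadTrans t
              => URL
              -> (forall m. Monad m => (Bool, B.ByteString) -> t m ())
              -> t IO ()
    pMonadGET = undefined   -- hypothetical; the type is the point

    -- The callback runs in 'StateT Int m' for *any* monad m, so all it
    -- can do is manipulate the Int state: pure code, in effect.
    countChunk :: Monad m => (Bool, B.ByteString) -> StateT Int m ()
    countChunk (_, chunk) = modify (+ B.length chunk)

    -- Count the total bytes of a download without ever retaining them.
    totalBytes :: URL -> IO Int
    totalBytes url = execStateT (pMonadGET url countChunk) 0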