Network.HTTP+ByteStrings Interface--Or: How to shepherd handles and go with the flow at the same time?

Hello Cafe! I'd greatly appreciate any ideas/comments on the design of the interface to the Network.HTTP library with a lazy ByteString (LBS) backend.

As has been discussed previously on this list [1], lazy evaluation can complicate resource management, which is especially critical if resources are seriously limited--in this case, network sockets and/or file handles. In [2] Oleg shows how we can use left folds as a general mechanism to traverse collections, and argues that this is superior to cursors, generators or streams. I agree with his arguments; however, I still don't see how this can be a good interface for the HTTP library. For a start, what should the collection element type be? Word8, Char8, or (strict) ByteString (BS)? Unless we get some kind of fusion for the first two, we're pretty much stuck with the latter. The interface could look like this (untested):

    import Data.ByteString (ByteString)   -- i.e., strict ByteStrings
    import qualified Data.ByteString as BS

    handleRequest :: MonadIO m
                  => Request
                  -> a
                  -> (ByteString -> a -> m (Either a a))
                  -> m a

Ignoring error handling for now, this might be implemented as:

    handleRequest req seed processor = do
        a <- iter seed bodySize
        closeConnection req
        return a
      where
        iter seed 0 = return seed
        iter seed n = do
            let c = min chunkSize n
            bs   <- liftIO $ readBlock req c    -- calls BS.hGet handle c
            next <- processor bs seed
            case next of
              Left rslt   -> return rslt
              Right seed' -> iter seed' (n - c)
        -- insert catchError stuff here (requires a signature change, of course)

This should work fine and, provided the 'processor' function does not hold on to the chunks, would run in constant space. Unfortunately, this has a big disadvantage. Most operations on the returned data will probably be stream-like functions, such as parsing the data into some kind of tree. [2] shows a method for converting enumerators to streams, but the stream type used there,

    data MyStream m a = MyNil (Maybe a) | MyCons a (m (MyStream m a))

is incompatible with [a], which is what lazy ByteStrings are built on, due to the embedding of m. I also don't know whether fusion can work across monads.

My current suggestion would therefore be a less "safe"[*] solution (again, untested and modulo error handling):

    -- | Execute the request and call @f@ with the returned response body.
    -- The socket will be closed immediately after @f@ terminates. You must
    -- therefore make sure that any data you might want to return has been
    -- forced, e.g. using (length . take) on the lazy ByteString.
    withRequest :: MonadIO m => Request -> a -> (LazyByteString -> a -> m a) -> m a

The implementation would lazily read the contents (implemented as described in [3]), and forcing it would be left to the function parameter. E.g.:

    getHTML :: String -> IO HTMLParseTree
    getHTML addr = do
        r <- mkRequest addr
        tree <- withRequest r emptyTree parseHTML
        seq tree $ return tree    -- ! (I'm afraid this is necessary)

    tricky :: String -> IO LazyByteString   -- result will not really be lazy
    tricky addr = do
        r <- mkRequest addr
        withRequest r L.empty dropNTake      -- we might have to force the result here
      where
        dropNTake s _ = return (L.take 100000 (L.drop 100000 s))

If these 'seq's are really necessary, then this would be a pretty hard-to-use interface. So, any ideas / suggestions?

/ Thomas

[*] .. "safe" in the sense that it does not enforce certain behavior by means of the type signature or API design.

[1] .. http://thread.gmane.org/gmane.comp.lang.haskell.cafe/20528/focus=20635
[2] .. http://okmij.org/ftp/papers/LL3-collections-enumerators.txt
[3] .. http://nominolo.blogspot.com/2007/05/networkhttp-bytestrings.html

--
"Remember! Everytime you say 'Web 2.0' God kills a startup!" - userfriendly.org, Jul 31, 2006
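[To illustrate how the fold-style handleRequest above would be driven, here is a minimal consumer sketch. handleRequest and Request are the hypothetical API from the post, not an existing library.]

    import Data.ByteString (ByteString)
    import qualified Data.ByteString as BS

    -- Count the bytes of the response body in constant space. Returning
    -- Right asks for the next chunk; Left would request early termination.
    countBytes :: Monad m => ByteString -> Int -> m (Either Int Int)
    countBytes chunk n = return (Right (n + BS.length chunk))

    -- usage:  bodyLength req = handleRequest req 0 countBytes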

I am uncertain about all the issues here, but... why do you need to convert Socket to Handle? I have no clue if the code I pasted below works, but it does compile:
    import Network.Socket
    import Data.ByteString.Base as Base

    -- 'recvBSFrom' gets a strict ByteString from a socket.
    -- createAndTrim' was not quite documented, so I looked at
    -- http://darcs.haskell.org/packages/base/Data/ByteString/Base.hs
    recvBSFrom :: Socket -> Int -> IO (ByteString, SockAddr)
    recvBSFrom sock nbytes =
        Base.createAndTrim' nbytes $ \ptr -> do
            (len, sockaddr) <- recvBufFrom sock ptr nbytes
            -- maybe check if len is (-1) ?
            return (0, len, sockaddr)
I was not sure how recvBufFrom handles errors, so if recvBufFrom might return a len of (-1), then the above will need to detect this. (This looks like it might happen for non-blocking sockets.) The above was modeled after recvFrom, reproduced below from the source at http://darcs.haskell.org/packages/network/Network/Socket.hsc
    recvFrom :: Socket -> Int -> IO (String, Int, SockAddr)
    recvFrom sock nbytes =
        allocaBytes nbytes $ \ptr -> do
            (len, sockaddr) <- recvBufFrom sock ptr nbytes
            str <- peekCStringLen (ptr, len)
            return (str, len, sockaddr)
If you can figure out how to defer the calls to recvBSFrom properly, then you can make lazy ByteStrings as well. One would need to understand how to set the socket to non-blocking (?) and mimic the hGetContentsN code from http://darcs.haskell.org/packages/base/Data/ByteString/Lazy.hs

If this hypothetical 'socketGetContentsAsLazyByteString' also created a fresh MVar (via newEmptyMVar), then it could return a (tryPutMVar m ()) action to allow the consumer of the lazy string to signal to the deferred reading process (which periodically calls (isEmptyMVar m)) that it should close the socket, or at least stop trying to read from it. That way there is no _need_ to 'seq' your way to the end of the lazy ByteString to cause it to close.

-- Chris
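[For what it's worth, a rough sketch of the scheme just described, assuming the recvBSFrom defined above; socketGetContentsLBS is a hypothetical name, and the MVar is the consumer's stop signal.]

    import Control.Concurrent.MVar
    import Network.Socket (Socket, sClose)   -- sClose is 'close' in newer network versions
    import System.IO.Unsafe (unsafeInterleaveIO)
    import qualified Data.ByteString as BS
    import qualified Data.ByteString.Lazy as L

    -- Lazily read a socket into a lazy ByteString; also return the action
    -- that tells the deferred reader to stop and close the socket.
    socketGetContentsLBS :: Socket -> IO (L.ByteString, IO ())
    socketGetContentsLBS sock = do
        stop <- newEmptyMVar
        let loop = unsafeInterleaveIO $ do
                stopped <- fmap not (isEmptyMVar stop)
                if stopped
                  then sClose sock >> return L.empty
                  else do
                    (bs, _addr) <- recvBSFrom sock 4096   -- recvBSFrom from above
                    if BS.null bs
                      then sClose sock >> return L.empty
                      else do rest <- loop
                              return (L.fromChunks [bs] `L.append` rest)
        body <- loop
        return (body, tryPutMVar stop () >> return ())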

On 5/24/07, haskell@list.mightyreason.com wrote:
Why do you need to convert Socket to Handle?
Initially, we chose to use socketToHandle for simplicity reasons--why duplicate functionality if we can reuse it? After Simon Marlow's comment that my reason for assuming it inappropriate no longer holds, we decided to just keep it that way. I'm not completely sure, but I think the handle created from the socket will not use up any additional file handles--or if it does, it's the socket behind it--so in any case it doesn't change anything w.r.t. the total number of available resource handles. If it turns out that we should in fact add functions to work directly on sockets, it would be a mostly straightforward translation of hGet, hGetNonBlocking and hGetContents. But that's not an issue ATM.
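[For reference, the Handle route is a one-liner with the real Network.Socket.socketToHandle; a minimal sketch, with socketReader an invented name:]

    import Network.Socket (Socket, socketToHandle)
    import System.IO (Handle, IOMode(ReadWriteMode))

    -- Convert the connected socket exactly once; afterwards the ordinary
    -- Handle-based readers (BS.hGet etc.) apply, and closing the Handle
    -- closes the underlying socket as well.
    socketReader :: Socket -> IO Handle
    socketReader sock = socketToHandle sock ReadWriteMode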
If this hypothetical 'socketGetContentsAsLazyByteString' also created a fresh MVar (via newEmptyMVar), then it could return a (tryPutMVar m ()) action to allow the consumer of the lazy string to signal to the deferred reading process (which periodically calls (isEmptyMVar m)) that it should close the socket, or at least stop trying to read from it. That way there is no _need_ to 'seq' your way to the end of the lazy ByteString to cause it to close.
Now here comes the actual problem; sorry if that wasn't clear from my original post. We actually don't want to rely on the consumer to signal that it is done reading--this would amount to manual memory management, which is unreliable. Also, we can't be sure the string will ever be consumed completely. Another unsatisfactory solution is to attach a finalizer which is executed when the garbage collector detects that the object is no longer live--but that is not guaranteed to happen at all. These are precisely the reasons why Oleg proposed the fold-based method. Unfortunately, that does not seem to work nicely in its current form. Since I want to keep some form of laziness, my trade-off is to create a "lazy zone" in which we read data on demand; anyone who needs to hold on to the data for longer has to gather it explicitly, which, as shown, might be tricky to use and is thus undesirable (especially for a library).

/ Thomas

--
"Remember! Everytime you say 'Web 2.0' God kills a startup!" - userfriendly.org, Jul 31, 2006
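[One concrete way to 'gather the data explicitly' inside the lazy zone, as a sketch: forceLBS is an invented helper, and withRequest is the hypothetical API from the first post.]

    import qualified Data.ByteString.Lazy as L

    -- Forcing the spine (here via L.length) reads every chunk into memory,
    -- so the value remains valid after the socket has been closed.
    forceLBS :: L.ByteString -> L.ByteString
    forceLBS lbs = L.length lbs `seq` lbs

    -- usage:  withRequest r L.empty (\body _ -> return (forceLBS body))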

I've been having something of a discussion on #haskell about this, but I had to go off-line and, in any case, it's a complicated issue, and I may be able to be more clear in an email.

The key point under discussion was what kind of interface the HTTP library should expose: synchronous or asynchronous? Lazy or strict?

As someone just pointed out, "You don't like lazy IO, do you?". Well, that's a fair characterisation. I think unsafe lazy IO is a very very cute hack, and I'm in awe of some of the performance results which have been achieved, but I think the disadvantages are underestimated.

Of course, there is a potential ambiguity in the phrase 'lazy IO'. You might quite reasonably interpret 'lazy IO' to refer to any programming style in which the IO is performed 'as needed' by the rest of the program. So, to be clear, I'm not raising a warning flag about that practice in general, which is a very important programming technique. I'm raising a bit of a warning flag over the particular practice of achieving this in a way which conceals IO inside thunks which have no IO in their types: i.e. using unsafeInterleaveIO or even unsafePerformIO.

Why is this a bad idea? Normally, evaluating a haskell expression can have no side effects. This is important because, in a lazy language, you never quite know[*] when something's going to be evaluated. Or if it will. Side effects, on the other hand, are important things (like launching nuclear missiles), and it's rather nice to be precise about when they happen.

One particular kind of side effect which is slightly less cataclysmic (only slightly) is the throwing of an exception. If pure code, which is normally guaranteed to "at worst" fail to terminate, can suddenly throw an exception from somewhere deep in its midst, then it's extremely hard for your program to work out how far it has got, what it has done, what it hasn't done, and what it should do to recover. On the other hand, no failure may occur, but the data may never be processed, meaning that the IO is never 'finished' and valuable system resources are locked up forever. (Consider a naive program which reads only the first 1000 bytes of an XML document before hitting an unrecoverable parse failure. The socket will never be closed, and system resources will be consumed permanently.)

Trivial programs may be perfectly content to simply bail out if an exception is thrown. That's very sensible behaviour for a small 'pluggable' application (most of the various unix command line utilities bail out silently or nearly silently on SIGPIPE, for example). However, this is clearly not acceptable behaviour in a complex program. There may be resources which need to be released, there may be data which needs saving, there may be reconstruction to be attempted on whatever it was that 'broke'.

Error handling and recovery is hard. Always has been. One of the things that simplifies such issues is knowing "where" exceptions can occur. It greatly simplifies them. In haskell they can only occur in the IO monad, and they can only occur in rather specific ways: in most cases, thrown by particular IO primitives; they can also be thrown 'at' you by other threads, but as the programmer, that's your problem!

Ok. Five paragraphs of advocacy is plenty. If anyone is still reading now, then they must be either really interested in this problem, or really bored. Either way, it's good to have you with me! These issues are explained rather more elegantly by Oleg in [1].

So, where does that leave the HTTP library?
Well, here are the kinds of interface I can imagine. I'm deliberately ignoring all the stuff about request headers and request content, and imagining that this is all about URL -> ByteString. Here are the options that occur to me:

A. Strict, synchronous GET

    sSynGET :: URL -> IO ByteString

Quite simply blocks the thread until all the data has arrived. Throws some kind of exception on failure, presumably. This is a simple primitive, appropriate for relatively small files (files which fit comfortably in your memory) and simple programs. It's also great for programs which want to take their own control over the degree of asynchrony; they can just fork as many threads as they choose to GET with.

B. Strict, asynchronous GET

    sAsynGET :: URL -> IO (MVar ByteString)

Download the entire data, but do it in a separate thread. Give me an MVar so I can synchronise on the data's arrival in whichever way suits my program best. Suitable for small files which fit conveniently in memory. Very easy to implement in terms of forkIO and sSynGET, so really it's a convenience function (see the sketch at the end of this message).

C. Strict, synchronous GET-to-file

    sSynFileGET :: URL -> FilePath -> IO ()

D. Strict, asynchronous GET-to-file

    sAsynFileGET :: URL -> FilePath -> IO (MVar ())

Download the entire data to a local file. This means that it doesn't matter if the data is far bigger than local memory; it can still be done efficiently. [Note that this doesn't mean it must use lazy getContents magic under the hood. It could easily use strict hGet in reasonably sized chunks and write them out straight away.] The only difference between the two variants is that one keeps control until completion; the other gives you an MVar which you can block on if/when you choose. This method is appropriate for clients which need extremely large data and don't mind waiting for it to finish before they start processing. It is also appropriate for clients which want random access to large data requests (using underlying hSeek-based random file IO, once the file has downloaded).

E, F. Progressive GET

    pSynGET  :: URL -> ((Bool,ByteString) -> IO ()) -> IO ()
    pAsynGET :: URL -> ((Bool,ByteString) -> IO ()) -> IO (MVar ())

(This is a particularly simple case of Oleg's iteratees, I think.) Download the data at whatever speed is convenient. As data arrives, feed it to the 'callback' provided. The ByteString is the new chunk of data; the Bool is just supposed to indicate whether or not this is the final chunk. You can imagine slight variations. Note that the library promises not to retain any references to the ByteString chunks, so if the callback processes them and then discards them, they are eligible for garbage collection. If the callback wishes to accumulate them, it can quite easily 'append' the strict chunks into a lazy ByteString, which is an efficient operation. This is suitable for applications which wish to do something like progressive display of a pJPEG, or compute a summary function, or perhaps even display an animation, although that last would normally need a little more structure to try to guarantee the data rate.

Incidentally, there are more complex options than (Bool,ByteString) -> IO (). A simple and obvious change is to add a return value. Another is a 'state monad by hand', as in (Bool,ByteString) -> s -> s, changing the final return value of the type to IO s, which allows the callback to accumulate summary information and still be written as pure code. Other options allow the 'callback' to request early termination, by layering an 'Either' type in there.
Another more sophisticated option, I think, is the higher-rank

    MonadTrans t => URL -> (forall m. Monad m => (Bool,ByteString) -> t m ()) -> t IO ()

...which, unless I've made a mistake, allows you to write in 'any monad which can be expressed as a transformer', by transforming it over IO, but which still contains the implicit promise that the 'callback' does no IO. For example, t = StateT reduces to the earlier s -> s example, in effect, with a slightly different data layout.

A couple of further refinements: in practice you normally want to 'guarantee' that your callback only runs on a chunk size of at least X (in some cases 'exactly X'), or you want to guarantee it's called at least every Y seconds. Neither of these is hard to add.

Given these three pairs of options, what need is there for an unsafe lazy GET? What niche does it fill that is not equally well filled by one of these? Program conciseness, perhaps. The kind of haskell one-liner whose performance makes us so (justly) proud. In isolation, though, I don't find that a convincing argument; not with the disadvantages taken also into account.

The strongest argument, then, is that you have a 'stream processing' function, written 'naively' on [Word8] or lazy ByteString, which wants to run as data is available, yet without wasting space. I'm inclined to feel that, if you really want to be able to run over 650M files, and you want to run as data is available, then in practice you want to be able to give feedback to the rest of your application on your progress so far; i.e., L.ByteString -> a is actually too simple a type anyway.

I'm interested to know what opinions other people have on this, whether I've made any serious logic mistakes or just overlooked another approach which has advantages. Having spent quite a while this evening thinking this over, I just don't see the convincing case for the unsafe lazy approach, and I see plenty of problems with it...

Cheers,

Jules

* Well, OK. If you're smart, then you know. Because you're smart, and you thought about it carefully. But most of the time, you don't need to know. And if you produce data (a pure structure) which is then consumed by a library - or vice versa - then you don't know unless you read their code.

[1] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023073.html
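[The sketch promised under option B above: sAsynGET as the convenience wrapper Jules describes, parameterised over the option-A primitive. URL is a placeholder type here; neither function exists in a released library.]

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar
    import Data.ByteString (ByteString)

    type URL = String   -- placeholder; the real library would have its own type

    -- Fork the strict synchronous download and hand back an MVar the caller
    -- can block on (or poll) at leisure. A production version would also
    -- propagate exceptions from the forked thread into the MVar.
    sAsynGET :: (URL -> IO ByteString) -> URL -> IO (MVar ByteString)
    sAsynGET sSynGET url = do
        box <- newEmptyMVar
        _ <- forkIO (sSynGET url >>= putMVar box)
        return box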

As a newbie to Haskell, I found your thorough analysis very
interesting. Thanks for the great read! I have a few questions
regarding some of your comments, see below:
Jules Bean writes:
E, F. Progressive GET

    pSynGET  :: URL -> ((Bool,ByteString) -> IO ()) -> IO ()
    pAsynGET :: URL -> ((Bool,ByteString) -> IO ()) -> IO (MVar ())
(This is a particularly simple case of Oleg's iteratees, I think.) Download the data at whatever speed is convenient. As data arrives, feed it to the 'callback' provided. The ByteString is the new chunk of data; the Bool is just supposed to indicate whether or not this is the final chunk.
Incidentally, there are more complex options than (Bool,ByteString) -> IO (). A simple and obvious change is to add a return value. Another is a 'state monad by hand', as in (Bool,ByteString) -> s -> s, changing the final return value of the type to IO s, which allows the callback to accumulate summary information and still be written as pure code.
I want to be sure that I understand the implications of the callback function returning an IO action as originally proposed versus it being a pure function. It would seem to me that if it were a pure callback the usefulness would be limited as I would not be able to take the data read from the network and immediately write it out to a file. Is this correct? And if the above is correct, is there a way to define the callback such that one does not have to hardcode the IO monad in the return type so you can have the best of both worlds?
Other options allow the 'callback' to request early termination, by layering in an 'Either' type in there.
I believe the ability to request early termination is important, and was one of the nice features of Oleg's left-fold enumerators. It would be a shame if the API did not offer this capability.
Another more sophisticated option, I think, is the higher rank
    MonadTrans t => URL -> (forall m. Monad m => (Bool,ByteString) -> t m ()) -> t IO ()
...which, unless I've made a mistake, allows you to write in 'any monad which can be expressed as a transformer', by transforming it over IO, but still contains the implicit promise that the 'callback' does no IO. For example t = StateT reduces to the earlier s -> s example, in effect, with a slightly different data layout.
I don't fully understand this, but would this prevent one from calling IO actions as it was receiving the chunks in the callback (such as writing it to a file immediately)?

Pete Kazmier wrote:
Jules Bean writes:

E, F. Progressive GET

    pSynGET  :: URL -> ((Bool,ByteString) -> IO ()) -> IO ()
    pAsynGET :: URL -> ((Bool,ByteString) -> IO ()) -> IO (MVar ())
Incidentally, there are more complex options than (Bool,ByteString) -> IO (). A simple and obvious change is to add a return value. Another is a 'state monad by hand', as in (Bool,ByteString) -> s -> s, changing the final return value of the type to IO s, which allows the callback to accumulate summary information and still be written as pure code.
I want to be sure that I understand the implications of the callback function returning an IO action as originally proposed versus it being a pure function. It would seem to me that if it were a pure callback the usefulness would be limited as I would not be able to take the data read from the network and immediately write it out to a file. Is this correct?
Absolutely. A sensibly flexible API needs both possibilities; the IO-doing callback and the non-IO-doing callback.
And if the above is correct, is there a way to define the callback such that one does not have to hardcode the IO monad in the return type so you can have the best of both worlds?
Not that I could think of. The closest I could get to a general type was 'any monad which can be interleaved with IO', for which the best type I could think of was 'any monad which can be written as a monad transformer over IO', which was my final example below. You certainly can't have the very general type 'any monad m', since there is no way to interleave "a general monad" with IO. To interleave a monad with IO, you essentially need a pair of functions runM :: m a -> IO (FOO a) and mkM :: FOO a -> m a, where 'FOO' is some kind of structure which "freezes" all the side effects of m--like the 'freeze' routine in some of the array classes. Both 'MonadIO' and 'MonadTrans' are possible ways to get this kind of structure.
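[To make the freeze/thaw idea concrete: for t = StateT, the pair Jules calls runM/mkM can be written down directly. A minimal sketch; runM here takes the initial state explicitly, which the prose elides, and FOO a is just the pair (a, s).]

    import Control.Monad.State

    -- "Freezing" StateT s IO: run the effects, leaving a pure pair (a, s).
    runM :: StateT s IO a -> s -> IO (a, s)
    runM = runStateT

    -- "Thawing": re-inject a frozen pair into StateT s n for any monad n.
    mkM :: Monad n => (a, s) -> StateT s n a
    mkM (a, s) = put s >> return a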
Other options allow the 'callback' to request early termination, by layering in an 'Either' type in there.
I believe the ability to request early termination is important, and was one of the nice features of Oleg's left-fold enumerators. It would be a shame if the API did not offer this capability.
Yes, I agree. I was simplifying to make the presentation shorter, not because I felt that feature was optional.
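[A hedged sketch of what the Either layering might look like. foldChunks is invented here purely to show the control flow; it drives a pure callback over already-read chunks.]

    import Data.ByteString (ByteString)

    -- Left aborts immediately with the final accumulator; Right feeds the
    -- accumulator on to the next chunk.
    foldChunks :: [(Bool, ByteString)]
               -> s
               -> ((Bool, ByteString) -> s -> Either s s)
               -> s
    foldChunks []     s _ = s
    foldChunks (c:cs) s f = case f c s of
                              Left  done -> done
                              Right s'   -> foldChunks cs s' f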
Another more sophisticated option, I think, is the higher rank
    MonadTrans t => URL -> (forall m. Monad m => (Bool,ByteString) -> t m ()) -> t IO ()
...which, unless I've made a mistake, allows you to write in 'any monad which can be expressed as a transformer', by transforming it over IO, but still contains the implicit promise that the 'callback' does no IO. For example t = StateT reduces to the earlier s -> s example, in effect, with a slightly different data layout.
I don't fully understand this, but would this prevent one from calling IO actions as it was receiving the chunks in the callback (such as writing it to a file immediately)?
Yes, and that's the point :) Of course, you want both variants available: doing IO, and definitely not doing IO. Whilst the IO case is definitely a common one (progressively rendering graphics, etc.), the non-IO case is also quite feasible (parsing/calculating/summarising). If it happens that you don't need to do any IO, it's nice to have that reflected in the type. It's easier to write tests for non-IO code, for example.

Jules
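[A tiny illustration of that testability point; the names are invented, and the callback is the pure s -> s style from earlier in the thread.]

    import Data.ByteString.Char8 (ByteString, pack)
    import qualified Data.ByteString as BS

    -- A pure accumulator callback: total byte count. Because it does no IO,
    -- it can be exercised by plain unit tests, no network needed.
    countChunk :: (Bool, ByteString) -> Int -> Int
    countChunk (_, bs) n = n + BS.length bs

    -- test:  foldr countChunk 0 [(False, pack "ab"), (True, pack "c")] == 3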

Hello Jules, Friday, May 25, 2007, 1:17:49 AM, you wrote:
The key point under discussion was what kind of interface the HTTP library should expose: synchronous, asynchronous? Lazy, strict?
isn't it possible to implement the simplest (strict sync) interface as the base and then add higher levels on top if they are complicated enough? (otherwise it would be simpler to do the same in each application / 3rd-party lib)

in particular, we have a long-standing problem of a customized io manager (instead of the current io thread) which should work with any async i/o method (kqueue and so on), and this manager should be used for the low-level implementation (i.e. we would have a sync interface, but internally all i/o would be done async via this thread, and many haskell lightweight threads can work simultaneously)

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Thu, May 24, 2007 at 10:17:49PM +0100, Jules Bean wrote:
I've been having something of a discussion on #haskell about this but I had to go off-line and, in any case, it's a complicated issue, and I may be able to be more clear in an email.
The key point under discussion was what kind of interface the HTTP library should expose: synchronous, asynchronous? Lazy, strict?
As someone just pointed out, "You don't like lazy IO, do you?". Well, that's a fair characterisation. I think unsafe lazy IO is a very very cute hack, and I'm in awe of some of the performance results which have been achieved, but I think the disadvantages are underestimated.
Of course, there is a potential ambiguity in the phrase 'lazy IO'. You might quite reasonably interpret 'lazy IO' to refer to any programming style in which the IO is performed 'as needed' by the rest of the program. So, to be clear, I'm not raising a warning flag about that practice in general, which is a very important programming technique. I'm raising a bit of a warning flag over the particular practice of achieving this in a way which conceals IO inside thunks which have no IO in their types: i.e. using unsafeInterleaveIO or even unsafePerformIO.
Why is this a bad idea? Normally, evaluating a haskell expression can have no side effects. This is important because, in a lazy language, you never quite know[*] when something's going to be evaluated. Or if it will. Side effects, on the other hand, are important things (like launching nuclear missiles), and it's rather nice to be precise about when they happen.

One particular kind of side effect which is slightly less cataclysmic (only slightly) is the throwing of an exception. If pure code, which is normally guaranteed to "at worst" fail to terminate, can suddenly throw an exception from somewhere deep in its midst, then it's extremely hard for your program to work out how far it has got, what it has done, what it hasn't done, and what it should do to recover. On the other hand, no failure may occur, but the data may never be processed, meaning that the IO is never 'finished' and valuable system resources are locked up forever. (Consider a naive program which reads only the first 1000 bytes of an XML document before hitting an unrecoverable parse failure. The socket will never be closed, and system resources will be consumed permanently.)
Yes, obviously lazy IO needs to be done with care, but pure functions always consume resources, and lazy IO is not unique in this regard. It does change the nature of the resources consumed, but that's all. No function can "at worst" fail to terminate; any function can also fail with 'error', or run out of stack space. It seems that your real problem here is that sockets aren't freed when programs exit. I suppose that's a potential problem, but it doesn't seem like a critical one. I assume firefox has already permanently consumed gobs of system resources, and it hasn't bothered me yet... except for the memory, and that's fortunately not permanent. (Incidentally, couldn't atexit be used to clean up sockets in case of unclean exiting?)

Obviously lazy IO can only be used with IO operations that are considered "safe" by the programmer (usually read operations), but for those operations, when the programmer declares himself not to care when the reading is actually done, lazy IO is a beautiful thing. In particular, it allows the writing of modular reusable functions. That's actually a Good Thing... and as long as write operations are the only ones that require cleanup, it's also perfectly safe.
Trivial programs may be perfectly content to simply bail out if an exception is thrown. That's very sensible behaviour for a small 'pluggable' application (most of the various unix command line utilities all bail out silently or nearly silently on SIGPIPE, for example). However this is not acceptable behaviour in a complex program, clearly. There may be resources which need to be released, there may be data which needs saving, there may be reconstruction to be attempted on whatever it was that 'broke'.
Error handling and recovery is hard. Always has been. One of the things that simplifies such issues is knowing "where" exceptions can occur. It greatly simplifies them. In haskell they can only occur in the IO monad, and they can only occur in rather specific ways: in most cases, thrown by particular IO primitives; they can also be thrown 'To' you by other threads, but as the programmer, that's your problem!.
This is irrelevant to the question of lazy IO or not lazy IO. As you say, all errors happen in the IO monad, and that's true with or without lazy IO, since ultimately IO is the only consumer of lazy data. Proper use of bracket catches all errors (modulo bugs in bracket, and signals being thrown... but certainly all calls to error), and you can do that at the top level, if you like.

The downside in error checking when using lazy IO is just that the part of your program where errors pop up becomes less deterministic. However, since errors can happen at any time even without lazy IO, this is only a question of the probability of errors showing up at certain times (think out-of-memory conditions, signals thrown, etc). Well-designed programs will be written robustly. (Yes, that's a truism, but it's one you seem to be forgetting.)
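[A minimal sketch of the bracket point, using a file rather than a socket; lazyFileLength is invented for illustration. As long as the lazy data is fully forced inside the bracket, the cleanup runs exactly once and any exception raised by the deferred reads surfaces here, in IO.]

    import Control.Exception (bracket, evaluate)
    import System.IO (IOMode(ReadMode), hClose, openFile)
    import qualified Data.ByteString.Lazy as L

    lazyFileLength :: FilePath -> IO Integer
    lazyFileLength path =
        bracket (openFile path ReadMode) hClose $ \h -> do
            contents <- L.hGetContents h                  -- lazy, deferred read
            evaluate (fromIntegral (L.length contents))   -- force before hClose runs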
Ok. Five paragraphs of advocacy is plenty. If anyone is still reading now, then they must be either really interested in this problem, or really bored. Either way, it's good to have you with me! These issues are explained rather more elegantly by Oleg in [1]. ... Given these three pairs of options, what need is there for an unsafe lazy GET? What niche does it fill that is not equally well filled by one of these?
Program conciseness, perhaps. The kind of haskell one-liner whose performance makes us so (justly) proud. In isolation, though, I don't find that a convincing argument; not with the disadvantages taken also into account. The strongest argument, then, is that you have a 'stream processing' function, written 'naively' on [Word8] or lazy ByteString, which wants to run as data is available, yet without wasting space. I'm inclined to feel that, if you really want to be able to run over 650M files, and you want to run as data is available, then in practice you want to be able to give feedback to the rest of your application on your progress so far; i.e., L.ByteString -> a is actually too simple a type anyway.
Yes, this is the argument for lazy IO, and it's a valid one. Any adequately powerful interface can be used to implement a lazy IO function, and people will do so, whether or not it makes you happy. It'd be nice to have it in the library itself.

Program conciseness is a real issue. Simple effective APIs make for useful libraries, and the simplest API is likely to be the most commonly used. If the simplest API is strict, then that means there'll most often be *no* feedback until the download is complete. A lazy download means that feedback can be provided instantly, as the data is consumed. True, you need to include some feedback logic in your consumer, but that's where you'll almost certainly want it anyhow. And in many cases the feedback could come for free, in the form of output.

--
David Roundy
Department of Physics
Oregon State University

The HAppS HTTP code basically delivers the first 64k and a handle to acquire the rest. The 99%-or-higher case is that the document fits in memory, so the 64k bound is fine. If you have something bigger, the user is going to have to decide how to handle that on a case-by-case basis. Note: chunked encoding means that there is no theoretical limit to how big an HTTP request or response may be.

-Alex-
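[A sketch of the shape Alex describes; the Body type and field names are invented for illustration, and HAppS's actual types differ.]

    import System.IO (Handle)
    import qualified Data.ByteString as BS

    -- The common case lives entirely in the strict first chunk; the Handle
    -- is consulted only in the rare case that the document exceeds 64k.
    data Body = Body
        { firstChunk :: BS.ByteString   -- at most 64k, read strictly up front
        , remainder  :: Handle          -- positioned just past firstChunk
        }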
participants (7): Alex Jacobson, Bulat Ziganshin, David Roundy, haskell@list.mightyreason.com, Jules Bean, Pete Kazmier, Thomas Schilling