Takusen and strictness, and perils of getContents

Takusen permits on-demand processing at three different levels. It is specifically designed for database processing in bounded memory, with predictable resource utilization and no resource leaks.

But first, about getContents. It was suggested a while ago that getContents should be renamed to unsafeGetContents. I strongly support that suggestion. I believe getContents should be used sparingly (I personally have never used it), that it cannot give precise resource guarantees, and that it is the wrong model for database interfaces.

I will not dwell on the fact that getContents permits IO to occur while evaluating pure code -- which is just wrong. There is a practical consequence of this supposedly theoretical impurity: error handling. As the manual states, ``A semi-closed handle becomes closed: ... if an I/O error occurs when reading an item from the handle; or once the entire contents of the handle has been read.'' That is, it is not possible to tell whether all the data from the channel have been read or an I/O error has interfered, nor is it possible to find out any details about that I/O error. That alone disqualifies getContents from any serious use. Even more egregious is resource handling, and that business with semi-closed handles, which is a resource leak.

Interfacing with a database requires managing many resources: the database connection, prepared statement handle, statement handle, result set, database cursor, transaction, input buffers. Takusen was specifically designed to be able to tell exactly when a resource is no longer needed and can be _safely_ disposed of. That guarantee is not available with getContents: the resources associated with the handle are disposed of when the consumer of getContents is finished with it. Since the consumer may be pure code, it is impossible to tell when the evaluation finishes. It may happen in a totally different part of the code.
To get more predictability, we have to add seq and deepSeq -- thus defeating the laziness we supposedly gained with getContents, and hoping that two wrongs somehow make a right.

Regarding Takusen: it is designed for incremental processing of database data, on three levels:

-- Unless the programmer has said that the query will yield a small amount of data, we do not ask the database for the whole result set at once. We ask it to deliver data in increments of 10 or 100 rows (the programmer may tune the amount). The retrieved chunk is placed into pre-allocated buffers.

-- The retrieved chunk is given to an iteratee one row at a time. The iteratee may at any point declare that it has had enough. Processing immediately stops, no further chunks are retrieved, and all resources of the query are disposed of.

-- Alternatively, Takusen offers a cursor-based interface, with getNext and getCurrent methods. The rows are retrieved on demand, in chunks. The interface is designed to restrict operations on a cursor to a region of code. Once the region is exited (normally or by exception), all associated resources are disposed of, because they are statically guaranteed to be unavailable outside the region.

Because the moments of resource allocation and deallocation are so precisely known, Takusen can take care of all of it. The programmer never has to worry about resource leaks, deallocations, etc.

A bit of experience: I have implemented a web application server in Haskell, using Takusen as a back end. The server runs as a FastCGI dynamic server, retrieving a chunk of rows from the database, formatting the rows (e.g., in XML), sending them up the FastCGI interface and ultimately to the client, then coming back for the next chunk. The advantages of such stream-wise processing are low latency, low memory consumption, and the client limiting the database retrieval rate.
Typical requests routinely ask for thousands of database rows; the server runs continuously, serving hundreds of requests in constant memory. The executable is 2.6 MB in size (GHC 6.4.2); the running process has a VmSize of 6608 kB, including a VmRSS of 3596 kB and VmData of 1412 kB. The code contains not a single unsafePerformIO, and (aside from some S-expression parsing code I inherited) not a single strictness annotation. The line count (including comments) is 7500 lines in 30 files.
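The iteratee level described above can be sketched, in miniature, as a pure left fold with early termination. The names here (Step, foldRows, firstN) are invented for illustration, and a list stands in for a live result set; Takusen's real iteratees run in a monad against the database. The point is that the consumer, not the producer, decides when to stop:

```haskell
-- Hypothetical miniature of the iteratee idea: the fold function
-- returns Continue (give me the next row) or Done (I have had
-- enough; stop fetching and free the query's resources).
data Step s = Continue s | Done s

-- Fold over rows, honouring the iteratee's early-termination signal.
foldRows :: (s -> row -> Step s) -> s -> [row] -> s
foldRows _ s []       = s
foldRows f s (r : rs) = case f s r of
  Continue s' -> foldRows f s' rs
  Done s'     -> s'   -- stop here: no further rows are demanded

-- An iteratee that accumulates the first n rows, then stops.
firstN :: Int -> ([a], Int) -> a -> Step ([a], Int)
firstN n (acc, k) r
  | k + 1 >= n = Done     (r : acc, k + 1)
  | otherwise  = Continue (r : acc, k + 1)

main :: IO ()
main = do
  -- The "result set" is infinite; early termination keeps this finite.
  let (rows, n) = foldRows (firstN 3) ([], 0) [1 :: Int ..]
  print (reverse rows, n)   -- prints ([1,2,3],3)
```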

oleg@pobox.com wrote:
Takusen permits on-demand processing on three different levels. It is specifically designed for database processing in bounded memory with predictable resource utilization and no resource leaks.
But first, about getContents. It has been mentioned a while ago that getContents should be renamed to unsafeGetContents. I strongly support that suggestion. I believe getContents should be used sparingly (I personally never used it), and I believe it cannot give precise resource guarantees and is a wrong model for database interfaces.
I will not dwell on the fact that getContents permits IO to occur while evaluating pure code -- which is just wrong. There is a practical consequence of this supposedly theoretical impurity: error handling. As the manual states ``A semi-closed handle becomes closed: ... if an I/O error occurs when reading an item from the handle; or once the entire contents of the handle has been read.'' That is, it is not possible to tell if all the data from the channel have been read or an I/O error has interfered. It is not possible to find out any details about that I/O error. That alone disqualifies getContents from any serious use. Even more egregious is resource handling and that business with semi-closed handles, which is a resource leak.
All of which constitutes the "lazy I/O considered harmful" folklore, which really should be written up somewhere. Anyway, I just wanted to point out that nowadays we have the option of using imprecise exceptions to report errors in lazy I/O. The standard I/O library doesn't do this at the moment (I think it would be good to have a discussion about whether it should sometime), but Data.ByteString's lazy I/O does report errors using exceptions. Cheers, Simon

Simon Marlow wrote:
Anyway, I just wanted to point out that nowadays we have the option of using imprecise exceptions to report errors in lazy I/O.
Is this really a solution? Currently, getContents reports no errors but does perfect error recovery: the result of the computation prior to the error is preserved and reported to the caller. Imprecise exceptions give us error reporting -- but no error recovery. All previously computed results are lost.

Here's a typical scenario:

    do l <- getContents
       return (map process l)

If an error occurs reading the lazy input, we'd like to log the error and assume the input is terminated with EOF. If getContents raises an imprecise exception, what do we do?

    return (map process (catch l (\e -> syslog e >> return [])))

That of course won't work: l is a value rather than an effectful computation; besides, catch can't occur in pure code to start with.

What we'd like are _resumable_ exceptions. The exception handler receives not only the exception indicator but also the continuation where the exception occurred. Invoking this continuation means error recovery. Resumable exceptions are used extensively in Common Lisp; they are also available in OCaml. So, hypothetically, we could write

    do l <- getContents
       resume_catch (return (map process l))
                    (\e k -> syslog e >> k [])

Besides the obvious typing problem, this won't work for the reason that exceptions raised in pure code are _imprecise_ -- that is, no precise continuation is available, even in principle.
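The loss of partial results is easy to demonstrate without any real I/O. In this sketch (the list and names are invented for illustration) an error thunk stands in for an I/O error surfacing mid-stream from lazy input; catching the imprecise exception tells us something went wrong, but the work done before the error is gone:

```haskell
import Control.Exception (ErrorCall, evaluate, try)

-- A lazy stream whose fourth element "fails", standing in for an
-- I/O error embedded in lazily read input.
broken :: [Int]
broken = [1, 2, 3, error "simulated read error"]

main :: IO ()
main = do
  r <- try (evaluate (sum broken)) :: IO (Either ErrorCall Int)
  case r of
    -- We learn of the error, but the partial sum 1+2+3 is lost:
    Left e  -> putStrLn ("exception; partial results lost: " ++ show e)
    Right n -> print n
```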

oleg@pobox.com wrote:
Simon Marlow wrote:
Anyway, I just wanted to point out that nowadays we have the option of using imprecise exceptions to report errors in lazy I/O.
Is this really a solution? Currently, getContents reports no errors but does perfect error recovery: the result of the computation prior to the error is preserved and reported to the caller. Imprecise exceptions give us error reporting -- but no error recovery. All previously computed results are lost.

Here's a typical scenario:

    do l <- getContents
       return (map process l)

If an error occurs reading the lazy input, we'd like to log the error and assume the input is terminated with EOF. If getContents raises an imprecise exception, what do we do?

    return (map process (catch l (\e -> syslog e >> return [])))

That of course won't work: l is a value rather than an effectful computation; besides, catch can't occur in pure code to start with.

What we'd like are _resumable_ exceptions. The exception handler receives not only the exception indicator but also the continuation where the exception occurred. Invoking this continuation means error recovery. Resumable exceptions are used extensively in Common Lisp; they are also available in OCaml. So, hypothetically, we could write

    do l <- getContents
       resume_catch (return (map process l))
                    (\e k -> syslog e >> k [])
Besides the obvious typing problem, this won't work for the reason that exceptions raised in the pure code are _imprecise_ -- that is, no precise continuation is available, even in principle.
Yes, I think I agree with that. Resumable exceptions don't make any sense for pure code (I can certainly imagine implementing them though, and they make sense in the IO monad). But all is not lost: if an exception is raised during getContents for example, you still have the partial results: the list ends in an exception, and I can write a function that returns the non-exceptional portion (in IO, of course). Cheers, Simon
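Simon doesn't spell out that function; a sketch of what it might look like follows (the name takeWhileOK and the helper tryAny are invented here). It forces the lazy list cell by cell in IO, stopping at the first exception and returning the good prefix:

```haskell
import Control.Exception (SomeException, evaluate, try)

-- Catch any exception; a small helper to pin down try's type.
tryAny :: IO a -> IO (Either SomeException a)
tryAny = try

-- Return the longest exception-free prefix of a lazy list,
-- forcing spine and elements one step at a time in IO.
takeWhileOK :: [a] -> IO [a]
takeWhileOK xs = do
  cell <- tryAny (evaluate xs)        -- force the spine to WHNF
  case cell of
    Left _         -> return []
    Right []       -> return []
    Right (y : ys) -> do
      el <- tryAny (evaluate y)       -- force the element
      case el of
        Left _   -> return []
        Right y' -> (y' :) <$> takeWhileOK ys

main :: IO ()
main = do
  -- The tail itself "fails", as a list ending in an exception would:
  good <- takeWhileOK (1 : 2 : error "stream died" :: [Int])
  print good   -- prints [1,2]
```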

Is this really a solution? Currently, getContents reports no errors but does perfect error recovery: the result of the computation prior to the error is preserved and reported to the caller. Imprecise exceptions give us error reporting -- but no error recovery. All previously computed results are lost. Here's a typical scenario:

    do l <- getContents
       return (map process l)

If an error occurs reading the lazy input, we'd like to log the error and assume the input is terminated with EOF.
I was wondering: could you perhaps use the old Hood Observe trick to help you out? Hood wraps lazy sources in unsafePerformIO handlers that log things as they are demanded, then passes them on. You might be able to use the same trick to pass things on as they are demanded, while logging exceptions and replacing exceptional values with repaired or default ones?
What we'd like are _resumable_ exceptions. The exception handler receives not only the exception indicator but also the continuation where the exception has occurred. Invoking this continuation means error recovery. Resumable exceptions are used extensively in CL; they are also available in OCaml. So, hypothetically we could write

    do l <- getContents
       resume_catch (return (map process l))
                    (\e k -> syslog e >> k [])
Besides the obvious typing problem, this won't work for the reason that exceptions raised in the pure code are _imprecise_ -- that is, no precise continuation is available, even in principle.
Yes, I've often wondered why exceptions are not optionally resumable, with a handler that may decide to return or abort, depending on how seriously the protected code is likely to be affected by the exception in hand. That way, exception handling and normal processing would be better separated, and simple fault tolerance easier to achieve, whereas now the exception handler has to know how to restart the interrupted computation from scratch.

In terms of imprecise semantics, that might not be a problem, either: yes, you can't guarantee that the same exception will be raised if you repeat the experiment, but that is why the handler is in IO already. No matter what exception and continuation it receives, if the exception is non-fatal, it can log the problem and return, with a repaired value, to the pure code that raised the exception.

In terms of a stack-based implementation, it might be the simple difference of clearing the stack in the handler, rather than in the raiser, giving the handler the option to resume or abort the raiser. Or is that too naive?-)

Claus

Is this really a solution? Currently, getContents reports no errors but does perfect error recovery: the result of the computation prior to the error is preserved and reported to the caller. Imprecise exceptions give us error reporting -- but no error recovery. All previously computed results are lost. Here's a typical scenario:

    do l <- getContents
       return (map process l)

If an error occurs reading the lazy input, we'd like to log the error and assume the input is terminated with EOF.
I was wondering: could you perhaps use the old Hood Observe trick to help you out? Hood wraps lazy sources in unsafePerformIO handlers that log things as they are demanded, then passes them on. You might be able to use the same trick to pass things on as they are demanded, while logging exceptions and replacing exceptional values with repaired or default ones?
Attached is an implementation sketch, with an example problem (a lazily produced String that never gets its 10th element, and includes an error for every uppercase Char in the input). With the handlers commented out, we get things like this:

    $ (echo "Hi "; sleep 1; echo "There, World") | runHaskell ResumeCatch.hs
    ----------
    *** Exception: oops

while replacing str with a handled str (in the definition of safestr) gives:

    $ (echo "Hi "; sleep 1; echo "There, World") | runHaskell ResumeCatch.hs
    ----------
    ?I ?HERE
    ----------
    ?I ?HERE{- ResumeCatch.hs:30:8-57: Non-exhaustive patterns in function process -}
    all is well that ends well
    ----------

The usual caveats about unsafePerformIO apply, so perhaps you wouldn't want to use this in a database library..

Claus
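The attachment itself is not reproduced in this archive, but the core of the Hood-style trick can be sketched as follows (my names, element-level only, ignoring errors in the list spine; Claus's ResumeCatch.hs may well differ). Each element is wrapped in an unsafePerformIO handler that, when the element is demanded and turns out to be exceptional, logs the problem and substitutes a default value:

```haskell
import Control.Exception (SomeException, catch, evaluate)
import System.IO (hPutStrLn, stderr)
import System.IO.Unsafe (unsafePerformIO)

-- Hood-style wrapper: when an element is demanded, force it; if it
-- raises, log to stderr and hand back a default value instead.
-- The usual unsafePerformIO caveats apply, as noted above.
repairWith :: a -> [a] -> [a]
repairWith dflt = map repair
  where
    repair x = unsafePerformIO $
      evaluate x `catch` \e -> do
        hPutStrLn stderr ("repaired: " ++ show (e :: SomeException))
        return dflt

main :: IO ()
main = print (repairWith 0 [1, error "oops", 3 :: Int])
  -- prints [1,0,3] (with the "repaired: oops" log on stderr)
```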

the usual caveats about unsafePerformIO apply, so perhaps you wouldn't want to use this in a database library..
Indeed. This is quite problematic, from the practical point of view of making resources difficult to control (cf. another thread, on file handle leakage), to the theoretical point that side effects and a lazy evaluation strategy are a bad mix, severely limiting the equational theory and making the code hard to reason about. I do care about all of these issues; otherwise I would have programmed in C.

That reminds me of Simon Peyton Jones's POPL 2003 presentation, the retrospective on Haskell. He said that the fact that lazy evaluation and side effects are a poor match had kept the designers from adding all kinds of problematic hacks to the language. Laziness kept Haskell pure -- until the monad (notation) came along and showed how to do side effects in a principled way. If keeping the purity, and keeping unsolved problems open until a principled solution comes along, have worked so well in the past, why change now?

As to the original question
Is this really a solution? Currently, getContents reports no errors but does perfect error recovery: the result of the computation prior to the error is preserved and reported to the caller. Imprecise exceptions give us error reporting -- but no error recovery. All previously computed results are lost. Here's a typical scenario:

    do l <- getContents
       return (map process l)
a better (albeit still quite unsatisfactory) answer might be to change the interface of getContents so that it takes the handler as an argument:

    newGetContents :: (Exception -> IO String) -> IO String

The old getContents is equivalent to "newGetContents (const (return []))". If the handler needs to notify the rest of the program of an error, it may save the information from the exception in an IORef defined in an outer scope. If this looks like inversion of control, that's because it is...

Often the problem can be solved via a left-fold enumerator, like the one in Takusen. In the context of reading a file, such an enumerator is described in http://okmij.org/ftp/Haskell/misc.html#fold-stream One of the examples in that article was specifically reading only a few characters from a file. With the enumerator, we guarantee that file handles do not leak, that files are closed at precise and predictable moments, and that we never read the whole file into memory unless the programmer specifically wishes to.
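One possible reading of that proposal, sketched with today's SomeException and unsafeInterleaveIO (the implementation and the name newGetContentsFrom are my guesses, not anything Takusen or the report specifies): each lazily demanded piece of input is guarded, so a mid-stream I/O error reaches the supplied handler instead of silently truncating the stream.

```haskell
import Control.Exception (SomeException, catch)
import System.IO (Handle, IOMode (ReadMode), hClose, hGetChar,
                  hIsEOF, openFile, stdin)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical sketch of the proposed interface: lazy input where an
-- I/O error invokes the caller's handler, whose result ends the stream.
newGetContentsFrom :: Handle -> (SomeException -> IO String) -> IO String
newGetContentsFrom h handler = loop
  where
    loop = unsafeInterleaveIO $
      (do eof <- hIsEOF h
          if eof
            then return []
            else do c    <- hGetChar h
                    rest <- loop        -- lazy: returns immediately
                    return (c : rest))
      `catch` handler

-- The old behaviour is the handler that pretends the error was EOF:
oldGetContents :: IO String
oldGetContents = newGetContentsFrom stdin (const (return []))

main :: IO ()
main = do
  writeFile "newgc-demo.txt" "hello, lazy world"
  h <- openFile "newgc-demo.txt" ReadMode
  s <- newGetContentsFrom h (\e -> do putStrLn ("I/O error: " ++ show e)
                                      return [])
  putStrLn s        -- forces the whole stream before we close the handle
  hClose h
```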
participants (3)
-
Claus Reinke
-
oleg@pobox.com
-
Simon Marlow