
Hello,

I am not a super-geek (at least, not compared to others on this list), but I'll take a try at this anyway.

The benefits of iteratees mostly depend on the differences between lazy and strict IO (see ch. 7 of Real World Haskell for more on this). Lazy IO allows for some seriously cool tricks. You can write functions that process entire files at once, without explicitly writing loops or recursion, and the runtime system will process the data in chunks without attempting to load the whole file into memory. This allows for a very functional style of code, making it easy to abstract and compose operations.

This benefit comes at a high cost, though. Lazy IO can make it much more difficult to manage resources. For example, suppose you open a Handle and use hGetContents to read from that handle into a data structure, then close the Handle. Later on, when you try to use the data, you get an error message about the handle being closed. Because the data structure wasn't fully evaluated before you closed the handle, some of the IO necessary to fill it was never performed, and now that the handle is closed it's too late. You don't have to close handles explicitly (when the result is garbage collected the handle should be closed automatically), but if you're doing a lot of IO this means that open handles accumulate until they can be GC'd, using up a finite resource.

Lazy IO also often makes it difficult to reason about the behavior of your code. One problem is unintentionally holding onto values, preventing them from being GC'd. Here's an example of a different sort of problem. Suppose you're writing a file, and the format specifies that you first write the size of the data that follows, then the actual data. To obtain the size, you do something like the following:

process = do
    contents <- readFile "Infile.txt"
    writeFile "Outfile.txt" $ show $ length contents

You've now read the entire contents of Infile.txt into memory. If any code that follows uses the string in "contents", it won't be GC'd until after that code has run. If Infile.txt is large, this is a problem: you've lost one of the main benefits of lazy IO, namely that data is read only as necessary. Since the entire file is needed to calculate the length of the string, the whole thing is forced into memory. In general, lazy IO has no good solution if you run into this issue.

Strict IO avoids these problems, but only because the programmer has to manage the resources manually. Now the programmer is responsible for explicit recursion (typically in the IO monad), which means the programmer also needs to handle exceptions within that recursion. Buffering must be managed by hand as well. You've avoided the pitfalls of lazy IO, but at the cost of a lot more work. Most of what you write will also be much less composable, leading to a lot of functions that are slight variations on the same thing.

If you do strict IO long enough, you will soon want a more general solution. First you'll write a function that takes a function parameter and recursively applies it to data read from a file (the enumerator). You'll want to add exception handling as well. Next, you'll find that you frequently don't need to read all of a file, only part of it, so you'll modify your recursive function so that the function being repeatedly applied (the iteratee) can signal when it's finished, letting the enumerator stop processing the file at that point. After several more enhancements, you'll end up with what is essentially Oleg's iteratee code. At this point you have achieved the following:

1. Composability - both iteratees and enumerators can be composed. A number of basic iteratees are provided which can be combined by the user as necessary.

2. Layering - stream processing can be separated into layers, e.g. one layer that creates lines, another to create words, etc.
It is possible to make iteratees that operate on words, lines, or any other element that makes sense for the problem domain.

3. Efficiency - data is processed in chunks, and only as much IO is performed as is necessary to evaluate the result.

4. Safety - IO exceptions don't leak outside the enumerator, and the file handle is closed after IO exceptions.

5. Generality - many different types of resources can be read (handles, files, sockets, etc.), and they can easily be combined. One of Oleg's examples muxes input from two sockets into one stream, in a process that is transparent to the iteratee. Furthermore, this works with any monad, not just IO (although IO actions must be performed in IO).

6. Seeking - seeking is allowed for seekable resources.

So you've managed to achieve many of the benefits of lazy IO (composability, generality, and efficiency) without the drawbacks. Furthermore, resources are managed strictly and released as soon as possible.

I hope this is helpful.

Cheers,
John Lato
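P.S. To make the shape of the enumerator/iteratee idea concrete, here is a minimal sketch. The names (Iteratee, enumFile, countLines, firstLine) are illustrative only and do not match the actual API of Oleg's iteratee library; chunks here are whole lines for simplicity, where a real implementation would use sized buffers and carry errors in the stream.

```haskell
import System.IO
import Control.Exception (bracket)

-- An iteratee either asks for the next chunk (Nothing signals EOF)
-- or has finished with a result.
data Iteratee a = Continue (Maybe String -> Iteratee a)
                | Done a

-- Extract the final result, if the iteratee has finished.
run :: Iteratee a -> Maybe a
run (Done x) = Just x
run _        = Nothing

-- Feed one chunk; a finished iteratee ignores further input.
feed :: Iteratee a -> Maybe String -> Iteratee a
feed (Continue k) chunk = k chunk
feed done         _     = done

-- Count the lines fed in; yields the count when it sees EOF.
countLines :: Int -> Iteratee Int
countLines n = Continue step
  where
    step Nothing  = Done n
    step (Just _) = countLines (n + 1)

-- Keep only the first line, then signal Done so the enumerator
-- can stop reading immediately (early termination).
firstLine :: Iteratee String
firstLine = Continue step
  where
    step Nothing  = Done ""
    step (Just l) = Done l

-- The enumerator: opens the file, feeds it to the iteratee one line
-- at a time, and stops as soon as the iteratee is Done. bracket
-- guarantees the handle is closed promptly, even on an exception.
enumFile :: FilePath -> Iteratee a -> IO (Iteratee a)
enumFile path iter0 = bracket (openFile path ReadMode) hClose (go iter0)
  where
    go i@(Done _) _ = return i                 -- early termination
    go i h = do
      eof <- hIsEOF h
      if eof
        then return (feed i Nothing)           -- signal end of stream
        else do line <- hGetLine h
                go (feed i (Just line)) h
```

Running enumFile over firstLine reads only the first line of the file before bracket closes the handle, while countLines consumes the whole stream; the iteratee, not the enumerator, decides how much input is demanded, which is the early-termination and prompt-release behavior described above.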
Message: 12
Date: Mon, 02 Mar 2009 01:31:02 +0100
From: Günther Schmidt
Subject: [Haskell-cafe] Re: Left fold enumerator - a real pearl overlooked?
To: haskell-cafe@haskell.org

Hi everyone,
after reading all the responses I would like to ask someone, anyone, to kind of summarize the merits of the left-fold-enumerator approach.
From all that I have read about it so far, all I was able to gather is that it is significant, but I'm still not even sure what it is for and what it is not for.
Apparently Oleg has done a lot of CS work, this particular piece being just one part of it. But he broaches the topic at a very high level - ok, too high for me, i.e. someone with no CS or higher-math background.
Would one of the super geeks please summarize it? (In RWH style, if possible.)
Günther
John Lato schrieb:
Hi Don,
Would you please elaborate on which features or capabilities you think are missing from left-fold, and would elevate it out of the special-purpose category? The conception is so completely different from bytestrings that just saying it's not a bytestring equivalent doesn't give me any ideas as to what would make it more useful. Since the technique is still being actively developed and researched, IMO this is a good time to be making changes.
Incidentally, in my package I've made newtypes that read data into strict bytestrings. It would be relatively simple to use unsafeInterleaveIO in an enumerator to create lazy bytestrings using this technique. I don't see why anyone would want to do so, however, since it would have all the negatives of lazy IO and be less efficient than simply using lazy bytestrings directly.
Cheers, John
On Sat, Feb 28, 2009 at 10:54 PM, Don Stewart
wrote: There are a few iteratee/enumerator design questions that remain, which Oleg and others would like to explore more fully. The results of that research will likely find their way into this library.
I agree. There's no left-fold 'bytestring' equivalent. So it remains a special purpose technique.
-- Don