Re: [Haskell-cafe] compare iteratee with python's yield

29 Jul 2011

      Your question is insightful. The posted answers are
excellent. I merely wish to illustrate one, already made, point on a
concrete example.

The complete code is in the file
	http://okmij.org/ftp/Haskell/Iteratee/IterDemo.hs

The file implements a series of progressively more complex examples
-- from wc to grep, -- stressing compositionality. The whole program
is assembled from independent building blocks; as our task changes, we
replace one building block with another. The building blocks,
iteratees, may be stateful. We never have to worry about writing out the
whole state of the program and properly initializing it. The overall
state is implicitly composed from the state of each building block; 
each iteratee manages its own state without leaking it.

The particular example to illustrate the state encapsulation is
grep1_word
...
grep1_word word fname =
  let search = IterateeM.dropWhile (/= word) >> IterateeM.head 
      runI'  = runI . en_handle (\_ -> "not found:" ++ word) in
  print =<< run =<< enum_file fname 
    (runI =<< en_lines 
     ((runI' =<< en_words search)
      `en_pair` (count_i `en_pair` en_last)))
which implements "grep -w -n -m 1 <word> <filename>". It searches for
the first occurrence of the given word in the file and returns the
line of that occurrence and the line number.  (The example also
illustrates exception handling: The exception is raised by
Iteratee.head when IterateeM.dropWhile dropped the whole stream while
looking for the given word.) A few comments: en_lines is the
enumeratee, transforming a stream of characters into a stream of
lines; en_words is the enumeratee that transforms the stream of lines
to the stream of words. The stateful iteratee count_i counts the
elements in any stream; en_last, the analogue of List.last, returns
the last seen element. The combinator en_pair pairs two iteratees to
run in parallel, processing the same stream chunk-by-chunk. The
processing stops as soon as one of the iteratees in the pair
stopped. That behavior does the trick: "en_words search" stops when
the word is found. The whole pair is stopped too. The other component
of the pair, (count_i `en_pair` en_last), will receive EOF and tell us
the line and the line count they have last seen.

I guess the example can be written in Python-like style as follows
(simplifying and assuming that the file is already parsed into lines):

counter = 0
for line in file_lines:
  counter = counter + 1
  if contains line word:
     print counter
     print line
     break

Suppose we wish to change our example to print not only the found
line, but a few lines before it (the option "-B" of grep). We have to
modify our Python-like code:
 -- add a piece of state to keep the context lines,
    and initialize it
 -- add code to update the state as we read lines
 -- add the finalization code, to convert the state to the desired
    result

The code will look like the following (### mark the changes):

counter = 0
lines_seen = []  ###
for line in file_lines:
  counter = counter + 1
  accumulate lines_seen line ###
  if contains line word:
     print counter
     print (finalize lines_seen) ###
     break

In contrast, to modify the iteratee code for the new task, we make the
_single_ change: replace en_last with (en_lastN 5) (for 5 lines of
context).

It should be stressed that the Python code has to be changed in three
*distinct, disconnected places*. New state has to be defined and
initialized, before the loop. The state has to be updated, somewhere
in the loop. The state has to be finalized, in the loop termination
part. When changing code, it is easy to overlook the needed
changes. For example, it is easier to accumulate the context lines in
reverse order. When printing the lines, we should not forget to
reverse them. When changes are disconnected, it is hard to reason and
test for them. With so many other things going on, it is hard to write
a proper test (for example, when testing the context accumulation we
don't care about word matching; we should test the accumulation in
isolation of whatever else we are doing in the iteration).

The iteratee en_lastN deals only with the context accumulation aspect:
...
-- Return N last received elements
-- The acc below is the list of N last elements in the reverse
-- order
en_lastN :: Monad m => Int -> Iteratee el m [el]
en_lastN n = ie_cont $ step []
 where
 step acc (Chunk [])  = ie_contM (step acc) 
 step acc (Chunk ls)  = ie_contM (step $! Prelude.take n $ reverse ls ++ acc)
 step acc stream      = ie_doneM (reverse acc) stream
The initialization of the state, the accumulation and the finalization are all
in the same place. The en_lastN iteratee does not care of counting,
searching, or whatever else is going on. In fact, the iteratee is pure
and deals with arbitrary elements (not necessarily lines). We can test
it using Quickcheck or HUnit. No IO is needed for the test.

Thus in the iteratee style, the overall state of the program becomes
distributed, across iteratees that encapsulate and manage only their
own, small portion of the program state.

The code IterDemo.hs shows other examples, including emulating
generators (aka, unfold_stream, or the `while loop'), to build the
stream of intermediate results of an iteratee. An iteratee in question
does not even suspect that it is being used `incrementally',
seduced to reveal the intermediate results.

Thank you for the inspiring question.

oleg＠okmij.org

tags

participants (1)