Re: [Haskell-cafe] Patterns for processing large but finite streams

1 Jul 2011

      ...
From: Eugene Kirpichov 
Subject: [Haskell-cafe] Patterns for processing large but finite
       streams
To: Haskell Cafe 
Message-ID: 
Content-Type: text/plain; charset=ISO-8859-1
Hi,
I'm rewriting timeplot to avoid holding the whole input in memory, and
naturally a problem arises:
How to represent large but finite streams and functions that process
them, returning other streams or some kinds of aggregate values?
Examples:
* Adjacent differences of a stream of numbers
* Given a stream of numbers with times, split it into buckets by time
of given width and produce a stream of (bucket, 50%,75% and 90%
quantiles in this bucket)
* Sum a stream of numbers
Is this, perhaps, what comonads are for? Or iteratees?
This is exactly what iteratees are for.  Specifically, enumeratees are
stream transformers, iteratees are stream consumers, and enumerators are
stream producers.  Consider adjacent differences:

Given the stream [a, b, c, d, e ....], you want to produce [b-a, c-b, d-c,
e-d, ...]

This is a stream transformer, so you need an enumeratee.  Using iteratee,
there are at least two obvious ways to produce it:

1)  High-level, but probably not as good performance. Use
Data.Iteratee.ListLike.roll
...
import Data.Iteratee as I
import Control.Applicative
diff [x,y] = y-x
diff [x]    = 0
diffs = convStream (map diff <$> roll 2 1)
2) somewhat explicit, probably better performance
...
import qualified Data.ListLike as LL
e' iter = do
  h <- I.head
  unfoldConvStream f h iter
where
 f lastEl = do
   c <- getChunk
   if LL.null c
     then return (lastEl, LL.empty)
     else do
       let h = LL.head c
            t = LL.tail c
       return (LL.last c, LL.cons (h-lastEl) (LL.zipWith (-) t (LL.init
c)))
either of these can be run by using an enumerator:

*Main> enumPure1Chunk [1..10] (joinI $ e stream2list) >>= run
[1,1,1,1,1,1,1,1,1,0]
*Main> let e'2 = e' :: Enumeratee [Int] [Int] IO a
*Main> enumPure1Chunk [1..10] (joinI $ e'2 stream2list) >>= run
[1,1,1,1,1,1,1,1,1]

I should optimize 'roll', it wouldn't be hard.

Summing is easy; iteratee has "Data.Iteratee.ListLike.sum" built-in, but you
could also use a fold.

enumPure1Chunk is only useful for small amounts of data, but iteratee
packages provide enumerators over files, handles, etc., as well as
mechanisms by which you can create your own enumerators.

The e' enumeratee is really just a model; I'd probably write one specific to
whichever type of stream I wanted to work with.  This one assumes a cheap
'cons', for example.

For producing a stream of buckets, if the times are ordered it would be
simple to do with Data.Iteratee.ListLike.breakE.  If the times aren't
ordered, I would probably use 'group' instead to collect a set number of
samples.

In my view, the biggest difference between iteratee and enumerator is the
stream abstraction.  Iteratee provides
...
I.Iteratee s m a
where 's' is the stream type, e.g. [Int], String, ByteString, Vector Word8,
etc.  Although the library only processes a chunk at a time (where a chunk
is a subsection of the stream), the type is that of the whole stream.

Enumerator instead provides
...
E.Iteratee s m a
here, 's' is the type of the chunk.  Enumerator treats the stream as having
type [s].

The implementations are different too, but ideally that would be hidden from
most users.  Although the iteratee implementation allows you to use >>= and
$, whereas enumerator sometimes requires you to use >>== and $$.

I think IterIO mostly follows the same type as Enumerator, although the
implementation is again quite different.

John

John Lato

tags

participants (1)