Re: [Haskell-cafe] Patterns for processing large but finite streams

From: Eugene Kirpichov
Subject: [Haskell-cafe] Patterns for processing large but finite streams To: Haskell Cafe Message-ID: Content-Type: text/plain; charset=ISO-8859-1 Hi,
I'm rewriting timeplot to avoid holding the whole input in memory, and naturally a problem arises:
How to represent large but finite streams and functions that process them, returning other streams or some kinds of aggregate values?
Examples: * Adjacent differences of a stream of numbers * Given a stream of numbers with times, split it into buckets by time of given width and produce a stream of (bucket, 50%,75% and 90% quantiles in this bucket) * Sum a stream of numbers
Is this, perhaps, what comonads are for? Or iteratees?
This is exactly what iteratees are for. Specifically, enumeratees are stream transformers, iteratees are stream consumers, and enumerators are stream producers. Consider adjacent differences: Given the stream [a, b, c, d, e ....], you want to produce [b-a, c-b, d-c, e-d, ...] This is a stream transformer, so you need an enumeratee. Using iteratee, there are at least two obvious ways to produce it: 1) High-level, but probably not as good performance. Use Data.Iteratee.ListLike.roll
import Data.Iteratee as I import Control.Applicative
diff [x,y] = y-x diff [x] = 0 diffs = convStream (map diff <$> roll 2 1)
2) somewhat explicit, probably better performance
import qualified Data.ListLike as LL e' iter = do h <- I.head unfoldConvStream f h iter where f lastEl = do c <- getChunk if LL.null c then return (lastEl, LL.empty) else do let h = LL.head c t = LL.tail c return (LL.last c, LL.cons (h-lastEl) (LL.zipWith (-) t (LL.init c)))
either of these can be run by using an enumerator: *Main> enumPure1Chunk [1..10] (joinI $ e stream2list) >>= run [1,1,1,1,1,1,1,1,1,0] *Main> let e'2 = e' :: Enumeratee [Int] [Int] IO a *Main> enumPure1Chunk [1..10] (joinI $ e'2 stream2list) >>= run [1,1,1,1,1,1,1,1,1] I should optimize 'roll', it wouldn't be hard. Summing is easy; iteratee has "Data.Iteratee.ListLike.sum" built-in, but you could also use a fold. enumPure1Chunk is only useful for small amounts of data, but iteratee packages provide enumerators over files, handles, etc., as well as mechanisms by which you can create your own enumerators. The e' enumeratee is really just a model; I'd probably write one specific to whichever type of stream I wanted to work with. This one assumes a cheap 'cons', for example. For producing a stream of buckets, if the times are ordered it would be simple to do with Data.Iteratee.ListLike.breakE. If the times aren't ordered, I would probably use 'group' instead to collect a set number of samples. In my view, the biggest difference between iteratee and enumerator is the stream abstraction. Iteratee provides
I.Iteratee s m a
where 's' is the stream type, e.g. [Int], String, ByteString, Vector Word8, etc. Although the library only processes a chunk at a time (where a chunk is a subsection of the stream), the type is that of the whole stream. Enumerator instead provides
E.Iteratee s m a
here, 's' is the type of the chunk. Enumerator treats the stream as having type [s]. The implementations are different too, but ideally that would be hidden from most users. Although the iteratee implementation allows you to use >>= and $, whereas enumerator sometimes requires you to use >>== and $$. I think IterIO mostly follows the same type as Enumerator, although the implementation is again quite different. John
participants (1)
-
John Lato