Hi Cafe,
In the last couple of days I completed my quest of making my graphing utility timeplot ( http://jkff.info/software/timeplotters ) not load the whole input dataset into memory, so it can now deal with datasets of any size, provided that the amount of data to *draw* is not too large. Along the way it also got a huge speedup: previously, visualizing a cluster activity dataset with a million events took around 15 minutes and a gigabyte of memory; now it takes 20 seconds and 6 MB maximum residency.
(I haven't yet uploaded it to Hackage, as I want to give it a bit more testing.)
The refactoring involved a number of interesting programming patterns that I'd like to share with you and ask for feedback on - perhaps something can be simplified.
Strictness is extremely important here - the last memory leak I eliminated was caused by missing bang patterns in teeSummary.
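For context, here's roughly what this looks like (a simplified sketch, not the exact code from timeplot):

{-# LANGUAGE BangPatterns #-}

-- Simplified closure-based summary type (more on this below).
data StreamSummary a r = StreamSummary { insert :: a -> StreamSummary a r, finalize :: r }

-- Feed every item to two summaries at once and pair up their results.
-- The bang patterns are essential: without them, each step builds up a
-- thunk of unapplied inserts, and the whole chain is only forced at
-- finalize time - a textbook space leak.
teeSummary :: StreamSummary a r1 -> StreamSummary a r2 -> StreamSummary a (r1, r2)
teeSummary !s1 !s2 = StreamSummary
  { insert   = \a -> teeSummary (insert s1 a) (insert s2 a)
  , finalize = (finalize s1, finalize s2)
  }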
There's an interesting function, statefulSummary, that shows how closures let you achieve encapsulation over an unknown piece of state. Curiously enough, you can't define StreamSummary a r as StreamSummary { init :: s, insert :: a -> s -> s, finalize :: s -> r } (existentially quantified over s), because then you can't define summaryByKey - you don't know what type to store in the map of per-key states.
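Roughly, the trick is this (again a simplified sketch; the real statefulSummary is along these lines):

{-# LANGUAGE BangPatterns #-}
import qualified Data.Map as M

data StreamSummary a r = StreamSummary { insert :: a -> StreamSummary a r, finalize :: r }

-- The state type 's' never escapes: it is captured by the closures,
-- so summaries with different internal state all have the same type.
statefulSummary :: s -> (a -> s -> s) -> (s -> r) -> StreamSummary a r
statefulSummary s0 ins fin = go s0
  where go !s = StreamSummary { insert = \a -> go (ins a s), finalize = fin s }

-- ...which is exactly what makes this definable: every per-key summary
-- has the single concrete type 'StreamSummary a r' and fits in one Map.
summaryByKey :: Ord k => StreamSummary a r -> StreamSummary (k, a) (M.Map k r)
summaryByKey fresh = go M.empty
  where
    go !m = StreamSummary
      { insert   = \(k, a) -> let s = M.findWithDefault fresh k m
                              in go (M.insert k (insert s a) m)
      , finalize = M.map finalize m
      }

E.g. statefulSummary (0 :: Int) (\_ n -> n + 1) show is a summary that counts its input, and summaryByKey of it counts items per key - without the state type ever appearing in the summary's type.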
There are also a few interesting functions in that file - e.g. edges2eventsSummary, which applies a summary over a stream of "long" events to a stream of rise/fall edges.
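Simplified to a single unnamed track, the shape is something like this (the Edge/Event types here are made up for illustration; the real function is more involved):

{-# LANGUAGE BangPatterns #-}

data StreamSummary a r = StreamSummary { insert :: a -> StreamSummary a r, finalize :: r }

data Edge  = Rise Double | Fall Double   -- timestamped rise/fall edges (hypothetical)
data Event = Event Double Double         -- start and end time of a completed event

-- Turn a consumer of completed events into a consumer of raw edges:
-- a Rise opens an event, the matching Fall closes it and feeds it on.
edges2eventsSummary :: StreamSummary Event r -> StreamSummary Edge r
edges2eventsSummary = go Nothing
  where
    go !open s = StreamSummary
      { insert = \edge -> case (open, edge) of
          (Nothing, Rise t0) -> go (Just t0) s                        -- an event opens
          (Just t0, Fall t1) -> go Nothing (insert s (Event t0 t1))   -- it closes: feed downstream
          _                  -> go open s                             -- ignore mismatched edges
      , finalize = finalize s
      }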
This means that you can define a "stream transformer" (Stream a -> Stream b) as a function (StreamSummary b -> StreamSummary a), which can be much easier to write. I have to think more about this idea.
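For instance, an ordinary stream function like "map f" becomes a transformer of consumers (sketch):

data StreamSummary a r = StreamSummary { insert :: a -> StreamSummary a r, finalize :: r }

-- To map over the a-stream, hand each 'f a' to a consumer of b's:
-- the transformer runs "backwards", from b-consumers to a-consumers.
mapSummary :: (a -> b) -> StreamSummary b r -> StreamSummary a r
mapSummary f s = StreamSummary
  { insert   = \a -> mapSummary f (insert s (f a))
  , finalize = finalize s
  }

edges2eventsSummary above is exactly such a transformer: it expresses Stream Edge -> Stream Event as StreamSummary Event r -> StreamSummary Edge r.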