
On Wed, 14 May 2008, Paul Johnson wrote:
We've had a big debate over
mean xs = foldl' (+) 0 xs / fromIntegral (length xs)
For anyone who didn't follow it, the problem is that "mean" needs to traverse its argument twice, so the entire list has to be held in memory. So if "xs = [1..1000000000]" then "mean xs" uses all your memory, but "foldl' (+) 0 xs" and "length xs" by themselves do not.
I also wondered how to solve that problem. It also occurs in implementations of "wc", that is counting lines, words and characters of a text in parallel.
The solution is for the programmer to rewrite "mean" to accumulate a pair containing the running total and count together, then do the division. This makes me wonder: could there be a compiler optimisation rule for this, collapsing two iterations over a list into one. I've never tried writing GHC rules, but something like:
f (foldl' g x) (foldl' h x) = (uncurry f) (foldl' (\i (gt,ht) -> (g i gt, h i ht)))
The left hand sides of GHC rules must have a definite function as top-level construct, whereas 'f' is a variable. You may define a function which points the optimizer to the places where something shall be replaced. Say fuseFold :: (a -> b -> c) -> a -> b -> c fuseFold f x y = f x y {-# RULES forall f g h x y. fuseFold f (foldl' g x) (foldl' h x) = ... #-} However, instead of writing library functions which use foldl', it might be better to implement and export the accumulating functions separately. They can be used with 'foldl's of arbitrary data structures and they can be fused explicitly without GHC rules. However, I suspect you cannot rewrite say (length (words str)) or (length (lines str)) in that manner in an elegant way.
The first problem that occurs to me is the number of permutations of fold and map functions.
foldl' could be merged with 'map' to a foldl', however the existing fusion frameworks use different primitives, like foldr/build.