Two-iteration optimisation (was: GHC Predictability)

We've had a big debate over
mean xs = foldl' (+) 0 xs / fromIntegral (length xs)
For anyone who didn't follow it, the problem is that "mean" needs to traverse its argument twice, so the entire list has to be held in memory. So if "xs = [1..1000000000]" then "mean xs" uses all your memory, but "foldl' (+) 0 xs" and "length xs" by themselves do not.

The solution is for the programmer to rewrite "mean" to accumulate a pair containing the running total and count together, then do the division. This makes me wonder: could there be a compiler optimisation rule for this, collapsing two iterations over a list into one? I've never tried writing GHC rules, but something like:
f (foldl' g a xs) (foldl' h b xs) = uncurry f (foldl' (\(gt, ht) i -> (g gt i, h ht i)) (a, b) xs)
The first problem that occurs to me is the number of permutations of fold and map functions. Paul.
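For concreteness, the hand-fused rewrite Paul describes might look like this (a sketch of mine, not code from the thread; this `mean` replaces the two-pass version above):

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- One traversal: accumulate a strict (total, count) pair, divide at the end.
mean :: [Double] -> Double
mean xs = s / fromIntegral n
  where
    (s, n) = foldl' step (0, 0 :: Int) xs
    -- Bang patterns force both components each step, so no thunks
    -- pile up inside the accumulator pair.
    step (!t, !c) x = (t + x, c + 1)
```

Because foldl' forces the pair to weak head normal form and the bang patterns force its components, this runs in constant space over a lazily generated list.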

paul:
We've had a big debate over
mean xs = foldl' (+) 0 xs / fromIntegral (length xs)
... could there be a compiler optimisation rule for this, collapsing two iterations over a list into one?
You'd want a general fusion framework for this, and other accumulators, that lets you treat 'f' as a zip of some kind that pulls items from its two arguments, and then require only a single rewrite rule for all your loop forms. Stream fusion (see "Stream Fusion: From Lists to Streams to Nothing at All") at least does this for zips:
zip (foldl .. ) (foldl .. )
will end up with a single loop, but for an arbitrary 'f' instead of zip it seems harder. -- Don

Don Stewart wrote:
You'd want a general fusion framework for this, and other accumulators, that lets you treat 'f' as a zip of some kind that pulls items from its two arguments, and then require only a single rewrite rule for all your loop forms.
Stream fusion (see "Stream Fusion: From Lists to Streams to Nothing at All") at least does this for zips:
zip (foldl .. ) (foldl .. )
will end up with a single loop, but for an arbitrary 'f' instead of zip it seems harder.
It seems the thing to do would be to expose a top-level function for explicitly specifying that this is what you want to do:

foo :: (a -> b -> c) -> ([x] -> a) -> ([x] -> b) -> [x] -> c

(Both accumulators consume the same list, so it takes a single list argument.) You could then say

mean = foo (/) sum length

which is pretty declarative, and gives the compiler an easy time with optimising. Of course, if nobody chooses to *use* this function... it won't help you any.
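A generic `foo` cannot magically fuse two opaque list consumers, but if the accumulators are handed over as (step, start) pairs, the single pass is easy to write. A hypothetical sketch (`foo2` and `mean2` are my names, not from the thread):

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Combine two left folds, described as (step, start) pairs,
-- into one strict traversal, then combine their results with f.
foo2 :: (a -> b -> c)
     -> (a -> x -> a, a)   -- first accumulator: step function and start value
     -> (b -> x -> b, b)   -- second accumulator: step function and start value
     -> [x] -> c
foo2 f (g, a0) (h, b0) xs = f a b
  where
    (a, b) = foldl' (\(!ga, !ha) i -> (g ga i, h ha i)) (a0, b0) xs

-- "mean" declared in this style: divide the running sum by the running count.
mean2 :: [Double] -> Double
mean2 = foo2 (/) ((+), 0) (\c _ -> c + 1, 0)
```

The design cost is exactly the one discussed in this thread: callers must expose their folds as step functions rather than as ordinary list functions like `sum` and `length`.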

Don Stewart wrote:
You'd want a general fusion framework for this... Stream fusion... at least does this for zips... but for an arbitrary 'f' instead of zip, seems harder.
And of course, you wouldn't want that:
f xs = xs : map expensiveCalculation xs
Please don't fuse those two loops into one.

In the case of "mean", the outer function in question is (/), and that is a good candidate for fusion because it is strict in both arguments. I think Don is right: there is a lot of room for expanding the scope of fusion, but it needs to be done one function at a time until our strictness analysis gets better.

In the general case, I like the way things work now. The original complaint about "unpredictability" is, in my opinion, a "bug in the documentation". This is yet another concept whose basics need to be made much more accessible to newbies. -Yitz

Yitzchak Gale wrote:
And of course, you wouldn't want that:
f xs = xs : map expensiveCalculation xs
Please don't fuse those two loops into one.
...doesn't type check. Did you mean (++)?
In the case of "mean", the outer function in question is /, and that is a good candidate for fusion because it is strict in both arguments.
I think Don is right. There is a lot of room for expanding the scope of fusion, but it needs to be done one function at a time until our strictness analysis gets better.
Mmm, strictness analysis. That sounds hard...
In the general case, I like the way things work now. The original complaint about "unpredictability" is, in my opinion, a "bug in the documentation". This is yet another concept whose basics need to be made much more accessible to newbies.
...which is really the point I was trying to get across, yes. :-)

Certainly there are experts [like Don] who seemingly have no trouble knocking out blindingly fast Haskell code. The problem is that only the hallowed experts can do this; it's currently rather inaccessible to beginners and intermediates.

For example, back when I really *was* a beginner, I found several fleeting references to "laziness can be bad sometimes", but I couldn't find anywhere that explains *why*. The concept that laziness could be something you *don't* want seemed highly counter-intuitive. [After all, laziness is this wonderful thing that stops you doing extra work you don't need to!] Indeed, it wasn't until I wrote a Haskell interpreter that I discovered the [main] problem - huge unevaluated expressions being needlessly dragged around.

The fact that laziness is bad sometimes is a pretty basic piece of information for anybody remotely serious about performance programming to be made aware of. I think the information is out there, it's just not terribly "findable" right now. I'm not 100% sure what the best way to improve this situation would be, but I think it involves somebody somewhere sitting down to write a single, coherent account from beginning to end explaining what the issues are, in language that people who aren't compiler designers can understand.

On Thu, May 15, 2008 at 8:45 PM, Andrew Coppin wrote:
Yitzchak Gale wrote:
And of course, you wouldn't want that:
f xs = xs : map expensiveCalculation xs
Please don't fuse those two loops into one.
...doesn't type check. Did you mean (++)?
Hmm? While the name might be misleading...
expensiveCalculation = (:[])
Luke

Luke Palmer wrote:
On Thu, May 15, 2008 at 8:45 PM, Andrew Coppin wrote:
Yitzchak Gale wrote:
And of course, you wouldn't want that:
f xs = xs : map expensiveCalculation xs
Please don't fuse those two loops into one.
...doesn't type check. Did you mean (++)?
Hmm? While the name might be misleading...
expensiveCalculation = (:[])
Ah. So f :: [x] -> [[x]]. Devious...
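Spelled out, Luke's trick does make the definition type check (a small demonstration of mine, not code from the thread):

```haskell
-- With this concrete type, the element type and the mapped type line up:
-- xs :: [a] and map expensiveCalculation xs :: [[a]], so consing xs onto
-- the mapped list is well typed.
expensiveCalculation :: a -> [a]
expensiveCalculation = (: [])

f :: [a] -> [[a]]
f xs = xs : map expensiveCalculation xs
```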

On Wed, 14 May 2008, Paul Johnson wrote:
We've had a big debate over
mean xs = foldl' (+) 0 xs / fromIntegral (length xs)
For anyone who didn't follow it, the problem is that "mean" needs to traverse its argument twice, so the entire list has to be held in memory. So if "xs = [1..1000000000]" then "mean xs" uses all your memory, but "foldl' (+) 0 xs" and "length xs" by themselves do not.
I also wondered how to solve that problem. It also occurs in implementations of "wc", that is counting lines, words and characters of a text in parallel.
The solution is for the programmer to rewrite "mean" to accumulate a pair containing the running total and count together, then do the division. This makes me wonder: could there be a compiler optimisation rule for this, collapsing two iterations over a list into one. I've never tried writing GHC rules, but something like:
f (foldl' g a xs) (foldl' h b xs) = uncurry f (foldl' (\(gt, ht) i -> (g gt i, h ht i)) (a, b) xs)
The left-hand sides of GHC rules must have a definite function as the top-level construct, whereas 'f' is a variable. You may define a function which points the optimizer to the places where something shall be replaced. Say

fuseFold :: (a -> b -> c) -> a -> b -> c
fuseFold f x y = f x y

{-# RULES
"fuseFold" forall f g h x y.
  fuseFold f (foldl' g x) (foldl' h x) = ...
  #-}

However, instead of writing library functions which use foldl', it might be better to implement and export the accumulating functions separately. They can be used with the 'foldl's of arbitrary data structures, and they can be fused explicitly without GHC rules. However, I suspect you cannot rewrite, say, (length (words str)) or (length (lines str)) in that manner in an elegant way.
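Henning's suggestion of exporting the accumulating functions separately, and fusing explicitly, might look like the following sketch (the `both` combinator and `meanB` are my names, an assumption rather than code from the thread):

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Run two accumulator step functions in lock-step over one traversal.
-- Bang patterns keep both running results evaluated.
both :: (a -> x -> a) -> (b -> x -> b) -> (a, b) -> x -> (a, b)
both g h (!ga, !ha) i = (g ga i, h ha i)

-- "mean" fused by hand from the exported step functions, no RULES needed.
meanB :: [Double] -> Double
meanB xs = s / n
  where (s, n) = foldl' (both (+) (\c _ -> c + 1)) (0, 0) xs
```

Because `both` only talks about step functions, the same combinator works with the fold of any container, not just lists.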
The first problem that occurs to me is the number of permutations of fold and map functions.
A foldl' could be merged with a preceding 'map' into a single foldl'; however, the existing fusion frameworks use different primitives, like foldr/build.
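Henning's "wc" example from earlier in this message can be hand-fused in the same style; a sketch of mine, counting lines, words and characters in a single foldl':

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Char (isSpace)
import Data.List (foldl')

-- One pass over the text; the Bool tracks whether we are between words,
-- so a word is counted at each space-to-non-space transition.
wc :: String -> (Int, Int, Int)   -- (lines, words, characters)
wc = done . foldl' step (0, 0, 0, True)
  where
    step (!l, !w, !c, !inSpace) ch =
      ( l + fromEnum (ch == '\n')
      , w + fromEnum (inSpace && not (isSpace ch))
      , c + 1
      , isSpace ch )
    done (l, w, c, _) = (l, w, c)
```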

Paul Johnson wrote:
The solution is for the programmer to rewrite "mean" to accumulate a pair containing the running total and count together, then do the division. This makes me wonder: could there be a compiler optimisation rule for this, collapsing two iterations over a list into one.
Do not forget: http://www.haskell.org/tmrwiki/WhyAttributeGrammarsMatter
Though not a compiler optimisation, it's a thinker optimisation.

I have another thought. It is now time to investigate the viability of

mean x = s `par` l `par` (s/l)
  where s = sum x
        l = length x

If you worry that the sum thread and the length thread are not synchronized and therefore there is still no bound on the list prefix kept in memory, I'm sure you can improve it by one of the chunking strategies.

As our hardware grows more cores but not more gigahertz, it may become detrimental to fuse. Fusion implies entangling, an antithesis to parallelism. One day we may call this an optimisation: unfuse code hand-fused by programmers, then parallelize. Optimisation people on the imperative side are already doing this (they have more hand-fused code than we do), though targeting older hardware like vector machines and SIMD, not yet multiple cores.

Albert Y. C. Lai wrote:
If you worry that the sum thread and the length thread are not synchronized and therefore there is still no bound on the list prefix kept in memory, I'm sure you can improve it by one of the chunking strategies.
I'm more worried that data now has to go through tiny CPU <--> RAM buses rather than zipping along at the CPU's core frequency as it would if only 1 CPU were involved. I would think the cache issues make the sequential version a fair bit faster. (Not to mention that no GC is required.)
As our hardware grows more cores but not more gigahertz, it may become detrimental to fuse. Fusion implies entangling, an antithesis to parallelism. One day we may call this an optimisation: unfuse code hand-fused by programmers, then parallelize. Optimisation people on the imperative side are already doing this (they have more hand-fused code than we do), though targeting older hardware like vector machines and SIMD, not yet multiple cores.
The key to parallel processing is splitting a task into *independent* tasks so you can do them in parallel. If they're still interdependent [like the above], I suspect you're not gaining much.

On Wed, May 14, 2008 at 06:08:28PM +0100, Paul Johnson wrote:
We've had a big debate over
mean xs = foldl' (+) 0 xs / fromIntegral (length xs)
... could there be a compiler optimisation rule for this, collapsing two iterations over a list into one? I've never tried writing GHC rules, but something like:
f (foldl' g a xs) (foldl' h b xs) = uncurry f (foldl' (\(gt, ht) i -> (g gt i, h ht i)) (a, b) xs)
There are two problems with this rule. The first is that the function is not strict in 'gt' and 'ht' -- this is easily fixed with a little bit of seq. The other problem is that 'f' must be strict in both parameters for this rule to hold. As far as I know, there is no access to strictness information in rule pragmas.

Cheers,
Spencer Janssen
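Spencer's first objection is mechanically fixable: the right-hand side of the rule would have to seq both components of the pair before rebuilding it. A sketch of such a strict fused fold (names are mine, not from the thread):

```haskell
import Data.List (foldl')

-- Fuse two left folds into one, forcing both running results each step
-- so the accumulator pair never holds unevaluated thunks.
fused :: (a -> x -> a) -> a -> (b -> x -> b) -> b -> [x] -> (a, b)
fused g a0 h b0 = foldl' step (a0, b0)
  where
    step (gt, ht) i =
      let gt' = g gt i
          ht' = h ht i
      in gt' `seq` ht' `seq` (gt', ht')
```

The second objection stands: applying the rule is only sound when the surrounding 'f' is strict in both arguments, and RULES pragmas have no way to demand that.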
participants (8)
- Albert Y. C. Lai
- Andrew Coppin
- Don Stewart
- Henning Thielemann
- Luke Palmer
- Paul Johnson
- Spencer Janssen
- Yitzchak Gale