On Thu, Nov 25, 2010 at 11:32 AM, Joachim Breitner <mail@joachim-breitner.de> wrote:
Hi,

although semantically it could, ghc does not do common subexpression
elimination (CSE), at least not in every case. The canonical example
where it would be bad is the function (with div to avoid numerical
casting):

> avg :: [Int] -> Int
> avg n = (sum [0..n] `div` length [0..n])

The reason why [0..n] is not shared is because then it would be kept in
memory completely after the evaluation of sum, because length will still
need it, possibly overflowing the memory. In the non-shared form,
though, the garbage collector can clean up directly behind sum and
length, and the program runs in constant space.

Note that the shared expression would be very large compared to the
thunks.

Now consider the program:

> avg' :: [Int] -> (Int, Int)
> avg' n = (sum [0..n] `div` length [0..n], length [0..n])

I’d say that CSE to
> avg n = (sum [0..n] `div` l, l)
>   where l = length [0..n]
is safe, because the shared expression (an Int) is smaller than the
thunks and the compiler can tell.

So I wonder:
 * Is sharing values of type Int (and Bool and similar small values)
always safe?
 * If so: does GHC already do that?
 * Would it be technically possible?
 * Is there an established theory that can tell, for a sharing
candidate, whether it is safe to share it?

I'm just going to answer your last question.

Jörgen Gustavsson and David Sands developed the theory of space improvement for call-by-need. That can help you answer the other questions you have. But that being said, it's fairly complicated stuff, and it's not a very easy to use theory. I think it's inherent in the problem though, the space behavior of lazy programs is just weird. If you're curious to read about space improvement see papers [GS01] and [GS99] on the following page:
http://www.cse.chalmers.se/~dave/davewww.html

Cheers,

Josef