Safer common subexpression elimination

Hi, although semantically it could, ghc does not do common subexpression elimination (CSE), at least not in every case. The canonical example where it would be bad is the function (with div to avoid numerical casting):
avg :: [Int] -> Int avg n = (sum [0..n] `div` length [0..n])
The reason why [0..n] is not shared is because then it would be kept in memory completely after the evaluation of sum, because length will still need it, possibly overflowing the memory. In the non-shared form, though, the garbage collector can clean up directly behind sum and length, and the program runs in constant space. Note that the shared expression would be very large compared to the thunks. Now consider the program:
avg' :: [Int] -> (Int, Int) avg' n = (sum [0..n] `div` length [0..n], length [0..n])
avg n = (sum [0..n] `div` l, l) where l = length [0..n] is safe, because the shared expression (an Int) is smaller than the
I’d say that CSE to thunks and the compiler can tell. So I wonder: * Is sharing values of type Int (and Bool and similar small values) always safe? * If so: does GHC already do that? * Would it be technically possible? * Is there an established theory that can tell, for a sharing candidate, whether it is safe to share it? Thanks, Joachim -- Joachim "nomeata" Breitner mail: mail@joachim-breitner.de | ICQ# 74513189 | GPG-Key: 4743206C JID: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/ Debian Developer: nomeata@debian.org

On Thu, Nov 25, 2010 at 11:32 AM, Joachim Breitner wrote: Hi, although semantically it could, ghc does not do common subexpression
elimination (CSE), at least not in every case. The canonical example
where it would be bad is the function (with div to avoid numerical
casting): avg :: [Int] -> Int
avg n = (sum [0..n] `div` length [0..n]) The reason why [0..n] is not shared is because then it would be kept in
memory completely after the evaluation of sum, because length will still
need it, possibly overflowing the memory. In the non-shared form,
though, the garbage collector can clean up directly behind sum and
length, and the program runs in constant space. Note that the shared expression would be very large compared to the
thunks. Now consider the program: avg' :: [Int] -> (Int, Int)
avg' n = (sum [0..n] `div` length [0..n], length [0..n]) avg n = (sum [0..n] `div` l, l)
where l = length [0..n]
is safe, because the shared expression (an Int) is smaller than the I’d say that CSE to
thunks and the compiler can tell. So I wonder:
* Is sharing values of type Int (and Bool and similar small values)
always safe?
* If so: does GHC already do that?
* Would it be technically possible?
* Is there an established theory that can tell, for a sharing
candidate, whether it is safe to share it? I'm just going to answer your last question. Jörgen Gustavsson and David Sands developed the theory of space improvement
for call-by-need. That can help you answer the other questions you have. But
that being said, it's fairly complicated stuff, and it's not a very easy to
use theory. I think it's inherent in the problem though, the
space behavior of lazy programs is just weird. If you're curious to read
about space improvement see papers [GS01] and [GS99] on the following page:
http://www.cse.chalmers.se/~dave/davewww.html
Cheers,
Josef

On Thu, Nov 25, 2010 at 2:32 AM, Joachim Breitner
So I wonder: * Is sharing values of type Int (and Bool and similar small values) always safe? * If so: does GHC already do that? * Would it be technically possible? * Is there an established theory that can tell, for a sharing candidate, whether it is safe to share it?
Jhc does this for simple types, which are unary constructors with a primitive argument (CPR types). I never actually ran tests to see how much it helps if any. John

Joachim Breitner schrieb:
Hi,
although semantically it could, ghc does not do common subexpression elimination (CSE), at least not in every case. The canonical example where it would be bad is the function (with div to avoid numerical casting):
avg :: [Int] -> Int avg n = (sum [0..n] `div` length [0..n])
The reason why [0..n] is not shared is because then it would be kept in memory completely after the evaluation of sum, because length will still need it, possibly overflowing the memory. In the non-shared form, though, the garbage collector can clean up directly behind sum and length, and the program runs in constant space.
Note that the shared expression would be very large compared to the thunks.
Now consider the program:
avg' :: [Int] -> (Int, Int) avg' n = (sum [0..n] `div` length [0..n], length [0..n])
It think this is not typecorrect, since 'n' denotes a list and is used as upper bound in [0..n]. Should it be
avg' ns = (sum ns `div` length ns, length ns)
?
avg n = (sum [0..n] `div` l, l) where l = length [0..n] is safe, because the shared expression (an Int) is smaller than the
I’d say that CSE to thunks and the compiler can tell.
If this is meant to be
avg n = (sum ns `div` l, l) where l = length ns
according to my substitution above, then 'ns' still has to be shared and stored completely. I think the general problem is more complicated than common subexpression elimination, because lazy evaluation sometimes forces an inefficient order of computation. Other examples are
viewR xs = (init xs, last xs)
and an elegant Haskell implementation of the Unix command 'wc' like
wc text = (length text, length (words text), length (lines text))

Hello Henning, Am Samstag, den 04.12.2010, 12:41 +0100 schrieb Henning Thielemann:
Joachim Breitner schrieb:
Now consider the program:
avg' :: [Int] -> (Int, Int) avg' n = (sum [0..n] `div` length [0..n], length [0..n])
It think this is not typecorrect, since 'n' denotes a list and is used as upper bound in [0..n]. Should it be
avg' ns = (sum ns `div` length ns, length ns)
?
Thanks for noting. The type annotation is wrong. In your case (which I had first), ns is shared and thus would not allow for constant-space calculation of sum and length and would fail to demonstrate my point.
avg n = (sum [0..n] `div` l, l) where l = length [0..n] is safe, because the shared expression (an Int) is smaller than the
I’d say that CSE to thunks and the compiler can tell.
If this is meant to be
avg n = (sum ns `div` l, l) where l = length ns
according to my substitution above, then 'ns' still has to be shared and stored completely.
Right, sorry for the confusion by the wrong type. Greetings, Joachim -- Joachim Breitner e-Mail: mail@joachim-breitner.de Homepage: http://www.joachim-breitner.de ICQ#: 74513189 Jabber-ID: nomeata@joachim-breitner.de

Joachim Breitner schrieb:
Am Samstag, den 04.12.2010, 12:41 +0100 schrieb Henning Thielemann:
Joachim Breitner schrieb:
Now consider the program:
avg' :: [Int] -> (Int, Int) avg' n = (sum [0..n] `div` length [0..n], length [0..n]) It think this is not typecorrect, since 'n' denotes a list and is used as upper bound in [0..n]. Should it be
avg' ns = (sum ns `div` length ns, length ns) ?
Thanks for noting. The type annotation is wrong. In your case (which I had first), ns is shared and thus would not allow for constant-space calculation of sum and length and would fail to demonstrate my point.
I see. It may be that GHC actually eliminates common subexpression of primitive types. Its Core output will tell you. Nonetheless, maybe it is more sensible for a compiler not to answer the question whether a common subexpression should be shared or not, but it may try to evaluate the depending expressions in a way such that the shared data structure can be garbage collected as computation goes on.
participants (4)
-
Henning Thielemann
-
Joachim Breitner
-
John Meacham
-
Josef Svenningsson