optimization and rewrite rules questions

In the recently buried haskell-cafe thread "speed: ghc vs gcc", Bulat pointed out some of the optimizations that GHC doesn't do, such as loop unrolling. I suggested a way of experimenting with loop unrolling, using Template Haskell to bypass GHC's blind spot (it usually doesn't unfold recursive definitions http://www.haskell.org/pipermail/glasgow-haskell-users/2007-July/012936.html , but if we unfold a loop combinator at compile time, GHC's normal optimizations can take over from there): http://www.haskell.org/pipermail/haskell-cafe/2009-February/056241.html

While this is fine as far as it goes (it should really be handled within GHC), and does offer some initial speedup, Bulat pointed out that GCC does further optimizations after unrolling, such as reassociating sums to expose potential for constant folding: http://www.haskell.org/pipermail/haskell-cafe/2009-February/056367.html (since the ghc -ddump-simpl output doesn't show this optimization, I assume that gcc handles it, and the "*ghc*" in that message is a typo, but I haven't checked - how would I do that, btw?).

In this case, the GHC optimizations following the loop unrolling leave a sum like this (note the repeated variable interspersed with constants):

(GHC.Prim.+# (GHC.Prim.+# ww_s1lN 3)
  (GHC.Prim.+# (GHC.Prim.+# ww_s1lN 2)
    (GHC.Prim.+# (GHC.Prim.+# ww_s1lN 1)
      (GHC.Prim.+# (GHC.Prim.+# ww_s1lN 0) ww_s1lR))))

which can be simplified (assuming associativity and commutativity of + here..) after sorting the variable references and constants into separate groups. We currently inherit such optimizations when using -fvia-C, even though GHC sometimes produces C code that GCC can't handle optimally. If I understand correctly, -fvia-C is on its way out - is that correct, and what plans are there for recovering the optimizations previously left to GCC?

The next thing I was looking at was rewrite rules, the obvious GHC tool for implementing this kind of rule

    (var+const1)+(var+const2) ==> 2*var + const3

(a sketch of what such a rule would look like in RULES syntax follows below), and I ran into more questions:

- can RULES left-hand sides test for variables (I don't want to reassociate sums randomly, that wouldn't terminate; instead, I want to float out subterms that are non-variable, and group repeated variables)?

- is there any way to control the rewrite strategy, similar to strategy combinators (if rules are applied all over the place, they introduce new structure not covered by rules; if I could limit the strategy to top-down, or bottom-up, I could at least cover some special cases)?

- how would one handle this kind of optimization in GHC in full generality? wait for compiler plugins? are there features of rewrite rules that I'm missing? would it make sense to flag rewrite-rules system improvements as a GHC GSoC project, given that GHC will have to pull its weight there when moving away from GCC?

Claus
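As a concrete sketch, this is roughly what the rule above would look like if written as an ordinary RULES pragma (the rule name is made up, and it assumes GHC accepts the repeated x on the left-hand side). It only covers the two-summand special case, nothing restricts a and b to literal constants, and there is no way to say where or in which order it should fire - which is exactly what the questions above are about:

{-# RULES
"reassoc/int-plus" forall (x :: Int) a b.
    (x + a) + (x + b) = (2 * x) + (a + b)
  #-}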

Okay, I've found a combination of incantations that happens to
work, but only for this particular example. So this does not solve
the original questions, and I'm still interested in suggestions. But it
does give a concrete example of what I'd like to be able to do (or
better, what GHC should be doing, and GCC apparently does),
in the hope that this helps the discussion along.
This message is a bit lengthy, as I try to summarize the approach
for those here who didn't follow the haskell-cafe thread earlier
(so this should be fairly self-contained) and include timings.
1. Old things first, the TH macro for syntactic replication of calls
to 'f', each call getting its own counter 'i', and 'x' as the first parameter:
------------------------------------------------------------------
{-# LANGUAGE TemplateHaskell #-}
module Apply where
import Language.Haskell.TH.Syntax
-- the "<" in the guard was lost in the archive; the body replicates
-- one call to 'f' per counter value 'i', threading 'x' through the calls
apply i bound
  | i < bound = [| \f x -> $(apply (i+1) bound) f (f x i) |]
  | otherwise = [| \f x -> x |]
------------------------------------------------------------------
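Purely to show the shape of a call site for the macro, a hypothetical splice might look like this (the module name, sum4, and the loop body are made up for the example; splices have to live in a module other than Apply):

{-# LANGUAGE TemplateHaskell #-}
module UseApply where
import Apply

-- $(apply 0 4) expands to four syntactic calls of the supplied loop body,
-- which GHC's ordinary optimizations can then inline and constant-fold
sum4 :: Int -> Int -> Int
sum4 n acc = $(apply 0 4) (\x i -> x + n + i) acc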

| II is where I'd like to be able to distinguish variables, constants,
| and complex expressions in the left-hand sides of RULES, and
| I and III are where I'd like control over the rewrite strategy, as
| in strategy combinators.

I'm deep in icfp submissions, so no time to reply properly.

You can distinguish between literals, variables etc, in GHC's BuiltinRules. These are not hard to write: see prelude/PrelRules. But they are built in, not part of the source program. Maybe that's ok, since you are messing with built-in arithmetic.

Another avenue is to elaborate the language of rules somehow to let you say what you want. But I don't know a good *spec* for such a feature, let alone an impl.

Another possibility is to write a Core-to-Core optimiser aimed at your target area. Max B's plugins would let you dynamically link that to a distributed GHC. But it's not in the HEAD yet...

Simon

| II is where I'd like to be able to distinguish variables, constants,
| and complex expressions in the left-hand sides of RULES, and
| I and III are where I'd like control over the rewrite strategy, as
| in strategy combinators.
I'm deep in icfp submissions, so no time to reply properly.
Okay, not urgent, just general concerns about GHC performance, can wait a few days.
You can distinguish between literals, variables etc, in GHC's BuiltinRules.
Not really sufficient: if the loop body were about Maps instead of Ints, different rules would apply. And the same limitations hold for all uses of RULES, which were meant to provide user-extensible, library-specific optimization by transformation.
Another avenue is to elaborate the language of rules somehow to let you say what you want. But I don't know a good *spec* for such a feature, let alone an impl.

Another possibility is to write a Core-to-Core optimiser aimed at your target area. Max B's plugins would let you dynamically link that to a distributed GHC. But it's not in the HEAD yet...
My guess would be to start from standard Strafunski/SYB-style traversals over Core as the "semantic API", utilizing compiler plugins (a small sketch of what such a traversal might look like, over a toy expression type, follows below). Then look for better syntax, closer to the existing RULES (for a start, something like quasi-quoting, so that left- and right-hand sides of rules can still be written in concrete syntax, even if they are just Haskell function definitions). Then move all RULES to the new framework (to reduce duplication inside GHC, and to provide a single programmer API) and start adding further optimizations to libraries (I wonder: surely others have encountered these limitations when adding RULES for their libraries?).

But that is just a guess - as you imply, someone would have to sit down, design, implement, and test. That is why I asked whether it would make a good GSoC topic for GHC HQ (lots of potential, and building on one of last year's topics).

Claus
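For illustration, here is a minimal sketch of that first step, using syb's generic traversals over a stand-in expression type rather than GHC's actual Core (the types, names, and the rule are made up for the example; a real version would traverse CoreExpr inside a plugin pass):

{-# LANGUAGE DeriveDataTypeable #-}
import Data.Generics   -- from the syb package: Data, everywhere, mkT

data Expr = Var String
          | Lit Integer
          | Add Expr Expr
          | Mul Expr Expr
          deriving (Show, Data, Typeable)

-- the rewrite from the original message:
--   (x + c1) + (x + c2)  ==>  2*x + (c1+c2)
-- the left-hand side can insist that the two summands share a variable
-- and that c1, c2 are literals - exactly what RULES cannot express
reassoc :: Expr -> Expr
reassoc (Add (Add (Var x) (Lit c1)) (Add (Var y) (Lit c2)))
  | x == y = Add (Mul (Lit 2) (Var x)) (Lit (c1 + c2))
reassoc e  = e

-- strategy control: apply the rule bottom-up, once per node
simplify :: Expr -> Expr
simplify = everywhere (mkT reassoc)

Because the traversal scheme is an ordinary value, swapping everywhere for a top-down or single-shot scheme (or for Strafunski-style strategy combinators) gives the kind of control over rewrite strategy that RULES currently lack.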

2009/2/24 Claus Reinke
In the recently buried haskell-cafe thread "speed: ghc vs gcc", Bulat pointed out some of the optimizations that GHC doesn't do, such as loop unrolling. I suggested a way of experimenting with loop unrolling, using Template Haskell to bypass GHC's blind spot (it usually doesn't unfold recursive definitions http://www.haskell.org/pipermail/glasgow-haskell-users/2007-July/012936.html , but if we unfold a loop combinator at compile time, GHC's normal optimizations can take over from there):
http://www.haskell.org/pipermail/haskell-cafe/2009-February/056241.html
Just a note - there is a solution that doesn't require Template Haskell which I use in my own code. Here is a sketch:

fact = fix4 fact_worker

{-# INLINE fact_worker #-}
fact_worker recurse n
  | n <= 0    = 1
  | otherwise = n * recurse (n - 1)

{-# INLINE fix4 #-}
fix4 f = f1
  where f1 = f f2
        f2 = f f3
        f3 = f f4
        f4 = f f1

There is probably a way to generalise this to arbitrary levels of unrolling by using instances of a typeclass on type level numerals.

Cheers, Max
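One possible shape for that generalisation (not code from the thread; the names, and the choice to fall back to plain recursion once the budget is used up instead of tying a cycle like fix4, are made for this sketch) is a class indexed by Peano numerals:

data Z   = Z
data S n = S n

class Unroll n where
  fixN :: n -> (a -> a) -> a

instance Unroll Z where
  {-# INLINE fixN #-}
  fixN _ f = let x = f x in x          -- budget used up: ordinary fix

instance Unroll n => Unroll (S n) where
  {-# INLINE fixN #-}
  fixN (S n) f = f (fixN n f)          -- expose one more application of f

-- unrolled four times, then falls back to plain recursion
fact :: Integer -> Integer
fact = fixN (S (S (S (S Z)))) fact_worker
  where
    fact_worker recurse k
      | k <= 0    = 1
      | otherwise = k * recurse (k - 1)

Whether GHC's inliner unfolds this as aggressively as the hand-written fix4 would need checking with -ddump-simpl.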

but if we unfold a loop combinator at compile time, GHC's normal optimizations can take over from there):
http://www.haskell.org/pipermail/haskell-cafe/2009-February/056241.html
Just a note - there is a solution that doesn't require Template Haskell which I use in my own code. Here is a sketch:
That is in fact the same solution!-) Just that I stayed close to the example in the original thread, hence a fixpoint-combinator with implicit tail-recursion and built-in counter rather than one with explicit general recursion.
fact = fix4 fact_worker
{-# INLINE fact_worker #-}
fact_worker recurse n
  | n <= 0    = 1
  | otherwise = n * recurse (n - 1)

{-# INLINE fix4 #-}
fix4 f = f1
  where f1 = f f2
        f2 = f f3
        f3 = f f4
        f4 = f f1
There is probably a way to generalise this to arbitrary levels of unrolling by using instances of a typeclass on type level numerals.
Semantically, one could compute the nested application without meta-level help, but that involves another recursive definition, which GHC won't unfold during compilation. So I used TH, just to generate the equivalent to the 'fixN' definition (a sketch of such a generator follows below). Since only the fixpoint/loop combinators need to be unfolded statically, one could indeed do it by hand, for a suitable range of unfolding depths, and provide them as a library.

Claus
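A minimal sketch of such a generator, assuming the goal is exactly the shape of Max's fix4 at an arbitrary depth n >= 1 (the module and function names are made up; this is not the code from the thread):

{-# LANGUAGE TemplateHaskell #-}
module FixN where
import Language.Haskell.TH

-- $(mkFix 4) produces  \f -> let f1 = f f2; f2 = f f3; f3 = f f4;
--                                f4 = f f1 in f1
mkFix :: Int -> Q Exp
mkFix n = do
  f  <- newName "f"
  fs <- mapM (\i -> newName ("f" ++ show i)) [1 .. n]
  let next i    = fs !! (i `mod` n)     -- the last worker loops back to f1
      decl i fi = valD (varP fi)
                       (normalB (varE f `appE` varE (next i)))
                       []
  lamE [varP f] (letE (zipWith decl [1 .. n] fs) (varE (head fs)))

A splice like fact = $(mkFix 4) fact_worker (in a module other than FixN, with TemplateHaskell enabled, and with an INLINE pragma on fact_worker as above) then yields the same definition as the hand-written fix4.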
participants (3)
- Claus Reinke
- Max Bolingbroke
- Simon Peyton-Jones