GHC optimisations

Does GHC do stuff like converting (2*) into (shift 1) or converting x + x into 2*x? If I do x * sin 12, is GHC likely to compute sin 12 at compile-time?

On 8/19/07, Andrew Coppin
Does GHC do stuff like converting (2*) into (shift 1) or converting x + x into 2*x?
Hmm, that's an interesting architecture where multiplication is cheaper than addition :-)
If I do x * sin 12, is GHC likely to compute sin 12 at compile-time?
I seriously doubt it, but there's an easy way to find out the answer to all these questions: pass the -ddump-simpl flag to GHC so that it prints out the intermediate code it generates. There's some information in the users' guide about how to read the results. Cheers, Tim -- Tim Chevalier * catamorphism.org * Often in error, never in doubt "I always wanted to be commander-in-chief of my own one-woman army" -- Ani DiFranco
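(An editorial illustration of Tim's suggestion, not taken from the thread — the module and function names here are invented for the example:)

module Dump where

-- Is the multiplication turned into a shift, and is sin 12 folded?
double :: Int -> Int
double x = 2 * x

scaled :: Double -> Double
scaled x = x * sin 12

Compiling this with 'ghc -O2 -ddump-simpl Dump.hs' prints the simplified Core, which answers both questions for your particular GHC.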

On Sun, Aug 19, 2007 at 12:53:07PM +0100, Andrew Coppin wrote:
Does GHC do stuff like converting (2*) into (shift 1) or converting x + x into 2*x?
For a good time, compile some code which uses even or odd :: Int -> Bool using -O2 -fasm -ddump-asm... The compiler *really* shouldn't be using 'idivl'. (If you use -fvia-C -optc-O2, the C compiler will notice the operations and optimize it itself. This is one of the very few areas where -fvia-C is still better.)
If I do x * sin 12, is GHC likely to compute sin 12 at compile-time?
Also try -ddump-simpl-stats and -ddump-simpl-iterations if you want to know *why*. (The extremely obscure 'full laziness' transformation performed by GHC has a fundamental effect on the compilation of x * sin 12...) Stefan
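(An editorial aside on what Stefan is hinting at, using an invented example: under -O, the full laziness pass floats constant subexpressions out of lambdas, so sin 12 becomes a shared top-level thunk that is computed once at run time rather than being folded at compile time.)

scale :: Double -> Double
scale x = x * sin 12

-- After full laziness the code behaves roughly like:
--   lvl :: Double
--   lvl = sin 12        -- a top-level thunk (CAF), shared by every call
--   scale x = x * lvl
-- so sin 12 is evaluated at most once per program run, not at compile time.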

Stefan O'Rear wrote:
On Sun, Aug 19, 2007 at 12:53:07PM +0100, Andrew Coppin wrote:
Does GHC do stuff like converting (2*) into (shift 1) or converting x + x into 2*x?
For a good time, compile some code which uses even or odd :: Int -> Bool using -O2 -fasm -ddump-asm... The compiler *really* shouldn't be using 'idivl'.
(If you use -fvia-C -optc-O2, the C compiler will notice the operations and optimize it itself. This is one of the very few areas where -fvia-C is still better.)
The way I heard it is that compilation via C is "always better", though they "plan" to change that some day. I don't know how true that is...
If I do x * sin 12, is GHC likely to compute sin 12 at compile-time?
Also try -ddump-simpl-stats and -ddump-simpl-iterations if you want to know *why*. (The extremely obscure 'full laziness' transformation performed by GHC has a fundamental effect on the compilation of x * sin 12...)
Hmm, OK.

On Mon, Aug 20, 2007 at 06:30:27PM +0100, Andrew Coppin wrote:
Stefan O'Rear wrote:
On Sun, Aug 19, 2007 at 12:53:07PM +0100, Andrew Coppin wrote:
Does GHC do stuff like converting (2*) into (shift 1) or converting x + x into 2*x?
For a good time, compile some code which uses even or odd :: Int -> Bool using -O2 -fasm -ddump-asm... The compiler *really* shouldn't be using 'idivl'.
(If you use -fvia-C -optc-O2, the C compiler will notice the operations and optimize it itself. This is one of the very few areas where -fvia-C is still better.)
The way I heard it is that compilation via C is "always better", though they "plan" to change that some day. I don't know how true that is...
Currently, it's never worse. GHC's backend is about as good as GCC; most of the optimisations it doesn't do are not possible for GCC because of various lack-of-information problems (the stack pointer never aliases the heap pointer, stuff like that). It's conceivable that at some point -fasm will be faster, because you have the possibility of much more accurate aliasing information inside the compiler, than can be coded in C. In the meantime, note that the runtime difference is less than 3% and the compile time difference is over 100%, so it's only worthwhile if you expect *this version* of your program to be used more than 30 times, ie releases only. Stefan

On 8/21/07, Stefan O'Rear
Currently, it's never worse. GHC's backend is about as good as GCC; most of the optimisations it doesn't do are not possible for GCC because of various lack-of-information problems (the stack pointer never aliases the heap pointer, stuff like that). It's conceivable that at some point -fasm will be faster, because you have the possibility of much more accurate aliasing information inside the compiler, than can be coded in C. In the meantime, note that the runtime difference is less than 3% and the compile time difference is over 100%, so it's only worthwhile if you expect *this version* of your program to be used more than 30 times, ie releases only.
Wait, you're saying that ghc can produce "pure" c-code, that doesn't contain any assembly code, and that runs as fast as ghc code that does contain assembly? Sooo.... if I was feeling "evil", could I take this c-code and pipe it into something that turns it into C#??? If it contains lots of macros (or any macros at all perhaps...), this becomes non-trivial, but otherwise I think most things in C can be mapped fairly trivially to C#? (It's a one-way mapping of course, eg "delete" in C is simply dropped when mapped to c#). (Not that I have any good reason to do this, simply... fun).

On Tue, Aug 21, 2007 at 09:39:32PM +0800, Hugh Perkins wrote:
On 8/21/07, Stefan O'Rear
wrote: Currently, it's never worse. GHC's backend is about as good as GCC; most of the optimisations it doesn't do are not possible for GCC because of various lack-of-information problems (the stack pointer never aliases the heap pointer, stuff like that). It's conceivable that at some point -fasm will be faster, because you have the possibility of much more accurate aliasing information inside the compiler, than can be coded in C. In the meantime, note that the runtime difference is less than 3% and the compile time difference is over 100%, so it's only worthwhile if you expect *this version* of your program to be used more than 30 times, ie releases only.
Wait, you're saying that ghc can produce "pure" c-code, that doesn't contain any assembly code, and that runs as fast as ghc code that does contain assembly?
No.

Name: Registerized C
Performance: 1.00
Flags: -fvia-C
C, but with gcc and machine specific hacks to implement general tail calls and (most notably) register global variables.

Name: Native code generator
Performance: 0.97
Flags: -fasm
GHC's own mini C compiler converts the internal C-- data into assembly code, which is then piped to gas.

Name: Unregisterized C
Performance: 0.40
Flags: -unreg
Generates near-ANSI C, using memory variables for the VM's registers and the returning function pointer hack seen in oh so many Scheme compilers. Good for early stages of porting, and not much else.

Name: Byte-code
Performance: 0.05
Flags: -fbyte-code (GHCi HEAD only)
Generates a compact form of STG code, and then interprets it. A generally quite bad idea, whose main redeeming feature is that it doesn't require starting the GNU toolchain.
Sooo.... if I was feeling "evil", could I take this c-code and pipe it into something that turns it into C#???
Yes. You could do the same with the original haskell. It's called a compiler.
If it contains lots of macros (or any macros at all perhaps...), this becomes non-trivial,
I fail to see how macros have anything to do with this. Especially since cpp removes them all.
but otherwise I think most things in C can be mapped fairly trivially to C#?
Unsafe C#, sure. Haskell's type system is strictly more expressive than C#, and you need to sacrifice either machine efficiency or checked safety.
(It's a one-way mapping of course, eg "delete" in C is simply dropped when mapped to c#).
There is no delete in C, and even if there was, GHC wouldn't use it. Allocation is *the* major bottleneck of functional programs, and having a custom allocator inlined into every call site is vital to have usable performance.
(Not that I have any good reason to do this, simply... fun).
Stefan

Thank-you for the information. It was very useful. Couple of reactions FWIW:
On 8/21/07, Stefan O'Rear
Sooo.... if I was feeling "evil", could I take this c-code and pipe it into something that turns it into C#???
Yes. You could do the same with the original haskell. It's called a compiler.
Yes, that is true. However, this is also true, for an appropriate compiler, for programs such as: "Give me the first 10 numbers of the Fibonacci series". The compiler can search on the internet for what the Fibonacci series is, and/or ask its friends. In a subsequent version, a compiler could in fact compile programs such as: "Go!" ... where the compiler uses context to deduce what I want it to do ;-) Nevertheless certain compilers are easier to write than others, and writing code to automatically port ghc-generated C code is likely to be significantly easier than compiling Haskell to C#, or to .Net bytecode, from scratch.
Name: Native code generator Performance: 0.97 Flags: -fasm
GHC's own mini C compiler converts the internal C-- data into assembly code, which is then piped to gas.
Ah, hence SPJ's C-- project?
Name: Unregisterized C Performance: 0.40 Flags: -unreg
Generates near-ANSI C, using memory variables for the VM's registers and the returning function pointer hack seen in oh so many Scheme compilers. Good for early stages of porting, and not much else.
Could be good enough. The C# compiler and VM provide some optimizations which could handle this. What is the function pointer hack? Specifically, is that why you say "near-ANSI" C, rather than "ANSI C"?
If it contains lots of macros (or any macros at all perhaps...), this becomes non-trivial,
I fail to see how macros have anything to do with this. Especially since cpp removes them all.
but otherwise I think most things in C can be mapped fairly trivially to C#?
Unsafe C#, sure. Haskell's type system is strictly more expressive than C#, and you need to sacrifice either machine efficiency or checked safety.
(It's a one-way mapping of course, eg "delete" in C is simply dropped when mapped to c#).
There is no delete in C, and even if there was, GHC wouldn't use it. Allocation is *the* major bottleneck of functional programs, and having a custom allocator inlined into every call site is vital to have usable performance.
(Not that I have any good reason to do this, simply... fun).
Stefan

Arggh, lagged out, and accidentally hit send before deleting the quoted text. Sorry :-((((

Hi
Wait, you're saying that ghc can produce "pure" c-code, that doesnt contain any assembly code, and that runs as fast as ghc code that does contain assembly?
No. It can produce pure C code (unregistered), but to get high performance it processes the output assembly afterwards (registered).
Sooo.... if I was feeling "evil", could I take this c-code and pipe it into something that turns it into C#???
You might be able to. Much easier would be to use Yhc and pass the --dotnet flag which generates .NET binaries natively.
macros (or any macros at all perhaps...), this becomes non-trivial, but otherwise I think most things in C can be mapped fairly trivially to C#? (It's a one-way mapping of course, eg "delete" in C is simply dropped when mapped to c#).
There isn't going to be much free/delete, it's all a garbage-collected heap. Thanks Neil

GHC does some constant folding, but little by way of strength reduction, or using shifts instead of multiplication. It's pretty easy to add more: it's all done in a single module. Look at primOpRules in the module PrelRules. Patches welcome! But please also supply test-suite tests that check the correctness of the rules.

Simon

| -----Original Message-----
| From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Stefan O'Rear
| Sent: 19 August 2007 20:14
| To: Andrew Coppin
| Cc: haskell-cafe@haskell.org
| Subject: Re: [Haskell-cafe] GHC optimisations
|
| On Sun, Aug 19, 2007 at 12:53:07PM +0100, Andrew Coppin wrote:
| > Does GHC do stuff like converting (2*) into (shift 1) or converting x + x
| > into 2*x?
|
| For a good time, compile some code which uses even or odd :: Int -> Bool
| using -O2 -fasm -ddump-asm... The compiler *really* shouldn't be using
| 'idivl'.
|
| (If you use -fvia-C -optc-O2, the C compiler will notice the operations
| and optimize it itself. This is one of the very few areas where -fvia-C
| is still better.)
|
| > If I do x * sin 12, is GHC likely to compute sin 12 at compile-time?
|
| Also try -ddump-simpl-stats and -ddump-simpl-iterations if you want to
| know *why*. (The extremely obscure 'full laziness' transformation
| performed by GHC has a fundamental effect on the compilation of x * sin 12...)
|
| Stefan

On Mon, Aug 20, 2007 at 09:57:38PM +0100, Simon Peyton-Jones wrote:
GHC does some constant folding, but little by way of strength reduction, or using shifts instead of multiplication. It's pretty easy to add more: it's all done in a single module. Look at primOpRules in the module PrelRules.
Patches welcome! But please also supply test-suite tests that check the correctness of the rules.
Sucking another example out of comp.lang.functional:
This:
import System
f :: Int -> Int -> Int
f s n = if n > 0 then f (s+n) (n-1) else s
main = do
  [n] <- getArgs
  putStrLn $ show $ f 0 (read n)
is 3-4x slower than this:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int f(int s, int n) {
  return n > 0 ? f(s+n, n-1) : s;
}

int main(int argc, char *argv[]) {
  assert(argc == 2);
  printf("%d\n", f(0, strtol(argv[1], 0, 0)));
}

phil:
On Mon, Aug 20, 2007 at 09:57:38PM +0100, Simon Peyton-Jones wrote:
GHC does some constant folding, but little by way of strength reduction, or using shifts instead of multiplication. It's pretty easy to add more: it's all done in a single module. Look at primOpRules in the module PrelRules.
Patches welcome! But please also supply test-suite tests that check the correctness of the rules.
Sucking another example out of comp.lang.functional:
This:
import System
f :: Int -> Int -> Int
f s n = if n > 0 then f (s+n) (n-1) else s

main = do
  [n] <- getArgs
  putStrLn $ show $ f 0 (read n)
is 3-4x slower than this:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int f(int s, int n) {
  return n > 0 ? f(s+n, n-1) : s;
}

int main(int argc, char *argv[]) {
  assert(argc == 2);
  printf("%d\n", f(0, strtol(argv[1], 0, 0)));
}
The generated assembler suggests (if I've read it correctly) that gcc is spotting that it can replace the tail call with a jump in the C version, but for some reason it can't spot it for the Haskell version when compiling with -fvia-C (and neither does ghc itself using -fasm). So the haskell version ends up pushing and popping values on and off the stack for every call to f, which is a bit sad.
That doesn't sound quite right. The C version should get a tail call with gcc -O2; the Haskell version should be a tail call anyway. Let's see:

C:

$ gcc -O t.c -o t
$ time ./t 1000000000
zsh: segmentation fault (core dumped)  ./t 1000000000
./t 1000000000  0.02s user 0.22s system 5% cpu 4.640 total

Turning on -O2:

$ time ./t 1000000000
-243309312
./t 1000000000  1.89s user 0.00s system 97% cpu 1.940 total

And GHC:

$ ghc -O2 A.hs -o A
$ time ./A 1000000000
-243309312
./A 1000000000  3.21s user 0.01s system 97% cpu 3.289 total

So, what, 1.6x slower than gcc -O2? Seems ok without any tuning.

-- Don

On my system, the C version runs about 9x faster than the haskell
version (with -O3 and -O2 -fvia-c -optc-O3 respectively). However, GCC
seems to produce about 70 lines of assembly for the main loop,
compared to about 10 from GHC. I suspect the speed difference is the
result of some heavy optimisation by GCC, which would need to be
hand-tuned for GHC. (I would be interested to see what this would be.
Unfortunately I don't know x86 assembly well enough to understand the
GCC output.)
On 21/08/07, Donald Bruce Stewart
phil:
On Mon, Aug 20, 2007 at 09:57:38PM +0100, Simon Peyton-Jones wrote:
GHC does some constant folding, but little by way of strength reduction, or using shifts instead of multiplication. It's pretty easy to add more: it's all done in a single module. Look at primOpRules in the module PrelRules.
Patches welcome! But please also supply test-suite tests that check the correctness of the rules.
Sucking another example out of comp.lang.functional:
This:
import System
f :: Int -> Int -> Int
f s n = if n > 0 then f (s+n) (n-1) else s

main = do
  [n] <- getArgs
  putStrLn $ show $ f 0 (read n)
is 3-4x slower than this:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int f(int s, int n) {
  return n > 0 ? f(s+n, n-1) : s;
}

int main(int argc, char *argv[]) {
  assert(argc == 2);
  printf("%d\n", f(0, strtol(argv[1], 0, 0)));
}
The generated assembler suggests (if I've read it correctly) that gcc is spotting that it can replace the tail call with a jump in the C version, but for some reason it can't spot it for the Haskell version when compiling with -fvia-C (and neither does ghc itself using -fasm). So the haskell version ends up pushing and popping values on and off the stack for every call to f, which is a bit sad.
That doesn't sound quite right. The C version should get a tail call , with gcc -O2, the Haskell version should be a tail call anyway.
Let's see:
C:

$ gcc -O t.c -o t
$ time ./t 1000000000
zsh: segmentation fault (core dumped)  ./t 1000000000
./t 1000000000  0.02s user 0.22s system 5% cpu 4.640 total

Turning on -O2

$ time ./t 1000000000
-243309312
./t 1000000000  1.89s user 0.00s system 97% cpu 1.940 total

And GHC:

$ ghc -O2 A.hs -o A
$ time ./A 1000000000
-243309312
./A 1000000000  3.21s user 0.01s system 97% cpu 3.289 total

So, what, 1.6x slower than gcc -O2? Seems ok without any tuning.
-- Don

On Tue, Aug 21, 2007 at 01:14:20PM +0100, Rodrigo Queiro wrote:
On my system, the C version runs about 9x faster than the haskell version (with -O3 and -O2 -fvia-c -optc-O3 respectively). However, GCC seems to produce about 70 lines of assembly for the main loop, compared to about 10 from GHC. I suspect the speed difference is the result of some heavy optimisation by GCC, which would need to be hand-tuned for GHC. (I would be interested to see what this would be. Unfortunately I don't know x86 assembly well enough to understand the GCC output.)
The fundamental problem is that GHC doesn't have enough registers to do a good job with Haskell. Normal Haskell code makes extensive use of the GHC stack for function calls, the C stack for signal handlers, the capability base pointer for thread state, and the heap for everything else. Which doesn't really leave us in a good state for optimizing. In particular, x86 ghc ALWAYS passes parameters on the stack, even for tail calls. I didn't actually bother to check, but I'm pretty sure that's what the OP was noticing - if you look carefully it's not actually pushing or popping anything, just using stack memory. Situations are far better on x86_64 (16 registers) and ppc (32 registers). There is some work being done on the backend to improve this (in particular, a new and much better register allocator and a parameter-aware Cmm system). Stefan

On Tue, Aug 21, 2007 at 05:25:49AM -0700, Stefan O'Rear wrote:
On Tue, Aug 21, 2007 at 01:14:20PM +0100, Rodrigo Queiro wrote:
On my system, the C version runs about 9x faster than the haskell version (with -O3 and -O2 -fvia-c -optc-O3 respectively). However, GCC seems to produce about 70 lines of assembly for the main loop, compared to about 10 from GHC. I suspect the speed difference is the result of some heavy optimisation by GCC, which would need to be hand-tuned for GHC. (I would be interested to see what this would be. Unfortunately I don't know x86 assembly well enough to understand the GCC output.)
GCC is carrying out two major optimisations that ghc is missing here: replacing the tail call with a jump directly into the function body (having stuffed the correct arguments into the appropriate registers) and unrolling the loop. That's pretty much it. Neither are what I'd call 'heavy' optimisations.
The fundamental problem is that GHC doesn't have enough registers to do a good job with Haskell. Normal Haskell code makes extensive use of the GHC stack for function calls, the C stack for signal handlers, the capability base pointer for thread state, and the heap for everything else. Which doesn't really leave us in a good state for optimizing. In particular, x86 ghc ALWAYS passes parameters on the stack, even for tail calls. I didn't actually bother to check, but I'm pretty sure that's what the OP was noticing - if you look carefully it's not actually pushing or popping anything, just using stack memory.
Yes, absolutely.
Situations are far better on x86_64 (16 registers) and ppc (32 registers). There is some work being done on the backend to improve this (in particular, a new and much better register allocator and a parameter-aware Cmm system).
<fires up ppc box>

Ouch. That's even worse:

$ ./sum 100000000
C version: 0.16s
Haskell  : 1.40s

Looking at the generated assembler, the ppc version has exactly the same problem that the x86 version does. It carries out the calculation, then stores the result in some memory locations and calls f again so that the preamble to f can pull those same results out of the memory locations in order to put them back into the same registers again! (I'm using ghc 6.6.1 on Debian unstable btw for anyone following along.)

cheers, Phil

-- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Don's reply didn't reach me for some reason, but pulling it out of the previous response:
On 21/08/07, Donald Bruce Stewart
wrote: phil:
The generated assembler suggests (if I've read it correctly) that gcc is spotting that it can replace the tail call with a jump in the C version, but for some reason it can't spot it for the Haskell version when compiling with -fvia-C (and neither does ghc itself using -fasm). So the haskell version ends up pushing and popping values on and off the stack for every call to f, which is a bit sad.
That doesn't sound quite right. The C version should get a tail call , with gcc -O2, the Haskell version should be a tail call anyway.
Just to be clear; the Haskell version is a tail call, but it's pushing the values to and from memory (well, cache really of course) for every call to f, which is killing the performance.
Let's see:
C

$ gcc -O t.c -o t
$ time ./t 1000000000
zsh: segmentation fault (core dumped)  ./t 1000000000
./t 1000000000  0.02s user 0.22s system 5% cpu 4.640 total
Turning on -O2
$ time ./t 1000000000
-243309312
./t 1000000000  1.89s user 0.00s system 97% cpu 1.940 total
-O3 does better thanks to the loop unrolling, see timings below.
And GHC:
$ ghc -O2 A.hs -o A
$ time ./A 1000000000
-243309312
./A 1000000000  3.21s user 0.01s system 97% cpu 3.289 total

So, what, 1.6x slower than gcc -O2? Seems ok without any tuning.
You're getting much better timings than I am!

$ time -p ./sum-hs 1000000000
-243309312
real 3.75
user 3.70

$ time -p ./sum-c-O2 1000000000
-243309312
real 1.40
user 1.35

$ time -p ./sum-c-O3 1000000000
-243309312
real 1.21
user 1.18

(My box has an AMD Athlon64 3000+ CPU fwiw, but the powerpc version is even worse when compared to its respective C binary!)

Phil

-- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Simon Peyton-Jones wrote:
GHC does some constant folding, but little by way of strength reduction, or using shifts instead of multiplication. It's pretty easy to add more: it's all done in a single module. Look at primOpRules in the module PrelRules.
Patches welcome! But please also supply test-suite tests that check the correctness of the rules.
So... you mean it's source-level transformation rules? (Rather than wired into the compiler itself somewhere.)

| > GHC does some constant folding, but little by way of strength
| reduction, or using shifts instead of multiplication. It's pretty easy
| to add more: it's all done in a single module. Look at primOpRules in
| the module PrelRules.
| >
| > Patches welcome! But please also supply test-suite tests that check
| the correctness of the rules.
| >
|
| So... you mean it's source-level transformation rules? (Rather than
| wired into the compiler itself somewhere.)

No, constant folding is part of the compiler, I'm afraid, in the module PrelRules.

Simon

Simon Peyton-Jones wrote:
| > GHC does some constant folding, but little by way of strength
| reduction, or using shifts instead of multiplication. It's pretty easy
| to add more: it's all done in a single module. Look at primOpRules in
| the module PrelRules.
| >
| > Patches welcome! But please also supply test-suite tests that check
| the correctness of the rules.
| >
|
| So... you mean it's source-level transformation rules? (Rather than
| wired into the compiler itself somewhere.)
No, constant folding is part of the compiler, I'm afraid, in the module PrelRules.
Simon
_Constant_ folding is, but in GHC.Base there are rules like (unboxed) multiplying by zero or one, or adding or subtracting zero, from an unknown other (non-constant) value. I think shifts might be doable via RULES... if you were willing to make one rule for each denominator 2, 4, 8 and so on, which rather depends on max. Int... (and that's not Integers either, I guess) Isaac
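(A sketch of the sort of source-level rules Isaac is describing — invented here, not copied from GHC.Base, whose real rules differ in detail:)

{-# LANGUAGE MagicHash #-}
module IdentityRules where

import GHC.Exts

{-# RULES
"x# +# 0#" forall x#. x# +# 0# = x#
"0# +# x#" forall x#. 0# +# x# = x#
"x# *# 1#" forall x#. x# *# 1# = x#
"x# *# 0#" forall x#. x# *# 0# = 0#
  #-}

Each rule matches one specific literal, which is exactly why the shift idea needs a separate rule per power of two when written this way.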

Isaac Dupree wrote:
Simon Peyton-Jones wrote:
... No, constant folding is part of the compiler, I'm afraid, in the module PrelRules.
Simon
_Constant_ folding is, but in GHC.Base there are rules like (unboxed) multiplying by zero or one, or adding or subtracting zero, from an unknown other (non-constant) value. I think shifts might be doable via RULES... if you were willing to make one rule for each denominator 2, 4, 8 and so on, which rather depends on max. Int... (and that's not Integers either, I guess)
Just to see what this would look like. First of all, optimizing mod and div can not be done with PrelRules, because they are not primitives, quot and rem are. And most of the nice optimizations with shifts no longer work there. But using rules should work, assuming the inliner is not too fast. Multiplication and division can become shifts:
{-# RULES

-- x * 2^n  -->  x `shiftL` n
"x# *# 2#" forall x#. x# *# 2# = x# `iShiftL#` 1#
"2# *# x#" forall x#. 2# *# x# = x# `iShiftL#` 1#
-- etc.

-- x `div` 2^n  -->  x `shiftR` n
"x# `divInt#` 2#" forall x#. divInt# x# 2# = x# `iShiftRA#` 1#
"x# `divInt#` 4#" forall x#. divInt# x# 4# = x# `iShiftRA#` 2#
-- etc.

Mod can become and:

-- x `mod` 2^n  -->  x .&. (2^n - 1)
"x# `modInt#` 2#" forall x#. modInt# x# 2# = andInt# x# 1#
"x# `modInt#` 4#" forall x#. modInt# x# 4# = andInt# x# 3#
-- etc.

  #-}
Here I use a new function (see instance Bits Int),
andInt# :: Int# -> Int# -> Int#
andInt# x# y# = word2Int# (int2Word# x# `and#` int2Word# y#)

but you could write that inline as well.

A problem with these rules is that you need a whole lot of them. 32 per operation (on a 32 bit platform), * 4 operations, * 2 separate versions for words and ints = 256.

Other rules that could be interesting are:

forall a b. fromInteger a + fromInteger b = fromInteger (a + b)
forall a b. fromInteger a * fromInteger b = fromInteger (a * b)
-- etc.

To allow optimizations on generic Num code, although I am not sure what the Haskell spec has to say about this.

Now, if you want to get really creative you can use other semi-evil optimization tricks for quot and rem. The following is based on code generated by Visual C++:

-- remPowInt x y == x `rem` (2^y)
remPowInt x y
  | r >= 0    = r
  | otherwise = ((r - 1) .|. (complement yWithSign)) + 1
 where r = x .&. yWithSign
       yWithSign = (1 `shiftL` (bitSize - 1)) .|. ((1 `shiftL` y) - 1)

Or in assembly (for y == 2, so x `rem` 4)

  and ecx,80000007h
  jns main+60h (401060h)
  dec ecx
  or  ecx,0FFFFFFF8h
  inc ecx

The C++ compiler also performs other optimizations when multiplying with other constants, for example *3 becomes something like

  lea eax, [eax+eax*2]

Divisions become horrendous constructs with magic numbers,

  -- eax := ecx / 5
  mov eax,66666667h
  imul ecx
  sar edx,1
  mov eax,edx
  shr eax,1Fh
  add eax,edx

But such things are probably best left to the code generator / a peephole optimizer, if they are done at all. I think the LEA trick should be feasible.

Twan
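(An editorial check, not from the thread, of why the rules above target div/mod rather than quot/rem: for a positive power-of-two divisor, div agrees with an arithmetic right shift and mod agrees with a bit mask even for negative operands, while quot and rem truncate towards zero and would need the fix-up Twan shows. The helper name below is invented.)

import Data.Bits ((.&.), shiftR)

-- (-5) `div` 4 == -2 == (-5) `shiftR` 2, and (-5) `mod` 4 == 3 == (-5) .&. 3,
-- but (-5) `quot` 4 == -1 and (-5) `rem` 4 == -1.
divModMatchShiftMask :: Bool
divModMatchShiftMask =
  and [ x `div` 4 == x `shiftR` 2 && x `mod` 4 == (x .&. 3)
      | x <- [-16 .. 16 :: Int] ]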

On Aug 21, 2007, at 22:13 , Twan van Laarhoven wrote:
Other rules that could be interesting are:
forall a b. fromInteger a + fromInteger b = fromInteger (a + b)
I don't think this will work, a and b have to be the same type. -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Brandon S. Allbery KF8NH wrote:
On Aug 21, 2007, at 22:13 , Twan van Laarhoven wrote:
Other rules that could be interesting are:
forall a b. fromInteger a + fromInteger b = fromInteger (a + b)
I don't think this will work, a and b have to be the same type.
They are of the same type, both are Integers,
forall a b :: Integer. ((fromInteger (a::Integer)) + (fromInteger b)) :: Num n => n = (fromInteger (a + b :: Integer)) :: Num n => n
Twan

| First of all, optimizing mod and div can not be done with PrelRules,
| because they are not primitives, quot and rem are.

Yes, you can do them with PrelRules! Check out PrelRules.builtinRules.

| Multiplication and division can become shifts:
|
| > {-# RULES
| >
| > -- x * 2^n --> x `shiftL` n
| > "x# *# 2#" forall x#. x# *# 2# = x# `iShiftL#` 1#
| > "2# *# x#" forall x#. 2# *# x# = x# `iShiftL#` 1#
| > -- etc.
|
| A problem with these rules is that you need a whole lot of them. 32 per
| operation (on a 32 bit platform), * 4 operations, * 2 separate versions
| for words and ints = 256.

I think you should be able to do a lot better. For example, to do constant folding for +# you might think you needed a lot of rules

  1# +# 2# = 3#
  1# +# 3# = 4#
  etc

But not so! See PrelRules for how to write one rule that does all of these at once. I think you can do multiply-to-shift in the same way.

The downside of PrelRules is that it's part of the compiler, not in Haskell pragmas; that's what makes it more expressive than rules written in source code.

Does that help? If one of the folk listening to this thread wanted to add a page to the GHC Commentary distilling this thread into Wiki material, I'd be happy to check its accuracy. http://hackage.haskell.org/trac/ghc/wiki/Commentary

Simon

On Wed, Aug 22, 2007 at 09:04:11AM +0100, Simon Peyton-Jones wrote:
| First of all, optimizing mod and div can not be done with PrelRules, | because they are not primitives, quot and rem are.
Yes, you can do them with PrelRules! Check out PrelRules.builtinRules.
| Multiplication and division can become shifts:
|
| > {-# RULES
| >
| > -- x * 2^n --> x `shiftL` n
| > "x# *# 2#" forall x#. x# *# 2# = x# `iShiftL#` 1#
| > "2# *# x#" forall x#. 2# *# x# = x# `iShiftL#` 1#
| > -- etc.
| A problem with these rules is that you need a whole lot of them. 32 per | operation (on a 32 bit platform), * 4 operations, * 2 separate versions | for words and ints = 256.
I think you should be able to do a lot better. For example, to do constant folding for +# you might think you needed a lot of rules

1# +# 2# = 3#
1# +# 3# = 4#
etc
But not so! See PrelRules for how to write one rule that does all of these at once. I think you can do multiply-to-shift in the same way.
The downside of PrelRules is that it's part of the compiler, not in Haskell pragmas; that's what makes it more expressive than rules written in source code.
Something I've pondered is adding a more-expressive form of RULES which works using general pattern matching:

{-# XRULES
    "*#-to-shift" (*#)
      (CoreLit (CoreInt num)) obj
        | num .&. (num - 1) == 0 ->
          CoreApp (CoreVar "GHC.Prim.iShiftL#")
                  [obj, CoreLit (CoreInt (lg2 num))]
      obj (CoreLit (CoreInt num))
        | num .&. (num - 1) == 0 ->
          CoreApp (CoreVar "GHC.Prim.iShiftL#")
                  [obj, CoreLit (CoreInt (lg2 num))]
  #-}

This would require reusing the TH infrastructure, and (depending on how much we can abstract) might leak too many details of Core to be useful; on the other hand it would allow some very interesting domain optimizations to be done.

Views might be nice here.

Opinions on whether something like this is a good idea?

Stefan
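(The guard and the lg2 helper Stefan's sketch assumes are not defined anywhere in the thread; something like the following would do. This is only an illustration — isPowerOfTwo is a name invented here, lg2 is Stefan's.)

import Data.Bits ((.&.))

-- The power-of-two test used as the guard in the XRULES sketch.
isPowerOfTwo :: Integer -> Bool
isPowerOfTwo num = num > 0 && (num .&. (num - 1)) == 0

-- The shift amount: base-2 logarithm of a power of two.
lg2 :: Integer -> Int
lg2 1 = 0
lg2 n = 1 + lg2 (n `div` 2)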

| Something I've pondered is adding a more-expressive form of RULES which
| works using general pattern matching:

Yes, but it would need the rule-matcher in the Simplifier to be more sophisticated. Have a look in specialise/Rules.lhs.

No need to be so ambitious; just moving towards what you can do in PrelRules would be an improvement

Simon

|
| {-# XRULES
|     "*#-to-shift" (*#)
|       (CoreLit (CoreInt num)) obj
|         | num .&. (num - 1) == 0 ->
|           CoreApp (CoreVar "GHC.Prim.iShiftL#")
|                   [obj, CoreLit (CoreInt (lg2 num))]
|       obj (CoreLit (CoreInt num))
|         | num .&. (num - 1) == 0 ->
|           CoreApp (CoreVar "GHC.Prim.iShiftL#")
|                   [obj, CoreLit (CoreInt (lg2 num))]
| #-}
|
| This would require reusing the TH infrastructure, and (depending on how
| much we can abstract) might leak too many details of Core to be
| useful; on the other hand it would allow some very interesting domain
| optimizations to be done.
|
| Views might be nice here.
|
| Opinions on whether something like this is a good idea?
|
| Stefan

Simon Peyton-Jones wrote:
| Something I've pondered is adding a more-expressive form of RULES which | works using general pattern matching:
Yes, but it would need the rule-matcher in the Simplifier to be more sophisticated. Have a look in specialise/Rules.lhs.
No need to be so ambitious; just moving towards what you can do in PrelRules would be an improvement
Careful about concealing http://hackage.haskell.org/trac/ghc/ticket/1603 even more than it already is! (Not actually, but it would probably be even harder to track down and still be a bug in some really obscure cases. Maybe there's a flag to disable PrelRules-optimizations for a run of ghc.) Isaac

Hi
Other rules that could be interesting are:
forall a b. fromInteger a + fromInteger b = fromInteger (a + b)
forall a b. fromInteger a * fromInteger b = fromInteger (a * b)

This is wrong, since the class function can do what it wants. Imagine:

instance Num String where
  (+) = (++)
  fromInteger x = show x

1 + 2 :: String

this expression now goes from "12" to "3" by applying this rule. You need to be incredibly careful if there are any classes floating around.

Thanks

Neil

Neil Mitchell wrote:
Other rules that could be interesting are:
forall a b. fromInteger a + fromInteger b = fromInteger (a + b)
forall a b. fromInteger a * fromInteger b = fromInteger (a * b)
This is wrong, since the class function can do what it wants. Imagine:
instance Num String where
  (+) = (++)
  fromInteger x = show x
1 + 2 :: String
this expression now goes from "12" to "3" by applying this rule.
You need to be incredibly careful if there are any classes floating around.
Do we assume Num instances to obey Num axioms, just as we assume Arrow instances to obey Arrow axioms? My impression is that the GHC de-sugaring of the proc notation contains an optimizing stage that uses arrow axioms. This is a good precedent. If Num obeys ring axioms, fromInteger is a perfectly fine ring-homomorphism. (It's also the first or second homomorphism taught.) If a and b are large integers and the target ring is small, fromInteger (a + b) is likely slower. That is my concern.
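(An aside, not from the thread: the homomorphism law under discussion is easy to state as a QuickCheck property — the property name and the choice of Double as the target type are just for illustration. As Stefan's example below shows, Double violates it, though the default generator may need large inputs, like his 10^22, to find a counterexample.)

import Test.QuickCheck

-- fromInteger distributes over (+): true for a genuine ring homomorphism,
-- false for lossy targets such as Double.
prop_fromIntegerAdd :: Integer -> Integer -> Bool
prop_fromIntegerAdd a b =
  (fromInteger a + fromInteger b :: Double) == fromInteger (a + b)

main :: IO ()
main = quickCheck prop_fromIntegerAdd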

Hi
If Num obeys ring axioms, fromInteger is a perfectly fine ring-homomorphism. (It's also the first or second homomorphism taught.)
Does Int obey these axioms? I'm thinking that assuming properties about things such as numbers is very likely to go wrong very quickly. Monads you might be able to get away with, Numbers you probably can't. Thanks Neil

Neil Mitchell wrote:
If Num obeys ring axioms, fromInteger is a perfectly fine ring-homomorphism. (It's also the first or second homomorphism taught.)
Does Int obey these axioms? I'm thinking that assuming properties about things such as numbers is very likely to go wrong very quickly. Monads you might be able to get away with, Numbers you probably can't.
Now, see, you don't find this kind of discussion in other languages. If you go talk about Java, people just flame each other about whether we should have true MI or not... ;-) Ah, I *like* it here. :-D

On Wed, Aug 22, 2007 at 06:36:15PM +0100, Neil Mitchell wrote:
Hi
If Num obeys ring axioms, fromInteger is a perfectly fine ring-homomorphism. (It's also the first or second homomorphism taught.)
Does Int obey these axioms? I'm thinking that assuming properties about things such as numbers is very likely to go wrong very quickly. Monads you might be able to get away with, Numbers you probably can't.
Int does obey the axioms, it's the classical ring ℤ[4294967296]. Double, however, does not:

stefan@stefans:~$ ghci
GHCi, version 6.7.20070712: http://www.haskell.org/ghc/  :? for help
Loading package base ... linking ... done.
Prelude> let x = 10000000000000000000000 ; y = 1 - x
Prelude> fromInteger (x + y) :: Double
1.0
Prelude> fromInteger x + fromInteger y :: Double
0.0

Stefan

Stefan O'Rear wrote:
On Wed, Aug 22, 2007 at 06:36:15PM +0100, Neil Mitchell wrote:
Hi
If Num obeys ring axioms, fromInteger is a perfectly fine ring-homomorphism. (It's also the first or second homomorphism taught.)
Does Int obey these axioms? I'm thinking that assuming properties about things such as numbers is very likely to go wrong very quickly. Monads you might be able to get away with, Numbers you probably can't.
Int does obey the axioms, it's the classical ring ℤ[4294967296]. Double, however, does not:
But Double is already quite badly behaved:
Prelude> let x = 1e20
Prelude> 1 + (x - x)
1.0
Prelude> (1 + x) - x
0.0
Using the fromInteger (and fromRational) axioms should only *increase* precision, I don't see how that is such a bad thing. Also, as far as I can see GHC already does these optimizations if the type is specialized to Double. Except for the fact that the PrelRules rules don't seem to fire, because the constants get floated out. Twan

Using the fromInteger (and fromRational) axioms should only *increase* precision, I don't see how that is such a bad thing.
I think it's bad if the behaviour of your program depends on the
optimisation level.
On 22/08/07, Twan van Laarhoven
Stefan O'Rear wrote:
On Wed, Aug 22, 2007 at 06:36:15PM +0100, Neil Mitchell wrote:
Hi
If Num obeys ring axioms, fromInteger is a perfectly fine ring-homomorphism. (It's also the first or second homomorphism taught.)
Does Int obey these axioms? I'm thinking that assuming properties about things such as numbers is very likely to go wrong very quickly. Monads you might be able to get away with, Numbers you probably can't.
Int does obey the axioms, it's the classical ring ℤ[4294967296]. Double, however, does not:
But Double is already quite badly behaved:
Prelude> let x = 1e20
Prelude> 1 + (x - x)
1.0
Prelude> (1 + x) - x
0.0

Using the fromInteger (and fromRational) axioms should only *increase* precision, I don't see how that is such a bad thing.

Also, as far as I can see GHC already does these optimizations if the type is specialized to Double. Except for the fact that the PrelRules rules don't seem to fire, because the constants get floated out.
Twan

overdrigzed:
Using the fromInteger (and fromRational) axioms should only *increase* precision, I don't see how that is such a bad thing.
I think it's bad if the behaviour of your program depends on the optimisation level.
Isn't this how the thread started?

$ gcc t.c -o a
$ ./a 100000000
zsh: segmentation fault (core dumped)  ./a 100000000

$ gcc -O3 t.c -o b
$ ./b 100000000
987459712

;)

-- Don

On Thu, 2007-08-23 at 10:17 +1000, Donald Bruce Stewart wrote:
overdrigzed:
Using the fromInteger (and fromRational) axioms should only *increase* precision, I don't see how that is such a bad thing.
I think it's bad if the behaviour of your program depends on the optimisation level.
Isn't this how the thread started?
$ gcc t.c -o a
$ ./a 100000000
zsh: segmentation fault (core dumped)  ./a 100000000

$ gcc -O3 t.c -o b
$ ./b 100000000
987459712
Don't make me pull out sum.

On 8/22/07, Twan van Laarhoven
But Double is already quite badly behaved:
Prelude> let x = 1e20
Prelude> 1 + (x - x)
1.0
Prelude> (1 + x) - x
0.0
Ewwww. Whilst that's understandable and unavoidable, that kind of rings alarm bells for folds of Doubles in an automatic threading situation. I.e., if we split our fold between 4 cores, we might get a different answer than if we split it between 8 cores?

Hugh Perkins wrote:
On 8/22/07, Twan van Laarhoven
wrote: But Double is already quite badly behaved:
Prelude> let x = 1e20
Prelude> 1 + (x - x)
1.0
Prelude> (1 + x) - x
0.0

Ewwww. Whilst that's understandable and unavoidable, that kind of rings alarm bells for folds of Doubles in an automatic threading situation. I.e., if we split our fold between 4 cores, we might get a different answer than if we split it between 8 cores?
Yep, Double's (+) isn't associative. I was trying to come up with ANY common mathematical property that Double fulfills and this was what I came up with:

--> for all x::Float. not (x > x + 1)

I'm not even sure it's true in the presence of very large numbers, optimizations and extra precision being added in the CPU. Note that it's not (x <= x + 1) because of NaN. Is Double's (+) commutative? I don't know for sure. Its (==) isn't reflexive (is it transitive? probably, at least if there aren't too many optimizations, but floating-point transitive equality isn't very useful).

--> Okay... so (+), (*), (==) are probably symmetric too. Any others?

Isaac
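(Another editorial sketch, with invented property names, of how one might poke at Isaac's questions with QuickCheck. Note the default Double generator does not produce NaN, so these tests say nothing about the NaN cases.)

import Test.QuickCheck

prop_plusCommutes :: Double -> Double -> Bool
prop_plusCommutes x y = x + y == y + x                    -- expected to pass

prop_plusAssociates :: Double -> Double -> Double -> Bool
prop_plusAssociates x y z = (x + y) + z == x + (y + z)    -- expect counterexamples

prop_notGtSucc :: Double -> Bool
prop_notGtSucc x = not (x > x + 1)                        -- Isaac's candidate law

main :: IO ()
main = do
  quickCheck prop_plusCommutes
  quickCheck prop_plusAssociates
  quickCheck prop_notGtSucc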

Hi
Its (==) isn't reflexive (is it transitive? probably, at least if there aren't too many optimizations, but floating-point transitive equality isn't very useful).
It's not even referentially transparent in all cases. a == b may fail while the doubles are in the high-precision registers, and then succeed later on in the program once they are truncated. I think you have to specify -fexcess-precision with GHC to get this behaviour. Thanks Neil

On Thu, 23 Aug 2007, Neil Mitchell wrote:
Hi
Its (==) isn't reflexive (is it transitive? probably, at least if there aren't too many optimizations, but floating-point transitive equality isn't very useful).
It's not even referentially transparent in all cases. a == b may fail while the doubles are in the high-precision registers, and then succeed later on in the program once they are truncated. I think you have to specify -fexcess-precision with GHC to get this behaviour.
That's really bad. This will invalidate the clever algorithms for computing (+) and (*) with doubled precision (EFT - error free transformations). http://66.102.9.104/search?q=cache:8vni-CqtINkJ:www.ti3.tu-harburg.de/paper/rump/Ru05c.pdf+TwoSum&hl=de&ct=clnk&cd=1&gl=de

On 8/23/07, Neil Mitchell
It's not even referentially transparent in all cases. a == b may fail while the double's are in the high precision registers, and then succeed later on in the program once they are truncated. I think you have to specify -fexcess-precision with GHC to get this behaviour.
Also, if you tell GCC to use SSE (there are a number of flags, like -march and -msse, -msse2 etc which can do this), it can stop using the 80-bit x87 registers for floating point and switch to 64-bit SSE registers. Then you *don't* get this effect, but I'm not sure you can be sure that you don't ever get it. Floating point numbers make me sad :( AGL -- Adam Langley agl@imperialviolet.org http://www.imperialviolet.org 650-283-9641

Simon Peyton-Jones wrote:
GHC does some constant folding, but little by way of strength reduction, or using shifts instead of multiplication. It's pretty easy to add more: it's all done in a single module. Look at primOpRules in the module PrelRules.
Although it isn't done at the Core level as Simon says, GHC's native code generator does turn multiplies into shifts, amongst various other low-level optimisations. In fact it's not clear that this kind of thing *should* be done at a high level, since it's likely to be machine-dependent. Cheers, Simon
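(To see the native code generator's strength reduction for yourself — an invented example, not from the thread:)

module Shift where

-- With ghc -O2 -fasm -ddump-asm, the assembly for this is expected to use
-- a left shift rather than an integer multiply (details vary by GHC version
-- and platform).
double :: Int -> Int
double x = x * 2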
participants (17)
- Adam Langley
- Albert Y. C. Lai
- Andrew Coppin
- Brandon S. Allbery KF8NH
- Derek Elkins
- dons@cse.unsw.edu.au
- Henning Thielemann
- Hugh Perkins
- Isaac Dupree
- Neil Mitchell
- Philip Armstrong
- Rodrigo Queiro
- Simon Marlow
- Simon Peyton-Jones
- Stefan O'Rear
- Tim Chevalier
- Twan van Laarhoven