
On Tue, Aug 21, 2007 at 05:25:49AM -0700, Stefan O'Rear wrote:
On Tue, Aug 21, 2007 at 01:14:20PM +0100, Rodrigo Queiro wrote:
On my system, the C version runs about 9x faster than the haskell version (with -O3 and -O2 -fvia-c -optc-O3 respectively). However, GCC seems to produce about 70 lines of assembly for the main loop, compared to about 10 from GHC. I suspect the speed difference is the result of some heavy optimisation by GCC, which would need to be hand-tuned for GHC. (I would be interested to see what this would be. Unfortunately I don't know x86 assembly well enough to understand the GCC output.)
GCC is carrying out two major optimisations that ghc is missing here: replacing the tail call with a jump directly into the function body (having stuffed the correct arguments into the appropriate registers) and unrolling the loop. That's pretty much it. Neither are what I'd call 'heavy' optimisations.
The fundamental problem is that GHC doesn't have enough registers to to a good job with Haskell. Normal Haskell code makes extensive use of the GHC stack for function calls, the C stack for signal handlers, the capability base pointer for thread state, and the heap for everything else. Which doesn't really leave us in a good state for optimizing. In particular, x86 ghc ALWAYS passes parameters on the stack, even for tail calls. I didn't actually bother to check, but I'm pretty sure that's what the OP was noticing - if you look carefully it's not actually pushing or popping anything, just using stack memory.
Yes, absolutely.
Situations are far better on x86_64 (16 registers) and ppc (32 registers). There is some work being done on the backend to improve this (in particular, a new and much better register allocator and a parameter-aware Cmm system).
<fires up ppc box> Ouch. That's even worse: $ ./sum 100000000 C version: 0.16s Haskell : 1.40s Looking at the generated assembler, the ppc version has exactly the same problem that the x86 version does. It carries out the calculation, the stores the result in some memory locations and calls f again so that the preamble to f can pull those same results out of the memory locations in order to put them back into the same registers again! (I'm using ghc 6.6.1 on Debian unstable btw for anyone following along.) cheers, Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt