
On Tue, Aug 21, 2007 at 01:14:20PM +0100, Rodrigo Queiro wrote:
On my system, the C version runs about 9x faster than the haskell version (with -O3 and -O2 -fvia-c -optc-O3 respectively). However, GCC seems to produce about 70 lines of assembly for the main loop, compared to about 10 from GHC. I suspect the speed difference is the result of some heavy optimisation by GCC, which would need to be hand-tuned for GHC. (I would be interested to see what this would be. Unfortunately I don't know x86 assembly well enough to understand the GCC output.)
The fundamental problem is that GHC doesn't have enough registers to to a good job with Haskell. Normal Haskell code makes extensive use of the GHC stack for function calls, the C stack for signal handlers, the capability base pointer for thread state, and the heap for everything else. Which doesn't really leave us in a good state for optimizing. In particular, x86 ghc ALWAYS passes parameters on the stack, even for tail calls. I didn't actually bother to check, but I'm pretty sure that's what the OP was noticing - if you look carefully it's not actually pushing or popping anything, just using stack memory. Situations are far better on x86_64 (16 registers) and ppc (32 registers). There is some work being done on the backend to improve this (in particular, a new and much better register allocator and a parameter-aware Cmm system). Stefan