
On February 21, 2010 20:57:25 Don Stewart wrote:
I tried out some of the vector and uvector fusion benchmarks with the new LLVM backend
http://donsbot.wordpress.com/2010/02/21/smoking-fast-haskell-code-using-ghcs-new-llvm-codegen/
and got some great results for the tight loops generated through fusion. Up to 2x faster than gcc -O3 in some cases.
I had a quick scan through David's thesis the other day and noted that he attributes at least some of the tight-loop performance advantage to not pinning the STG registers except at function entry and exit.

http://www.cse.unsw.edu.au/~pls/thesis/davidt-thesis.pdf

According to what I understand from the bottom of page 42 and top of page 43, this was done through a custom calling convention whereby the first N arguments get passed in the N hardware registers assigned to the STG virtual registers, and every function is extended to take the STG registers as its first N parameters.

The net result is that, on entry to any function (there are only entries to worry about, as everything is a tail call), the STG virtual registers are in the correct hardware registers, so the RTS is happy. What is interesting, though, is that LLVM is free to spill them between function calls. This can free up more registers for tight loops, and from my understanding of the bottom of page 53 and top of page 54, this was likely crucial to getting the great tight-loop performance in some cases.

I don't know if this even makes sense to ask, but could the same thing be done for the native code generator (i.e., implement the global RTS registers as a calling convention instead of what I presume is a don't-touch approach)?

Cheers! -Tyson

PS: If you happen to read this list, that was a nice body of work, David.
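
For what it's worth, here is a rough C sketch of how I read the scheme. It is only an illustration of the idea, not code from the thesis or from GHC: the names (word, sum_loop, run_sum) and the choice of four "registers" are made up, and plain C gives no guarantee that the call in tail position is compiled as a jump. The point is just that the virtual registers travel as the leading parameters of every function, so they sit in fixed places at each entry while the compiler is free to move or spill them inside the body.

```c
#include <stdint.h>

/* Hypothetical machine word type for this sketch. */
typedef int64_t word;

/*
 * Every "STG function" takes the virtual registers as its leading
 * parameters. With a suitable calling convention those parameters land
 * in fixed hardware registers on entry, which is all the RTS needs.
 *
 * sp, hp : stand-ins for the stack and heap pointers (unused here)
 * r1     : loop counter
 * r2     : accumulator
 */
static word sum_loop(word *sp, word *hp, word r1, word r2)
{
    if (r1 == 0)
        return r2;  /* done: hand the result back */

    /* Inside the body nothing is pinned: the compiler may keep sp and
       hp anywhere, or spill them, as long as they are back in their
       parameter slots for the next call. */
    return sum_loop(sp, hp, r1 - 1, r2 + r1);  /* call in tail position */
}

/* Driver: sets up dummy stack and heap areas and enters the loop. */
word run_sum(word n)
{
    word stack[16];
    word heap[16];
    return sum_loop(stack + 16, heap, n, 0);
}
```

Calling run_sum(10) adds up 10 + 9 + ... + 1. A pinned-register approach would instead reserve the hardware registers holding sp and hp for the whole body of the loop, even though this loop never touches them.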