
On 25/07/2008, at 12:42 PM, Duncan Coutts wrote:
> Of course then it means we need to have enough work to do. Indeed we need quite a bit just to break even because each core is relatively stripped down without all the out-of-order execution etc.

I don't think that will hurt too much. The code that GHC emits is very regular and the basic blocks tend to be small. A good proportion of it just copies data between the stack and the heap. On the upside, it's all very clean and amenable to some simple peephole optimization and compile-time reordering.

I remember someone telling me that one of the outcomes of the Itanium project was that they didn't get the (low-level) compile-time optimizations to perform as well as they had hoped. The reasoning was that a highly speculative, out-of-order processor with all the trimmings has a lot more dynamic information about the state of the program, and can make decisions on the fly that are better than what you could ever get statically at compile time. Does anyone have a reference for this?

In any case, this problem is moot with GHC code. There's barely any instruction-level parallelism to exploit, but adding an extra hardware thread is just a `par` away.

To quote a talk from that paper earlier: "GHC programs turn an Athlon into a 486 with a high clock speed!"

Ben.
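
P.S. For anyone who hasn't used it, here is a minimal sketch of what "a `par` away" might look like. It assumes the `parallel` package is installed (for `Control.Parallel`); `fib` and `parSum` are made-up names for illustration only, and the naive Fibonacci is just a stand-in for "enough work to do".

```haskell
import Control.Parallel (par, pseq)

-- Naive Fibonacci: deliberately slow, so each call is a worthwhile
-- chunk of work to hand to another hardware thread.
fib :: Int -> Integer
fib n
  | n < 2     = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- Spark the evaluation of x so an idle hardware thread may pick it up,
-- evaluate y on the current thread, then combine the two results.
parSum :: Int -> Int -> Integer
parSum a b = x `par` (y `pseq` (x + y))
  where
    x = fib a
    y = fib b

main :: IO ()
main = print (parSum 30 31)
```

To actually get a second thread, this would need to be built with `ghc -O2 -threaded` and run with `+RTS -N2`; without the threaded runtime the spark is simply evaluated by the main thread when its value is needed. Whether it pays off depends on the sparked work being large enough to cover the overhead, which is the break-even point raised above.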