
On 27 October 2005 01:33, John Meacham wrote:
I think I might have found why (or partially why) ghc is so slow on x86-64..
section 5.10 of the optimization manual
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
(which has a whole lot of good info for any processor, including a whole chapter on how to write C code that optimizes well independent of the CPU)
"don't place code and data on the same cache line"
I'd be surprised if this is an issue. GHC doesn't normally touch the info tables during execution (with one exception - getting the tag from a constructor in a datatype with >8 constructors). It touches the info tables during GC, but it doesn't touch the code during GC. So we might push some code out of the cache on a GC, but that shouldn't have a large effect.
It could be an alignment issue, I suppose. Or passing arguments in registers (we don't, at the moment, on x86_64).
If you have any handy test programs, can you try fiddling with the alignment of code blocks and see if you get a measurable difference?
(I'm still digesting your other message, I'll reply in due course).
Cheers,
Simon
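The cache-line concern discussed above can be checked with simple address arithmetic. The following is only a sketch, assuming 64-byte cache lines and an info table laid out immediately before its entry code; the function names and the example addresses are hypothetical and not taken from GHC.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes; adjust for the CPU under test. */
#define CACHE_LINE 64u

/* Index of the cache line that contains addr. */
uintptr_t cache_line_of(uintptr_t addr)
{
    return addr / CACHE_LINE;
}

/* Nonzero if the byte ranges [a, a+a_len) and [b, b+b_len)
   touch at least one common cache line. */
int ranges_share_cache_line(uintptr_t a, size_t a_len,
                            uintptr_t b, size_t b_len)
{
    if (a_len == 0 || b_len == 0)
        return 0;
    uintptr_t a_first = cache_line_of(a);
    uintptr_t a_last  = cache_line_of(a + a_len - 1);
    uintptr_t b_first = cache_line_of(b);
    uintptr_t b_last  = cache_line_of(b + b_len - 1);
    /* Two line intervals overlap when each starts before the other ends. */
    return a_first <= b_last && b_first <= a_last;
}
```

For example, a 16-byte table at 0x1000 followed directly by 32 bytes of code at 0x1010 shares a line, while the same code placed at 0x1040 does not.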

On Thu, Oct 27, 2005 at 08:44:10AM +0100, Simon Marlow wrote:
I'd be surprised if this is an issue. GHC doesn't normally touch the info tables during execution (with one exception - getting the tag from a constructor in a datatype with >8 constructors). It touches the info tables during GC, but it doesn't touch the code during GC. So we might push some code out of the cache on a GC, but that shouldn't have a large effect.
Yeah, you are right. I realized this after some more thought, we don't make a new copy of the code for each thunk :)
It could be an alignment issue, I suppose. Or passing arguments in registers (we don't, at the moment, on x86_64).
I tried some experiments using regparm on jhc output on i386 and it did not cause the dramatic effect noticed with x86_64, so I don't think it is just that. Well, it is possible, the x86_64 core might be optimized assuming things are passed in registers while the i386 core might keep the top few stack members in phantom registers or something... but an alignment issue sounds more likely. If we are straddling 4 byte boundaries with our 8 byte pointers and ints, that could affect things very much. It is the number one cause of performance problems according to the AMD optimization manual.
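As a rough illustration of the alignment concern, the check below tests whether an 8-byte value is naturally aligned and whether it crosses a given boundary. It is a sketch only; the helper names are hypothetical and the addresses in the usage note are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if addr is a multiple of align.
   align must be a power of two. */
int is_aligned(uintptr_t addr, size_t align)
{
    return (addr & (align - 1)) == 0;
}

/* Nonzero if an object of `size` bytes starting at addr
   crosses a `boundary`-byte boundary. */
int crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    if (size == 0)
        return 0;
    /* The object crosses a boundary when its first and last bytes
       fall into different boundary-sized blocks. */
    return (addr / boundary) != ((addr + size - 1) / boundary);
}
```

An 8-byte pointer stored at 0x1004 is only 4-byte aligned, so crosses_boundary(0x1004, 8, 8) is nonzero; stored at 0x1008 it is naturally aligned and does not cross.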
If you have any handy test programs, can you try fiddling with the alignment of code blocks and see if you get a measurable difference?
I will try that.
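One possible way to run such an experiment on generated C, sketched under the assumption that the code is compiled with GCC or Clang: the aligned function attribute sets a minimum alignment for a single function, and GCC's -falign-functions=N option does the same for a whole file. CODE_ALIGN, add_one and entry_is_aligned are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Assumed alignment to experiment with; try 4, 16, 32 or 64. */
#define CODE_ALIGN 16

/* GCC and Clang accept the aligned attribute on functions; it sets a
   minimum alignment for the first instruction of the function. */
__attribute__((aligned(CODE_ALIGN)))
long add_one(long x)
{
    return x + 1;
}

/* Nonzero if the entry point of f is CODE_ALIGN-aligned. Converting a
   function pointer to an integer is implementation-defined, but it
   behaves as expected on common x86 and x86-64 toolchains. */
int entry_is_aligned(long (*f)(long))
{
    return ((uintptr_t)f % CODE_ALIGN) == 0;
}
```

Rebuilding a test program with different CODE_ALIGN values and timing each build would show whether code alignment alone makes a measurable difference.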
(I'm still digesting your other message, I'll reply in due course).
I am digesting the c-- papers at the moment :)
John
--
John Meacham - ⑆repetae.net⑆john⑈