
On Wed, 2007-04-18 at 08:34 -0700, Bryan O'Sullivan wrote:
> Duncan Coutts wrote:
> > I'm currently exploring more design ideas for Data.Binary, including how to deal with alignment. Eliminating unnecessary bounds checks and using aligned memory operations also significantly improve performance. I can get up to ~750MB/s serialisation out of a peak memory bandwidth of ~1750MB/s, though a Haskell word-writing loop can only get ~850MB/s.
>
> What are you using to measure the peak number? It seems very low to me. Even Opterons of a few years' vintage can manage more than 5GB/s.
I was using a C word-writing loop with no unrolling. On x86 that turned into a four-instruction asm loop. Reading words is indeed a good deal faster, and on my x86-64 machine reading 8-byte words is faster still.
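
The write loop was of roughly this shape (a sketch, not the exact benchmark code; fill_words is an illustrative name):

  #include <stdint.h>
  #include <stddef.h>

  /* Illustrative sketch: fill a buffer with 32-bit words, one
     plain store per iteration.  gcc compiles the body to
     roughly four instructions: store, increment, compare,
     branch. */
  void fill_words(uint32_t *buf, size_t n)
  {
      size_t i;
      for (i = 0; i < n; i++)
          buf[i] = (uint32_t)i;
  }

Timing a loop like that over a large buffer is the measurement behind the ~1750MB/s figure.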
> It's also quite normal to get only about half of your peak bandwidth unless you go out of your way to use non-temporal loads and stores (e.g. via SSE2), which is something that gcc, for one, is not at all good at; that matters if you're using -fvia-C.
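
(For concreteness, a non-temporal store loop written with SSE2 intrinsics looks roughly like the sketch below; fill_words_nt and its setup are illustrative, not code from this thread.)

  #include <emmintrin.h>   /* SSE2 intrinsics */
  #include <stdint.h>
  #include <stddef.h>

  /* Illustrative sketch: fill a 16-byte-aligned buffer with
     32-bit words using non-temporal stores, which bypass the
     cache and avoid the read-for-ownership traffic of ordinary
     stores.  n must be a multiple of 4 (four 32-bit words per
     128-bit store). */
  void fill_words_nt(uint32_t *buf, size_t n)
  {
      __m128i v    = _mm_set_epi32(3, 2, 1, 0);
      __m128i step = _mm_set1_epi32(4);
      size_t  i;
      for (i = 0; i < n; i += 4) {
          _mm_stream_si128((__m128i *)(buf + i), v);
          v = _mm_add_epi32(v, step);
      }
      _mm_sfence();   /* make the streamed stores globally visible */
  }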
I'm not interested so much in the peak throughput of the machine as in what is realistically achievable for binary serialisation. So comparing to a non-unrolled C loop seems fair to me. With sufficient improvements in the GHC backend we should be able to approach that. Any remaining difference is then overhead in the Binary library, and that is what I'm really trying to measure.

Duncan