
On Tuesday 17 June 2008, Simon Marlow wrote:
> So I tried your examples and the Addr# version looks slower than the MBA# version:

Hmm...

> I tried with 6.8.2 and 6.8.3, using -O2 in both cases. I tried the Ptr version with and without -fvia-C -optc-O2, no difference.

I had forgotten about the -fvia-C in the pragma when I sent it, but I've tested both via C and with the new backend (and triple-checked since your message), and I always come away with the Ptr version being faster. -fvia-C doesn't seem to affect the speed of the Addr# version much, while it does improve the speed of the MBA# version. Even with that improvement, though, Addr# edges it out here. With the new backend, I get the results I sent in my initial mail: the ByteArray version takes 11 - 12 seconds to reverse a size-10 array 250 million times, whereas the Addr# version takes around 7 seconds.

(I also noticed a bug I'd missed before sending the ByteArray version. It should allocate based on w, but I left the size hard-coded to 4# from when I was experimenting. This was causing segmentation faults on large arrays on my machine, since I'm running in 64-bit mode, where 8# is the correct value. Are you running in 32-bit mode, and if so, could that be the source of our discrepancy?)
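
To make the w fix concrete, here is a minimal sketch of sizing the allocation from the platform word size rather than a literal 4#. It is not the code from my original message: the names MBA, wordSize and newWordArray are made up for illustration, and it assumes a GHC recent enough to accept LANGUAGE pragmas in place of -fglasgow-exts.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main (main) where

import Foreign.Storable (sizeOf)
import GHC.Exts (Int (I#), MutableByteArray#, newByteArray#, (*#))
import GHC.ST (ST (ST), runST)

-- Hypothetical boxed wrapper so the unlifted array can be returned from ST.
data MBA s = MBA (MutableByteArray# s)

-- Size of a machine word in bytes: 4 on a 32-bit GHC, 8 on a 64-bit one.
wordSize :: Int
wordSize = sizeOf (undefined :: Int)

-- Allocate room for n machine words. The byte count is n * wordSize,
-- never a hard-coded n * 4, which under-allocates on 64-bit platforms.
newWordArray :: Int -> ST s (MBA s)
newWordArray (I# n#) = ST $ \s ->
  case wordSize of
    I# w# -> case newByteArray# (n# *# w#) s of
      (# s', mba# #) -> (# s', MBA mba# #)

main :: IO ()
main = do
  let n = 10
      -- Perform the allocation, then report how many bytes were requested.
      bytes = runST (newWordArray n >> return (n * wordSize))
  print bytes
```

With a hard-coded 4#, a size-n array on a 64-bit machine gets only half the bytes that n word-sized writes will touch, which fits the segmentation faults I saw on large arrays.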

> Are these exactly the same programs you measured? What parameters did you use?

Aside from the couple of oversights above, yes. The actual fannkuch benchmark doesn't use very large arrays. The current test input is n = 11, and all the arrays it uses are of length n. It gets its work from copying, reversing and shifting (portions of) those arrays n! or more times. So I thought it would be truer to the benchmark to reverse a small array many times.

I've been running with command lines like './ByteArr 250000000 10', which says to reverse a size-10 array 250 million times. I tested with other sizes, and the timings stay about the same when I increase the array size and decrease the iterations by the same factor, until the array size reaches around 100,000, at which point there's a drop-off for both (Addr# still being faster). I assume that's due to cache effects.

Here are some example runs, using '--make -O2' for both (OPTIONS pragma changed to only have -fglasgow-exts for both, and the w bug fixed).

./ByteArr 250000000 10 +RTS -sstderr
Done.
         56,824 bytes allocated in the heap
            552 bytes copied during GC (scavenged)
              0 bytes copied during GC (not scavenged)
         45,056 bytes maximum residency (1 sample(s))

              1 collections in generation 0 (  0.00s)
              1 collections in generation 1 (  0.00s)

              1 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   10.35s  ( 11.15s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   10.36s  ( 11.15s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    5,486 bytes per MUT second

  Productivity 100.0% of total user, 92.9% of total elapsed

./Ptr 250000000 10 +RTS -sstderr
Done.
         57,840 bytes allocated in the heap
            552 bytes copied during GC (scavenged)
              0 bytes copied during GC (not scavenged)
         45,056 bytes maximum residency (1 sample(s))

              1 collections in generation 0 (  0.00s)
              1 collections in generation 1 (  0.00s)

              1 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.53s  (  7.05s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    6.53s  (  7.05s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    8,854 bytes per MUT second

  Productivity 100.0% of total user, 92.7% of total elapsed

As I mentioned before, using -fvia-C -optc-O2 leaves Ptr unchanged and speeds up ByteArr, but not enough to catch up with Ptr (here, at least).

Anyhow, my apologies for the mistakes above, and thanks for your time and assistance. I'll try puzzling over the C-- some and will probably open a trac ticket a bit later, as the other Simon suggested (if that's still appropriate).

Thanks again,
-- Dan