Re: Low-level array performance

17 Jun 2008

On Tuesday 17 June 2008, Simon Marlow wrote:
...
> So I tried your examples and the Addr# version looks slower than the MBA#
> version:
Hmm...
...
> I tried with 6.8.2 and 6.8.3, using -O2 in both cases.  I tried the Ptr
> version with and without -fvia-C -optc-O2, no difference.
I had forgotten about the -fvia-C in the pragma when I sent it, but I've
tested it both with -fvia-C and with the new backend (and triple-checked
since your message), and I always come away with the Ptr version being
faster. -fvia-C doesn't seem to affect the speed of the Addr# version much,
while it improves the speed of the MBA# version. However, even with the
improved speed, Addr# seems to edge it out here.

With the new backend, I get the results I sent in my initial mail. The
ByteArray version takes 11-12 seconds to reverse a size-10 array 250
million times, whereas the Addr# version takes around 7 seconds.

(I also noticed a bug I'd missed before sending the ByteArray version. It
should allocate based on w, but I left it hard-coded to 4# when I was
experimenting. This was causing segmentation faults on large arrays on my
machine, since I'm running in 64-bit mode, where 8# is the correct value.
Are you running in 32-bit mode, and if so, could that be the source of our
discrepancy?)
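
To illustrate the sizing issue, here is a minimal sketch of allocating by
word size rather than by a hard-coded constant. It is not the code from the
original program (which uses the primops directly and isn't reproduced
here); it assumes the Foreign modules from base, and the names wordBytes,
bytesFor and lastSlot are made up for the example.

```haskell
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (sizeOf, peekElemOff, pokeElemOff)

-- Bytes per Int: 4 on a 32-bit GHC, 8 on a 64-bit one.
wordBytes :: Int
wordBytes = sizeOf (undefined :: Int)

-- Bytes needed to hold n Int elements. Hard-coding 4 here would
-- under-allocate by half on a 64-bit machine.
bytesFor :: Int -> Int
bytesFor n = n * wordBytes

-- Hypothetical helper: allocate room for n Ints (n >= 1), store each
-- index in its own slot, and read back the last slot.
lastSlot :: Int -> IO Int
lastSlot n = allocaBytes (bytesFor n) $ \p -> do
  mapM_ (\i -> pokeElemOff (p :: Ptr Int) i i) [0 .. n - 1]
  peekElemOff p (n - 1)
```

With a buffer sized by bytesFor, writes to the upper half of the array stay
inside the allocation on both 32-bit and 64-bit builds.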
...
> Are these exactly the same programs you measured?  What parameters did you
> use?
Aside from the couple of oversights above, yes. The actual fannkuch benchmark
doesn't use very large arrays. The current test input is n = 11, and all the 
arrays it uses are length n. It gets its work from copying, reversing and 
shifting (portions of) those arrays n! or more times. So, I thought it'd be 
truer to the benchmark to reverse a small array many times. I've been running 
with command lines like './ByteArr 250000000 10', which says to reverse a 
size-10 array 250 million times.
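
For anyone who wants the shape of the test without the primops, the loop
can be sketched roughly as follows. This is an illustration under
assumptions, not the measured program: it uses Ptr Int with the Storable
peek/poke functions from base, and reverseInPlace and reverseMany are names
invented for the sketch.

```haskell
import Foreign.Marshal.Array (withArrayLen, peekArray)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekElemOff, pokeElemOff)

-- Reverse the first n elements of the buffer in place by swapping
-- from both ends toward the middle.
reverseInPlace :: Ptr Int -> Int -> IO ()
reverseInPlace p n = go 0 (n - 1)
  where
    go i j
      | i >= j    = return ()
      | otherwise = do
          x <- peekElemOff p i
          y <- peekElemOff p j
          pokeElemOff p i y
          pokeElemOff p j x
          go (i + 1) (j - 1)

-- Reverse a fresh array [0 .. n-1] `iters` times and return its
-- final contents as a list.
reverseMany :: Int -> Int -> IO [Int]
reverseMany iters n =
  withArrayLen [0 .. n - 1] $ \len p -> do
    let loop k
          | k <= 0    = return ()
          | otherwise = reverseInPlace p len >> loop (k - 1)
    loop iters
    peekArray len p
```

A driver would read the iteration count and the array size from the command
line, in that order, and call reverseMany with them. An even number of
reversals leaves the array in its original order, which makes a cheap
sanity check.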

I tested with other sizes, and the timings stay about the same when I
increase the array size and decrease the iterations by the same factor,
until the array size reaches around 100,000, at which point there's a
drop-off for both (Addr# still being faster). I assume that's due to cache
effects.

Here are some example runs, using '--make -O2' for both (OPTIONS pragma
changed to only have -fglasgow-exts for both, and the w bug fixed).

./ByteArr 250000000 10 +RTS -sstderr
Done.
     56,824 bytes allocated in the heap
        552 bytes copied during GC (scavenged)
          0 bytes copied during GC (not scavenged)
     45,056 bytes maximum residency (1 sample(s))

          1 collections in generation 0 (  0.00s)
          1 collections in generation 1 (  0.00s)

          1 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   10.35s  ( 11.15s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   10.36s  ( 11.15s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    5,486 bytes per MUT second

  Productivity 100.0% of total user, 92.9% of total elapsed

./Ptr 250000000 10 +RTS -sstderr
Done.
     57,840 bytes allocated in the heap
        552 bytes copied during GC (scavenged)
          0 bytes copied during GC (not scavenged)
     45,056 bytes maximum residency (1 sample(s))

          1 collections in generation 0 (  0.00s)
          1 collections in generation 1 (  0.00s)

          1 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.53s  (  7.05s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    6.53s  (  7.05s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    8,854 bytes per MUT second

  Productivity 100.0% of total user, 92.7% of total elapsed
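
To put the two MUT figures side by side, the arithmetic can be written out
as a small sketch. The times and the iteration count are copied from the
two runs above; the names are made up for the example.

```haskell
-- MUT times in seconds, copied from the ByteArr and Ptr runs.
byteArrMut, ptrMut :: Double
byteArrMut = 10.35
ptrMut     = 6.53

-- Number of reversals per run (the first command-line argument).
iterations :: Double
iterations = 250000000

-- Nanoseconds spent per reversal of the size-10 array.
nsPerReversal :: Double -> Double
nsPerReversal secs = secs * 1e9 / iterations

-- How many times longer the ByteArr run takes than the Ptr run.
slowdown :: Double
slowdown = byteArrMut / ptrMut
```

That works out to about 41.4 ns per reversal for ByteArr against about
26.1 ns for Ptr, so ByteArr takes roughly 1.6 times as long here.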

As I mentioned before, using -fvia-C -optc-O2 leaves Ptr unchanged, and speeds
up ByteArr, but not enough to catch up with Ptr (here, at least).

Anyhow, my apologies for the mistakes above, and thanks for your time and 
assistance. I'll try puzzling over the C-- some and probably open a trac 
ticket a bit later as the other Simon suggested (if that's still 
appropriate).

Thanks again,
-- Dan

Dan Doel