Re: Low-level array performance

17 Jun 2008


      Am Dienstag, 17. Juni 2008 20:35 schrieb Dan Doel:
...
On Tuesday 17 June 2008, Daniel Fischer wrote:
...
I've experimented a bit and found that Ptr is faster for small arrays
(only very slightly so if compiled with -fvia-C -optc-O3), but ByteArr
performs much better for larger arrays
...
The GC time for the Addr# version is frightening
I had an entire e-mail written about what a bizarre and interesting result
you'd just found, but unfortunately, I then remembered exactly how the
array gets filled in the Ptr version. Namely:
(Ptr a) <- newArray [0..n-1]
Ouch. I should've looked at both sources, that would have been obvious then :)
...
Which, I assume does something terrible, like calling length to get the
size needed for the array, while also needing the values after array
creation to feed into it. For small arrays like the ones I'd been testing
with, this doesn't matter, because the work done on the list is negligible.
However, when you get to very large lists (100,000 elements and above,
apparently) this starts causing massive space leaks, which explains the
terrible GC behavior we were seeing.
If you change the benchmark like so:
bench :: Int -> Int -> IO ()
    bench k n = do (Ptr a) <- mallocArray n :: IO (Ptr Int)
                   fill a n
                   replicateM_ k (reverse a 0 (n-1))
fill :: Addr# -> Int -> IO ()
    fill a (I# n) = IO (go 0#)
     where
     go i s
| i <# n    = case writeIntOffAddr# a i i s of s -> go (i +# 1#) s
       | otherwise = (# s, () #)
The space leak goes away, and the runtimes stay consistent. Up to around
10,000 elements, Ptr hovers around 6s, and ByteArray (-fasm) stays around
11. At 100,000, Ptr jumps to 12-13s, and ByteArray goes to 14-16 and stays
there (again, I imagine due to running into bad cache effects at that
level). This is all for size * iterations = 2.5 billion on my machine.
Since my computer is slower, I confined my tests to size*iterations ~= 10^9.
Mostly, I find no noticeable difference, between 0.2 and 0.5 %, sometimes one 
faster, sometimes the other. I have the impression that the Ptr version is 
the faster more often than the ByteArr version, but the tendency isn't very 
strong. 
That applies to both compiled with -O2 -fvia-C -optc-O3. 
Compiling with -O2 -fasm doesn't make a noticeable difference for Ptr, but is 
about 13% slower for ByteArr when the arrays are large (too large for the 
cache, I suppose) and about 50% slower for small arrays.

So what I can read off that is that the native code generator still has to 
catch up for such code, not that either way of implementing arrays is 
inherently faster.
...
A good catch anyhow, though. That could explain why Simon Marlow was seeing
the Addr# version as slower, since he was using a large array, and thus
work done on the list could have contributed significantly (although MUT
time was higher with Ptr, so it would have had to contribute work there,
not just GC thrashing).
Thanks for the input,
-- Dan
Cheers,
Daniel