
Am Dienstag, 17. Juni 2008 20:35 schrieb Dan Doel:
On Tuesday 17 June 2008, Daniel Fischer wrote:
I've experimented a bit and found that Ptr is faster for small arrays (only very slightly so if compiled with -fvia-C -optc-O3), but ByteArr performs much better for larger arrays ... The GC time for the Addr# version is frightening
I had an entire e-mail written about what a bizarre and interesting result you'd just found, but unfortunately, I then remembered exactly how the array gets filled in the Ptr version. Namely:
(Ptr a) <- newArray [0..n-1]
Ouch. I should've looked at both sources, that would have been obvious then :)
Which, I assume does something terrible, like calling length to get the size needed for the array, while also needing the values after array creation to feed into it. For small arrays like the ones I'd been testing with, this doesn't matter, because the work done on the list is negligible. However, when you get to very large lists (100,000 elements and above, apparently) this starts causing massive space leaks, which explains the terrible GC behavior we were seeing.
If you change the benchmark like so:
bench :: Int -> Int -> IO () bench k n = do (Ptr a) <- mallocArray n :: IO (Ptr Int) fill a n replicateM_ k (reverse a 0 (n-1))
fill :: Addr# -> Int -> IO () fill a (I# n) = IO (go 0#) where go i s
| i <# n = case writeIntOffAddr# a i i s of s -> go (i +# 1#) s | otherwise = (# s, () #)
The space leak goes away, and the runtimes stay consistent. Up to around 10,000 elements, Ptr hovers around 6s, and ByteArray (-fasm) stays around 11. At 100,000, Ptr jumps to 12-13s, and ByteArray goes to 14-16 and stays there (again, I imagine due to running into bad cache effects at that level). This is all for size * iterations = 2.5 billion on my machine.
Since my computer is slower, I confined my tests to size*iterations ~= 10^9. Mostly, I find no noticeable difference, between 0.2 and 0.5 %, sometimes one faster, sometimes the other. I have the impression that the Ptr version is the faster more often than the ByteArr version, but the tendency isn't very strong. That applies to both compiled with -O2 -fvia-C -optc-O3. Compiling with -O2 -fasm doesn't make a noticeable difference for Ptr, but is about 13% slower for ByteArr when the arrays are large (too large for the cache, I suppose) and about 50% slower for small arrays. So what I can read off that is that the native code generator still has to catch up for such code, not that either way of implementing arrays is inherently faster.
A good catch anyhow, though. That could explain why Simon Marlow was seeing the Addr# version as slower, since he was using a large array, and thus work done on the list could have contributed significantly (although MUT time was higher with Ptr, so it would have had to contribute work there, not just GC thrashing).
Thanks for the input, -- Dan
Cheers, Daniel