
Greetings,

Recently, due to scattered complaints I'd seen on the internet, I set about rewriting the fannkuch [1] benchmark on the Great Computer Language Shootout. The current entry uses Ptr/Addr#, malloc, etc., so it's not particularly representative of code one would actually write in Haskell these days. Over the past few days, I've written several versions of the benchmark, and collaborated a bit with local speed guru Don Stewart, but an entry that bests the current one does not seem to be in the cards at the moment. So, I thought I'd write about some of the issues, and hopefully get some encouraging news about the situation.

Issue 1: STUArrays aren't optimized as fully as one might ideally expect.

This isn't so much of an issue, I suppose, except possibly in an environment like the shootout. I wrote versions of the benchmark (and of particular pieces of the overall benchmark as well) for both STUArrays and MUArrays from dons' new uvector [2] library. The STUArray versions are consistently slower, and a look at the core reveals that the STUArray code contains significantly more boxing than uvector. This isn't a big deal, as there's no reason not to use uvector, aside from it not being blessed by inclusion in the GHC distribution. However, I thought I'd bring it up in case anyone hasn't heard of the library yet.

Issue 2: Reading from/writing to a MutableByteArray# is slower than an Addr#

This is, I think, the crux of the issue. The main content of the benchmark is reversing/shifting items in an array. To get a somewhat easier look at the core, I boiled things down to a benchmark that just reverses a small array many times. In the interest of further reducing things, I wrote a version of the benchmark that uses raw Addr#s, and a version that uses raw MutableByteArray#s. I've attached both versions.

The fannkuch benchmark (input 11) with byte arrays runs in around 12 seconds on my machine. To get the reversal benchmark to around that time, I can tell it to, for instance, reverse a 10-element array 250 million times. Those same inputs to the Addr# version of the reversal benchmark result in a runtime of 7 to 8 seconds, so there's a significant discrepancy at the scale the benchmark actually runs at. (The current fannkuch entry that uses Addr#/Ptr takes around 9 seconds, so there's other stuff going on in the benchmark evening them out, but this is likely the source of the difference between the two.)

uvector performance on the reversal benchmark is comparable with MutableByteArray#, so the practical slowdown of using a library fit for actual public consumption seems to come from its being based on byte arrays (STUArray again falls further behind by retaining too many boxes).

So, to wrap things up: dons and I were rather curious about the performance difference between MutableByteArray# and Addr#, as one might expect them to be comparable, both being low-level, raw-pointer-like structures; the code for using the two is nearly identical.

(If one were to raise an Issue 3, it'd likely be that, even with Addr# and malloc, which is currently the fastest Haskell solution, we're not that fast. A glance at the shootout list reveals that we're beaten by a myriad of Java-based languages, Python (Psyco), Ada, OCaml, Eiffel, Lisp (SBCL), Clean, and of course C++, C, Fortran, etc. Is there any hope of performance improvements that will push us up that list? I've heard that people are working on improved code generation, but I don't know if that can help with this sort of thing, or if it's more of a runtime-system matter.

While writing this mail, it came to mind that this is also a potential source of slowness in the sorting 'library' I've been writing for uvector. I already have some routines that are competitive with, or beat, qsort from glibc, but GNU's C++ STL introsort implementation has remained untouched. However, if reads and writes simply take twice as much time in Haskell as they do in C++, my results aren't so surprising.)

Anyhow, thanks for any light that can be shed on the subject.

1: http://shootout.alioth.debian.org/gp4/benchmark.php?test=fannkuch
2: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/uvector
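For readers without the attachments: the following is a minimal, hedged sketch of the reversal benchmark's shape, written against the portable Ptr API rather than the raw Addr#/MutableByteArray# primops the attached files use. The names (revRange) and the iteration count are illustrative, not taken from the attachments.

import Control.Monad (replicateM_)
import Foreign.Marshal.Array (mallocArray, pokeArray)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekElemOff, pokeElemOff)

-- Reverse the elements between indices i and j, in place.
revRange :: Ptr Int -> Int -> Int -> IO ()
revRange p i j
  | i < j = do
      x <- peekElemOff p i
      y <- peekElemOff p j
      pokeElemOff p i y
      pokeElemOff p j x
      revRange p (i + 1) (j - 1)
  | otherwise = return ()

main :: IO ()
main = do
  let n = 10
  p <- mallocArray n :: IO (Ptr Int)
  pokeArray p [0 .. n - 1]                      -- fill with 0..n-1
  replicateM_ 250000000 (revRange p 0 (n - 1))  -- reverse it many times
  putStrLn "Done."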

Dan

John Dias is indeed spending 6 months at Microsoft to work on GHC's back end. He's doing a pretty wholesale re-architecting job, so it will be a couple of months before we have the new setup glued together, but once we do I hope that we'll have a much more friendly framework in place for doing good optimizations. Meanwhile, concrete examples like yours are very useful.

It would be still more useful if you and Don could pinpoint more accurately just what the difference (say) between Addr# and MutableByteArray# is. Is it array bounds checking, for example? Looking first at the Core that is generated, and then (if it looks the same in both cases) at the C--, is often illuminating. If you can say "for this little loop, GHC generates this stupid code sequence", that's particularly helpful. (Don often does this.) Even if you don't, we'll still take a look in due course, but the more you can pin it down, the more motivating it is for us to fix it!

Do start a Trac ticket so your thoughts and code examples are not lost in the welter of email.

Thanks

Simon
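For anyone wanting to follow Simon's suggestion along at home: as of GHC 6.8 the relevant dump flags can be invoked as below. Reverse.hs is a stand-in name for either benchmark file, not one of the actual attachments.

ghc -O2 -ddump-simpl Reverse.hs > Reverse.core   # dump the optimised Core
ghc -O2 -ddump-cmm Reverse.hs > Reverse.cmm      # dump the C--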

Dan Doel wrote:
Issue 2: Reading from/writing to a MutableByteArray# is slower than an Addr#
This is, I think, the crux of the issue. The main content of the benchmark is reversing/shifting items in an array. To get a somewhat easier look at the core, I boiled things down to a benchmark that just reverses a small array many times. In the interest of further reducing things, I wrote a version of the benchmark that uses raw Addr#s, and a version that uses raw MutableByteArray#s. I've attached both versions.
So I tried your examples and the Addr# version looks slower than the MBA# version:

$ ./Ptr 100 1000000 +RTS -sstderr
Done.
     48,196,560 bytes allocated in the heap
     27,381,764 bytes copied during GC (scavenged)
     18,260,784 bytes copied during GC (not scavenged)
     14,389,248 bytes maximum residency (5 sample(s))

             92 collections in generation 0 (  0.09s)
              5 collections in generation 1 (  0.13s)

             28 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.68s  (  0.69s elapsed)
  GC    time    0.22s  (  0.28s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.90s  (  0.97s elapsed)

$ ./ByteArr 100 1000000 +RTS -sstderr
Done.
      4,042,700 bytes allocated in the heap
          1,272 bytes copied during GC (scavenged)
              0 bytes copied during GC (not scavenged)
         16,384 bytes maximum residency (1 sample(s))

              2 collections in generation 0 (  0.00s)
              1 collections in generation 1 (  0.00s)

              5 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.53s  (  0.54s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.53s  (  0.54s elapsed)

I tried with 6.8.2 and 6.8.3, using -O2 in both cases. I tried the Ptr version with and without -fvia-C -optc-O2, no difference.

Are these exactly the same programs you measured? What parameters did you use?

Cheers,
Simon

On Tuesday 17 June 2008, Simon Marlow wrote:
So I tried your examples and the Addr# version looks slower than the MBA# version:
Hmm...
I tried with 6.8.2 and 6.8.3, using -O2 in both cases. I tried the Ptr version with and without -fvia-C -optc-O2, no difference.
I had forgotten about the via-C in the pragma when I sent it, but I've tested both via-C and with the new backend (and triple-checked since your message), and I always come away with the Ptr version being faster. -fvia-C doesn't seem to affect the speed of the Addr# version much, while it improves the speed of the MBA# version. However, even with the improved speed, Addr# seems to edge it out here.

With the new backend, I get the results I sent in my initial mail. The ByteArray version takes 11 - 12 seconds to reverse a size-10 array 250 million times, whereas the Addr# version takes around 7 seconds.

(I also noticed a bug I'd missed before sending the ByteArray version. It should allocate based on w, but I left it hard-coded to 4# when I was experimenting. This was causing segmentation faults on large arrays on my machine, since I'm running in 64-bit mode, and 8# is the correct value here. Are you running in 32-bit, and if so, could that be the source of our discrepancy?)
Are these exactly the same programs you measured? What parameters did you use?
Aside from the couple of oversights above, yes.

The actual fannkuch benchmark doesn't use very large arrays. The current test input is n = 11, and all the arrays it uses are length n. It gets its work from copying, reversing and shifting (portions of) those arrays n! or more times. So, I thought it'd be truer to the benchmark to reverse a small array many times. I've been running with command lines like './ByteArr 250000000 10', which says to reverse a size-10 array 250 million times. I tested with other sizes, and things seem to stay about the same when increasing the array size and decreasing the iterations by the same factor, until I got to an array size of around 100,000, at which point there's a drop-off for both (Addr# still being faster). I assume that's due to cache effects.

Here are some example runs, using '--make -O2' for both (OPTIONS pragma changed to only have -fglasgow-exts for both, and the w bug fixed).

./ByteArr 250000000 10 +RTS -sstderr
Done.
         56,824 bytes allocated in the heap
            552 bytes copied during GC (scavenged)
              0 bytes copied during GC (not scavenged)
         45,056 bytes maximum residency (1 sample(s))

              1 collections in generation 0 (  0.00s)
              1 collections in generation 1 (  0.00s)

              1 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   10.35s  ( 11.15s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   10.36s  ( 11.15s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    5,486 bytes per MUT second

  Productivity 100.0% of total user, 92.9% of total elapsed

./Ptr 250000000 10 +RTS -sstderr
Done.
         57,840 bytes allocated in the heap
            552 bytes copied during GC (scavenged)
              0 bytes copied during GC (not scavenged)
         45,056 bytes maximum residency (1 sample(s))

              1 collections in generation 0 (  0.00s)
              1 collections in generation 1 (  0.00s)

              1 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.53s  (  7.05s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    6.53s  (  7.05s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    8,854 bytes per MUT second

  Productivity 100.0% of total user, 92.7% of total elapsed

As I mentioned before, using -fvia-C -optc-O2 leaves Ptr unchanged and speeds up ByteArr, but not enough to catch up with Ptr (here, at least).

Anyhow, my apologies for the mistakes above, and thanks for your time and assistance. I'll try puzzling over the C-- some, and will probably open a Trac ticket a bit later, as the other Simon suggested (if that's still appropriate).

Thanks again,
-- Dan
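A note on the word-size bug mentioned above: here is a hedged sketch of the fix, assuming the benchmark's w is meant to be the width of an Int in bytes. The helper name is mine, not the benchmark's.

import Foreign.Storable (sizeOf)

-- Bytes per Int on the running platform: 4 under a 32-bit GHC, 8 under
-- a 64-bit one. The byte array must be n * wordSize bytes, not n * 4.
wordSize :: Int
wordSize = sizeOf (undefined :: Int)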

On Tuesday 17 June 2008 18:32, Dan Doel wrote:
On Tuesday 17 June 2008, Simon Marlow wrote:
So I tried your examples and the Addr# version looks slower than the MBA# version:
Hmm...
I tried with 6.8.2 and 6.8.3, using -O2 in both cases. I tried the Ptr version with and without -fvia-C -optc-O2, no difference.
I had forgotten about the via-C in the pragma when I sent it, but I've tested both via-C and with the new backend (and triple-checked since your message), and I always come away with the Ptr version being faster. -fvia-C doesn't seem to affect the speed of the Addr# version much, while it improves the speed of the MBA# version. However, even with the improved speed, Addr# seems to edge it out here.
With the new backend, I get the results I sent in my initial mail. The ByteArray version takes 11 - 12 seconds to reverse a size 10 array 250 million times, whereas the Addr# version takes around 7 seconds.
I've experimented a bit and found that Ptr is faster for small arrays (only very slightly so if compiled with -fvia-C -optc-O3), but ByteArr performs much better for larger arrays:

dafis@linux:~/Documents/haskell/move> ./PtrC +RTS -sstderr -RTS 20 10000000
./PtrC 20 10000000 +RTS -sstderr
Done.
    481,596,836 bytes allocated in the heap
    257,665,360 bytes copied during GC (scavenged)
    171,919,440 bytes copied during GC (not scavenged)
    117,149,696 bytes maximum residency (8 sample(s))

            919 collections in generation 0 (  3.44s)
              8 collections in generation 1 ( 24.99s)

            226 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    8.16s  (  9.06s elapsed)
  GC    time   28.43s  ( 30.11s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   36.59s  ( 39.16s elapsed)

  %GC time      77.7%  (76.9% elapsed)

  Alloc rate    59,019,220 bytes per MUT second

  Productivity  22.3% of total user, 20.8% of total elapsed

dafis@linux:~/Documents/haskell/move> ./ByteArrC +RTS -sstderr -RTS 20 10000000
./ByteArrC 20 10000000 +RTS -sstderr
Done.
     40,041,976 bytes allocated in the heap
          1,272 bytes copied during GC (scavenged)
              0 bytes copied during GC (not scavenged)
         16,384 bytes maximum residency (1 sample(s))

              2 collections in generation 0 (  0.00s)
              1 collections in generation 1 (  0.00s)

             40 Mb total memory in use

  INIT  time    0.00s  (  0.02s elapsed)
  MUT   time    5.03s  (  5.32s elapsed)
  GC    time    0.00s  (  0.01s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    5.03s  (  5.35s elapsed)

  %GC time       0.0%  (0.3% elapsed)

  Alloc rate    7,960,631 bytes per MUT second

  Productivity 100.0% of total user, 94.0% of total elapsed

Using GHC 6.8.2. The GC time for the Addr# version is frightening.

On Tuesday 17 June 2008, Daniel Fischer wrote:
I've experimented a bit and found that Ptr is faster for small arrays (only very slightly so if compiled with -fvia-C -optc-O3), but ByteArr performs much better for larger arrays ... The GC time for the Addr# version is frightening
I had an entire e-mail written about what a bizarre and interesting result you'd just found, but unfortunately, I then remembered exactly how the array gets filled in the Ptr version. Namely:

(Ptr a) <- newArray [0..n-1]

Which, I assume does something terrible, like calling length to get the size needed for the array, while also needing the values after array creation to feed into it. For small arrays like the ones I'd been testing with, this doesn't matter, because the work done on the list is negligible. However, when you get to very large lists (100,000 elements and above, apparently) this starts causing massive space leaks, which explains the terrible GC behavior we were seeing.

If you change the benchmark like so:

bench :: Int -> Int -> IO ()
bench k n = do (Ptr a) <- mallocArray n :: IO (Ptr Int)
               fill a n
               replicateM_ k (reverse a 0 (n-1))

fill :: Addr# -> Int -> IO ()
fill a (I# n) = IO (go 0#)
 where
 go i s | i <# n    = case writeIntOffAddr# a i i s of s -> go (i +# 1#) s
        | otherwise = (# s, () #)

The space leak goes away, and the runtimes stay consistent. Up to around 10,000 elements, Ptr hovers around 6s, and ByteArray (-fasm) stays around 11. At 100,000, Ptr jumps to 12-13s, and ByteArray goes to 14-16 and stays there (again, I imagine due to running into bad cache effects at that level). This is all for size * iterations = 2.5 billion on my machine.

A good catch anyhow, though. That could explain why Simon Marlow was seeing the Addr# version as slower, since he was using a large array, and thus work done on the list could have contributed significantly (although MUT time was higher with Ptr, so it would have had to contribute work there, not just GC thrashing).

Thanks for the input,
-- Dan
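An aside on why newArray leaks here: it must take the list's length up front to size the allocation, and then needs the elements again to fill it, so the entire list stays live in between. Below is a portable sketch of a fill that avoids this, using only the ordinary Foreign API; fillPortable is a hypothetical name, not code from the benchmark.

import Control.Monad (forM_)
import Foreign.Ptr (Ptr)
import Foreign.Storable (pokeElemOff)

-- Write 0..n-1 into the array element by element; forM_ consumes the
-- index list lazily, so nothing retains the whole list at once.
fillPortable :: Ptr Int -> Int -> IO ()
fillPortable p n = forM_ [0 .. n - 1] $ \i -> pokeElemOff p i i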

On Tuesday 17 June 2008 20:35, Dan Doel wrote:
On Tuesday 17 June 2008, Daniel Fischer wrote:
I've experimented a bit and found that Ptr is faster for small arrays (only very slightly so if compiled with -fvia-C -optc-O3), but ByteArr performs much better for larger arrays ... The GC time for the Addr# version is frightening
I had an entire e-mail written about what a bizarre and interesting result you'd just found, but unfortunately, I then remembered exactly how the array gets filled in the Ptr version. Namely:
(Ptr a) <- newArray [0..n-1]
Ouch. I should've looked at both sources, that would have been obvious then :)
Which, I assume does something terrible, like calling length to get the size needed for the array, while also needing the values after array creation to feed into it. For small arrays like the ones I'd been testing with, this doesn't matter, because the work done on the list is negligible. However, when you get to very large lists (100,000 elements and above, apparently) this starts causing massive space leaks, which explains the terrible GC behavior we were seeing.
If you change the benchmark like so:
bench :: Int -> Int -> IO ()
bench k n = do (Ptr a) <- mallocArray n :: IO (Ptr Int)
               fill a n
               replicateM_ k (reverse a 0 (n-1))

fill :: Addr# -> Int -> IO ()
fill a (I# n) = IO (go 0#)
 where
 go i s | i <# n    = case writeIntOffAddr# a i i s of s -> go (i +# 1#) s
        | otherwise = (# s, () #)
The space leak goes away, and the runtimes stay consistent. Up to around 10,000 elements, Ptr hovers around 6s, and ByteArray (-fasm) stays around 11. At 100,000, Ptr jumps to 12-13s, and ByteArray goes to 14-16 and stays there (again, I imagine due to running into bad cache effects at that level). This is all for size * iterations = 2.5 billion on my machine.
Since my computer is slower, I confined my tests to size*iterations ~= 10^9. Mostly, I find no noticeable difference (between 0.2 and 0.5%), sometimes with one faster, sometimes the other. I have the impression that the Ptr version is faster more often than the ByteArr version, but the tendency isn't very strong. That applies to both compiled with -O2 -fvia-C -optc-O3. Compiling with -O2 -fasm doesn't make a noticeable difference for Ptr, but it is about 13% slower for ByteArr when the arrays are large (too large for the cache, I suppose) and about 50% slower for small arrays. So what I read off this is that the native code generator still has to catch up for such code, not that either way of implementing arrays is inherently faster.
A good catch anyhow, though. That could explain why Simon Marlow was seeing the Addr# version as slower, since he was using a large array, and thus work done on the list could have contributed significantly (although MUT time was higher with Ptr, so it would have had to contribute work there, not just GC thrashing).
Thanks for the input, -- Dan
Cheers, Daniel

I see that Dan Doel's post favoring Ptr/Addr# has the same allocation amounts (from +RTS -sstderr) for Ptr/Addr# and the MutableByteArray#.

Everyone else sees more allocation for Ptr/Addr# than MBA#, and sees MBA# as faster in those cases.

I myself (on a G4) see more allocation [just like Simon Marlow] for Ptr/Addr# and find it slower. If I boost the initial memory with "-A100m", then Ptr still allocates more, but the timing difference becomes quite small.

The Ptr/Addr# code now runs in:

pamac-cek10:tmp chrisk$ time ./addr 100 1000000 +RTS -sstderr -A100m
./a 100 1000000 +RTS -sstderr -A100m
Done.
     48,182,068 bytes allocated in the heap
            276 bytes copied during GC (scavenged)
              0 bytes copied during GC (not scavenged)
         20,480 bytes maximum residency (1 sample(s))

              1 collections in generation 0 (  0.00s)
              1 collections in generation 1 (  0.00s)

             97 Mb total memory in use

  INIT  time    0.00s  (  0.01s elapsed)
  MUT   time    1.54s  (  2.43s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.55s  (  2.44s elapsed)

  %GC time       0.2%  (0.1% elapsed)

  Alloc rate    31,205,254 bytes per MUT second

  Productivity  99.6% of total user, 63.1% of total elapsed

real    0m2.728s
user    0m1.548s
sys     0m0.207s

And the MutableByteArray# code now runs in:

pamac-cek10:tmp chrisk$ time ./mba 100 1000000 +RTS -sstderr -A100m
./m 100 1000000 +RTS -sstderr -A100m
Done.
      4,023,784 bytes allocated in the heap
            276 bytes copied during GC (scavenged)
              0 bytes copied during GC (not scavenged)
         20,480 bytes maximum residency (1 sample(s))

              1 collections in generation 0 (  0.00s)
              1 collections in generation 1 (  0.00s)

            101 Mb total memory in use

  INIT  time    0.00s  (  0.01s elapsed)
  MUT   time    1.50s  (  2.30s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.51s  (  2.32s elapsed)

  %GC time       0.3%  (0.2% elapsed)

  Alloc rate    2,668,201 bytes per MUT second

  Productivity  99.6% of total user, 65.0% of total elapsed

real    0m2.335s
user    0m1.513s
sys     0m0.049s

On Tuesday 17 June 2008, haskell@list.mightyreason.com wrote:
I see that Dan Doel's post favoring Ptr/Addr# has the same allocation amounts (from +RTS -sstderr) for Ptr/Addr# and the MutableByteArray#.
Everyone else sees more allocation for Ptr/Addr# than MBA#, and sees MBA# as faster in those cases.
I myself (on a G4) see more allocation [just like Simon Marlow] for Ptr/Addr# and find it slower. If I boost the initial memory with "-A100m", then Ptr still allocates more, but the timing difference becomes quite small.
Pardon my noise, but is this still with the version of Ptr.hs that would (in your case) allocate a 1-million-element list and traverse it twice, or the revision that fills the array in a loop with an Int#?

If it's the former, and Addr# is tying MutableByteArray# even with operations on a 40-some-megabyte list (if the allocation is any indication), then the actual Addr# operations are probably faster for you, too. :)

I'll attach new, hopefully bug-free versions of the benchmark to this message.

Of course, without the list overhead, the ByteArr appears to allocate much more than Ptr for large arrays, because the n*w byte array shows up in the heap allocation, whereas the malloced memory does not. None of this should be a factor in the actual fannkuch benchmark, of course, which only allocates 3 arrays of size 11.

Cheers,
-- Dan

On Tuesday 17 June 2008 22:37, Dan Doel wrote:
I'll attach new, hopefully bug-free versions of the benchmark to this message.
With -O2 -fvia-C -optc-O3, the difference is small (less than 1%), but today, ByteArr is faster more often.
Of course, without the list overhead, the ByteArr appears to allocate much more than Ptr for large arrays, because the n*w byte array shows up in the heap allocation, whereas the malloced memory does not. None of this should be a factor in the actual fannkuch benchmark, of course, which only allocates 3 arrays of size 11.
Cheers, -- Dan
Cheers, Daniel

On Wednesday 18 June 2008, Daniel Fischer wrote:
Am Dienstag, 17. Juni 2008 22:37 schrieb Dan Doel:
I'll attach new, hopefully bug-free versions of the benchmark to this message.
With -O2 -fvia-C -optc-O3, the difference is small (less than 1%), but today, ByteArr is faster more often.
Hmm, well, I'm a bit flummoxed. I still get Addr# outperforming MBA# by perhaps 10% - 15%, even with -fvia-C -optc-O3 (and before the slight speedup below). Perhaps gcc's optimizer isn't doing as good a job for me for some reason.

In any case, I've entered a bug for this on the GHC Trac:

http://hackage.haskell.org/trac/ghc/ticket/2374

It contains a Ptr benchmark that performs slightly faster on very small arrays (under, say, 40 elements; I noticed such runs were taking more time than those with larger arrays and correspondingly fewer iterations, so I eliminated the replicateM_ in favor of an explicit loop. It gains a little time on the small arrays, but not enough to match the performance on the larger arrays, so I guess there are yet more factors. :) In any case, it makes it closer to being the same code as ByteArr). The bug is filed against the native code generator, since it shows up more clearly there.

I haven't gotten to looking at C-- or assembly yet, but hopefully I will in the near future. I'll try to do further followup on the bug report, since that's probably easier for the developers to keep track of.

Cheers,
-- Dan
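A guess at the shape of the replicateM_ replacement Dan describes; the combinator below is illustrative, not the code from the ticket. A call like replicateM_ k (reverse a 0 (n-1)) becomes times k (reverse a 0 (n-1)).

-- Run an action k times with an explicit counting loop; the comparison
-- in the guard forces the counter on every step, so no thunks pile up.
times :: Int -> IO () -> IO ()
times k act = go 0
  where
    go i | i < k     = act >> go (i + 1)
         | otherwise = return ()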
participants (5)

- Dan Doel
- Daniel Fischer
- haskell@list.mightyreason.com
- Simon Marlow
- Simon Peyton-Jones