
Hi,

I've recently implemented some benchmarks for my library, and while I expected a slowdown for 64-bit code, I'm a bit surprised by the results. In summary:

  with 64-bit ghc 6.6.1, my benchmark runs in ~160 seconds
  with 32-bit ghc 6.6, it runs in ~95 seconds

Most of the time is spent traversing a list of elements, doing a few numerical calculations. Presumably this is due to the increased code size from 8-byte pointers?

I'll add some more benchmarks, but I just wondered whether this is to be expected and, if so, whether I should perhaps be running a 32-bit version of GHC? I tried to Google for other benchmark results, but couldn't find any. Are there any particular GHC options I should use for compiling 64-bit code? I'll install 6.8 RSN; perhaps that will improve things?

Oh, and if anybody wants to play with it, it should be possible to install Data.Binary and HXT, and then:

  darcs get http://malde.org/~ketil/bio
  cd bio
  make bench

-k
--
If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde wrote:
Hi,
I've recently implemented some benchmarks for my library, and while I expected a slowdown for 64-bit code, I'm a bit surprised by the results. In summary:
with 64-bit ghc 6.6.1, my benchmark runs in ~160 seconds
with 32-bit ghc 6.6, it runs in ~95 seconds
Here are my results (with reversed ghc versions) when running your code:

  with 64-bit ghc 6.6:   254 secs
  with 32-bit ghc 6.6.1: 146 secs

Having noticed that the 64-bit ghc-6.6 is slower, I decided to install the 32-bit ghc-6.6.1 only.

Cheers
Christian

My test machine: Dual Core AMD Opteron 2220 Server, 2800 MHz, 16 GB RAM
Linux pollux 2.6.16.27-0.6-xenlocal x86_64 GNU/Linux

Ketil Malde wrote:
I've recently implemented some benchmarks for my library, and while I expected a slowdown for 64-bit code, I'm a bit surprised by the results. In summary:
with 64-bit ghc 6.6.1, my benchmark runs in ~160 seconds
with 32-bit ghc 6.6, it runs in ~95 seconds
Most of the time is spent traversing a list of elements, doing a few numerical calculations. Presumably this is due to the increased code size from 8-byte pointers?
Not so much code size, but data size (heap size, to be more precise). The amount of data shuffled around at runtime is doubled when running a 64-bit version of GHC - the GC has to do twice as much work. The cache hit rate drops, for a given cache size. It would be interesting to know how much time is spent in the GC - run the program with +RTS -sstderr.
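(Aside: a quick way to see the word-size difference being described here is a throwaway program along the following lines. This is only a sketch using Foreign.Storable's sizeOf, not part of the benchmark code; it prints 4 on a 32-bit GHC and 8 on a 64-bit one, and since boxed heap objects are made of word-sized headers, fields, and pointers, heap data scales roughly the same way.)

  -- Sketch: show the native Int and pointer size for this build of GHC.
  import Foreign.Ptr      (Ptr)
  import Foreign.Storable (sizeOf)

  main :: IO ()
  main = do
    putStrLn ("Int size:     " ++ show (sizeOf (undefined :: Int))    ++ " bytes")
    putStrLn ("pointer size: " ++ show (sizeOf (undefined :: Ptr ())) ++ " bytes")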
I'll add some more benchmarks, but just wondered whether this was to be expected, and, if so, whether I perhaps should be running a 32-bit version of GHC?
I guess it's moderately surprising; I don't usually expect to see that much difference. But I suppose if the memory demands of your program are high, then it could be reasonable. There are benefits to running on 64 bits, more registers in particular, but for us this doesn't usually outweigh the extra memory overhead.

Cheers,
Simon

Simon Marlow
Not so much code size, but data size (heap size, to be more precise).
Of course. There was some talk about storing tags in pointers for 6.8; I couldn't find the reference, but I wonder if that would help my situation?
It would be interesting to know how much time is spent in the GC - run the program with +RTS -sstderr.
MUT time decreases a bit (131 to 127s) for x86_64, but GC time increases a lot (98 to 179s).

i686 version:
----------------------------------------
  94,088,199,152 bytes allocated in the heap
  22,294,740,756 bytes copied during GC (scavenged)
   2,264,823,784 bytes copied during GC (not scavenged)
     124,747,644 bytes maximum residency (4138 sample(s))

  179962 collections in generation 0 ( 67.33s)
    4138 collections in generation 1 ( 30.92s)
     248 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time  131.53s  (133.03s elapsed)
  GC    time   98.25s  (100.13s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time  229.78s  (233.16s elapsed)

  %GC time      42.8%  (42.9% elapsed)

  Alloc rate    715,345,865 bytes per MUT second

  Productivity  57.2% of total user, 56.4% of total elapsed
----------------------------------------

x86_64 version:
----------------------------------------
 173,790,326,352 bytes allocated in the heap
  59,874,348,560 bytes copied during GC (scavenged)
   5,424,298,832 bytes copied during GC (not scavenged)
     247,477,744 bytes maximum residency (9856 sample(s))

  331264 collections in generation 0 (111.51s)
    9856 collections in generation 1 ( 67.80s)
     582 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time  127.20s  (127.76s elapsed)
  GC    time  179.32s  (179.63s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time  306.52s  (307.39s elapsed)

  %GC time      58.5%  (58.4% elapsed)

  Alloc rate    1,366,233,874 bytes per MUT second

  Productivity  41.5% of total user, 41.4% of total elapsed
----------------------------------------

I've also added results from the 64-bit ghc-6.8.20071011 binary snapshot, which shows some nice improvements, with one benchmark improving by 30%(!).

----------------------------------------
 151,807,589,712 bytes allocated in the heap
  50,687,462,360 bytes copied during GC (scavenged)
   4,472,003,520 bytes copied during GC (not scavenged)
     256,532,480 bytes maximum residency (6805 sample(s))

  289342 collections in generation 0 ( 89.30s)
    6805 collections in generation 1 ( 60.26s)
     602 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   83.79s  ( 84.36s elapsed)
  GC    time  149.57s  (151.10s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time  233.35s  (235.47s elapsed)

  %GC time      64.1%  (64.2% elapsed)

  Alloc rate    1,811,779,785 bytes per MUT second

  Productivity  35.9% of total user, 35.6% of total elapsed
----------------------------------------
I'll add some more benchmarks
And I did. Below is a bit more detail from the log.

The "rc hash counts" traverse a ByteString, hashing fixed-size words into Integers. As you can see, I haven't yet gotten the SPECIALIZE pragma to work correctly :-). The "global alignment" is the previous test, performing global (Needleman-Wunsch) alignment on pairs of sequences of length 100 (short) or 1000 (long), implementing the dynamic programming matrix as a list of lists.

====================
Start: Fri Oct 12 08:48:36 CEST 2007
Linux nmd9999 2.6.20-16-generic #2 SMP Fri Aug 31 00:55:27 UTC 2007 i686 GNU/Linux
ghc 6.6

--- Sequence bench ---
rc hash counts int (8)  ..... OK, passed 10 tests, CPU time: 34.526157s
rc hash counts int (16) ..... OK, passed 10 tests, CPU time: 34.746172s
rc hash counts (16)     ..... OK, passed 10 tests, CPU time: 34.642164s
rc hash counts (32)     ..... OK, passed 10 tests, CPU time: 35.378212s
Sequence bench totals, CPU time: 139.292705s, wall clock: 139 secs

--- Alignment bench ---
global alignment, short ..... OK, passed 10 tests, CPU time: 2.696168s
global alignment, long  ..... OK, passed 10 tests, CPU time: 90.481655s
Alignment bench totals, CPU time: 93.177823s, wall clock: 94 secs

Total for all tests, CPU time: 232.474528s, wall clock: 233 secs
End: Fri Oct 12 08:52:29 CEST 2007
====================
Start: Fri Oct 12 09:52:33 CEST 2007
Linux nmd9999.imr.no 2.6.22-13-generic #1 SMP Thu Oct 4 17:52:26 GMT 2007 x86_64 GNU/Linux
ghc 6.6.1

--- Sequence bench ---
rc hash counts int (8)  ..... OK, passed 10 tests, CPU time: 36.634289s
rc hash counts int (16) ..... OK, passed 10 tests, CPU time: 36.590286s
rc hash counts (16)     ..... OK, passed 10 tests, CPU time: 36.946309s
rc hash counts (32)     ..... OK, passed 10 tests, CPU time: 37.402338s
Sequence bench totals, CPU time: 147.577222s, wall clock: 148 secs

--- Alignment bench ---
global alignment, short ..... OK, passed 10 tests, CPU time: 3.564223s
global alignment, long  ..... OK, passed 10 tests, CPU time: 156.101756s
Alignment bench totals, CPU time: 159.665979s, wall clock: 159 secs

Total for all tests, CPU time: 307.247201s, wall clock: 307 secs
End: Fri Oct 12 09:57:40 CEST 2007
====================
Start: Fri Oct 12 10:51:27 CEST 2007
Linux nmd9999.imr.no 2.6.22-13-generic #1 SMP Thu Oct 4 17:52:26 GMT 2007 x86_64 GNU/Linux
ghc 6.8.0.20071011

--- Sequence bench ---
rc hash counts int (8)  ..... OK, passed 10 tests, CPU time: 22.773423s
rc hash counts int (16) ..... OK, passed 10 tests, CPU time: 22.657416s
rc hash counts (16)     ..... OK, passed 10 tests, CPU time: 22.513407s
rc hash counts (32)     ..... OK, passed 10 tests, CPU time: 23.009438s
Sequence bench totals, CPU time: 90.953684s, wall clock: 91 secs

--- Alignment bench ---
global alignment, short ..... OK, passed 10 tests, CPU time: 3.168198s
global alignment, long  ..... OK, passed 10 tests, CPU time: 140.808799s
Alignment bench totals, CPU time: 143.976997s, wall clock: 144 secs

Total for all tests, CPU time: 234.930681s, wall clock: 235 secs
End: Fri Oct 12 10:55:23 CEST 2007
====================

-k
--
If I haven't seen further, it is by standing in the footprints of giants
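(Aside: for readers who want a concrete picture of the "list of lists" dynamic programming mentioned above, here is a minimal sketch of Needleman-Wunsch scoring in that style. It is not the actual library code; the scoring constants and function names are made up for illustration.)

  -- Sketch only: global (Needleman-Wunsch) alignment score, with the DP
  -- matrix built lazily as a list of lists.
  import Data.List (zipWith4)

  match, mismatch, gap :: Int
  match    =  1
  mismatch = -1
  gap      = -2

  score :: Char -> Char -> Int
  score a b = if a == b then match else mismatch

  align :: String -> String -> Int
  align xs ys = last (last rows)
    where
      -- Row 0 is pure gap cost; each later row depends on the previous row
      -- (diagonal and up neighbours) and, lazily, on itself (left neighbour).
      rows = row0 : zipWith nextRow rows xs
      row0 = scanl (+) 0 (replicate (length ys) gap)
      nextRow prev x = this
        where
          this = (head prev + gap) : zipWith4 cell prev (tail prev) ys this
          cell diag up y left = maximum [diag + score x y, up + gap, left + gap]

  main :: IO ()
  main = print (align "GATTACA" "GCATGCT")

Since every cell here is a boxed Int reached through pointers, a matrix like this is the kind of heap data whose footprint grows with the word size, which is consistent with the residency numbers above.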

Ketil Malde
I've also added results from the 64 bit ghc-6.8.20071011 binary snapshot, which shows some nice improvements, with one benchmark improving by 30%(!).
One difference between these runs is that the ByteString library, on which this particular benchmark depends heavily, got upgraded from fps-0.7 to bytestring-0.9. I initially thought some of the performance increase could be due to that, but after some contortions, I find that 6.6.1 with bytestring-0.9 gives me slightly worse results(!). (I haven't yet entirely convinced myself that I got this properly set up, but at least ghci lets me :m Data.ByteString.Lazy.Internals, which I believe is a new addition.)

Here are the numbers:

--- Sequence bench ---
rc hash counts int (8)  ..... OK, passed 10 tests, CPU time: 38.778423s
rc hash counts int (16) ..... OK, passed 10 tests, CPU time: 38.522408s
rc hash counts (16)     ..... OK, passed 10 tests, CPU time: 38.694418s
rc hash counts (32)     ..... OK, passed 10 tests, CPU time: 39.170448s
Sequence bench totals, CPU time: 155.165697s, wall clock: 155 secs

--- Alignment bench ---
global alignment, short ..... OK, passed 10 tests, CPU time: 3.492218s
global alignment, long  ..... OK, passed 10 tests, CPU time: 152.497531s
Alignment bench totals, CPU time: 155.989749s, wall clock: 156 secs

Total for all tests, CPU time: 311.155446s, wall clock: 311 secs

-k
--
If I haven't seen further, it is by standing in the footprints of giants
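(Aside: since the "rc hash counts" benchmark comes down to folding over a lazy ByteString and packing fixed-size words into Integers, here is a small illustrative sketch of that kind of loop. The 2-bit nucleotide encoding, the window size, and the function names are assumptions for the example, not the library's actual implementation.)

  -- Sketch: hash every k-character window of a lazy ByteString into an
  -- Integer using an assumed 2-bit code per nucleotide.
  import qualified Data.ByteString.Lazy.Char8 as L

  hashWord :: L.ByteString -> Integer
  hashWord = L.foldl' step 0
    where
      step acc c = acc * 4 + code c
      code 'A' = 0
      code 'C' = 1
      code 'G' = 2
      code _   = 3   -- T and anything else lumped together here

  kmers :: Int -> L.ByteString -> [L.ByteString]
  kmers k s
    | L.length s < fromIntegral k = []
    | otherwise = L.take (fromIntegral k) s : kmers k (L.drop 1 s)

  main :: IO ()
  main = print (map hashWord (kmers 8 (L.pack "ACGTACGTACGT")))

This is only meant to show the shape of the loop; the real benchmark no doubt differs in its encoding and word sizes, but with boxed Integer results it, too, allocates data that grows with the word size.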

Ketil Malde wrote:
Simon Marlow writes:
Not so much code size, but data size (heap size, to be more precise).
Of course.
There was some talk about storing tags in pointers for 6.8; I couldn't find the reference, but I wonder if that would help my situation?
Unfortunately not; the pointer-tagging optimisation doesn't affect space usage, only runtime.
It would be interesting to know how much time is spent in the GC - run the program with +RTS -sstderr.
MUT time decreases a bit (131 to 127s) for x86_64, but GC time increases a lot (98 to 179s).
Right - GC time doubled, which is what we'd expect to see when the resident data size doubles. The decrease in MUT time is probably due to the extra registers available, but MUT time would also be affected by the increase in data size, because the cache hit rate should be lower.

On the whole, I think these results are to be expected. There isn't much we can do to improve things in the short term, I'm afraid. Improvements in the pipeline will hopefully enable us to make better use of the extra registers on x86_64, and perhaps parallel GC in the future will get that GC time down again.

Cheers,
Simon

Hello Simon, Monday, October 15, 2007, 2:52:10 PM, you wrote:
Right - GC time doubled, which is what we'd expect to see when the resident data size doubles. The decrease in MUT time is probably due to the extra registers available, but MUT time would also be affected by the increase in data size, because the cache hit rate should be lower. On the whole, I think these results are to be expected.
There isn't much we can do to improve things in the short term, I'm afraid.
Using just "+RTS -A10m" may help.

--
Best regards,
 Bulat                          mailto:Bulat.Ziganshin@gmail.com
participants (4):
- Bulat Ziganshin
- Christian Maeder
- Ketil Malde
- Simon Marlow