Counting bits: Sanity Check

Hi All, Since it seems that real applications need more than just union, intersection, difference and complement to be fast to make EnumSet useful, I've been looking into the less naive approaches to the other things. In particular, size seems to find itself in the inner loop. I've made a comparison of various approaches to bit counting. It seems I was too hasty to declare Bulat's suggestion of table lookup (table,table32) the winner. It seems Robert's suggestion of Kernighan's (kern) method is faster. I also implemented the method described in pages 187-188 of Software Optimization Guide for AMD Athlon™ 64 and Opteron™ Processors. (ones32) It's slower on my powerbook, but may be the winner on 64bit processors. Here are the results: [Marcel:~/devl/EnumSet] david% time ./bits 2000000 3000000 ones32 21 1.788u 0.136s 0:03.30 57.8% 0+0k 0+0io 0pf+0w [Marcel:~/devl/EnumSet] david% time ./bits 2000000 3000000 table 21 2.404u 0.164s 0:04.96 51.6% 0+0k 0+1io 0pf+0w [Marcel:~/devl/EnumSet] david% time ./bits 2000000 3000000 table32 21 2.067u 0.140s 0:04.27 51.5% 0+0k 0+0io 0pf+0w [Marcel:~/devl/EnumSet] david% time ./bits 2000000 3000000 kern 21 1.729u 0.137s 0:03.25 56.9% 0+0k 0+1io 0pf+0w If you'd like to give it a whirl on your fancy modern computers, here's the code: ghc -O3 -optc-O3 -o bits BitTwiddle.hs Of course, if I've done something lame-brained and skewed the results, please let me know. Cheers, David -------------------------------- David F. Place mailto:d@vidplace.com

On Apr 11, 2006, at 10:09 AM, David F. Place wrote:
Hi All,
Since it seems that real applications need more than just union, intersection, difference and complement to be fast to make EnumSet useful, I've been looking into the less naive approaches to the other things. In particular, size seems to find itself in the inner loop. I've made a comparison of various approaches to bit counting. It seems I was too hasty to declare Bulat's suggestion of table lookup (table,table32) the winner. It seems Robert's suggestion of Kernighan's (kern) method is faster.
I also implemented the method described in pages 187-188 of Software Optimization Guide for AMD Athlon™ 64 and Opteron™ Processors. (ones32) It's slower on my powerbook, but may be the winner on 64bit processors. Here are the results:
[Marcel:~/devl/EnumSet] david% time ./bits 2000000 3000000 ones32 21 1.788u 0.136s 0:03.30 57.8% 0+0k 0+0io 0pf+0w [Marcel:~/devl/EnumSet] david% time ./bits 2000000 3000000 table 21 2.404u 0.164s 0:04.96 51.6% 0+0k 0+1io 0pf+0w [Marcel:~/devl/EnumSet] david% time ./bits 2000000 3000000 table32 21 2.067u 0.140s 0:04.27 51.5% 0+0k 0+0io 0pf+0w [Marcel:~/devl/EnumSet] david% time ./bits 2000000 3000000 kern 21 1.729u 0.137s 0:03.25 56.9% 0+0k 0+1io 0pf+0w
If you'd like to give it a whirl on your fancy modern computers, here's the code:
ghc -O3 -optc-O3 -o bits BitTwiddle.hs
I get similar results on my machine (PPC powerbook). 'ones32' barely edges out 'kern' and 'table' and 'table32' come in behind. Average 'user' time over three runs: ones32: 0.540s kern: 0.545s table: 0.730s table32: 0.632s
Of course, if I've done something lame-brained and skewed the results, please let me know.
Cheers, David
-------------------------------- David F. Place mailto:d@vidplace.com
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Rob Dockins Speak softly and drive a Sherman tank. Laugh hard; it's a long way to the bank. -- TMBG

On Wednesday 12 April 2006 02:09, David F. Place wrote:
If you'd like to give it a whirl on your fancy modern computers,
Averages of user time for three runs on an Athlon64 running 64bit linux: kern 0.29700 ones32 0.30733 table32 0.37333 table 0.38400 I ran a whole lot more of kern and ones32... kern was consistently faster than ones32. Curious. Daniel

On Apr 12, 2006, at 3:30 PM, Daniel McAllansmith wrote:
Averages of user time for three runs on an Athlon64 running 64bit linux:
kern 0.29700 ones32 0.30733 table32 0.37333 table 0.38400
I ran a whole lot more of kern and ones32... kern was consistently faster than ones32. Curious.
Yes, especially curious since the algorithm is taken from AMD's optimization guide for the Athlon and Opteron series. I'm not good enough at reading core syntax to be able to see what GHC is doing with it. I wonder how this other crazy algorithm I found will work on your 64 bit omputer. It is much slower on my 32 bit PPC powerbook for obvious reasons. If you'd like to try it, I'll include an updated BitTwiddle.hs . Usage: time ./bits 2000000 3000000 64 -------------------------------- David F. Place mailto:d@vidplace.com

On Thursday 13 April 2006 08:55, David F. Place wrote:
On Apr 12, 2006, at 3:30 PM, Daniel McAllansmith wrote:
Averages of user time for three runs on an Athlon64 running 64bit linux:
kern 0.29700 ones32 0.30733 table32 0.37333 table 0.38400
I ran a whole lot more of kern and ones32... kern was consistently faster than ones32. Curious.
Yes, especially curious since the algorithm is taken from AMD's optimization guide for the Athlon and Opteron series. I'm not good enough at reading core syntax to be able to see what GHC is doing with it.
I wonder how this other crazy algorithm I found will work on your 64 bit omputer. It is much slower on my 32 bit PPC powerbook for obvious reasons. If you'd like to try it, I'll include an updated BitTwiddle.hs .
Usage: time ./bits 2000000 3000000 64
Averages of user time of five runs on an Athlon64 running 64bit linux: 64 0.1974 kern 0.2980 ones32 0.3240 table32 0.3754 table 0.3798 64 looks to be a good bit faster. You didn't change anything in the ones32 algorithm did you? The other algorithms are taking roughly what they did last time, but ones32 seems consistently slower now. Daniel

On Apr 12, 2006, at 5:21 PM, Daniel McAllansmith wrote:
Averages of user time of five runs on an Athlon64 running 64bit linux:
64 0.1974 kern 0.2980 ones32 0.3240 table32 0.3754 table 0.3798
64 looks to be a good bit faster.
You didn't change anything in the ones32 algorithm did you? The other algorithms are taking roughly what they did last time, but ones32 seems consistently slower now.
Thanks for running it. Looks like "64" is worth the trouble. If the 32 bit wide algorithm is so fast, the 12 bit wide one must be blazingly fast. (See my other thread "The Marriage of Heaven and Hell.") I didn't make any changes to ones32 according to "diff." Cheers, David -------------------------------- David F. Place mailto:d@vidplace.com

Hello David, Thursday, April 13, 2006, 12:55:05 AM, you wrote:
Yes, especially curious since the algorithm is taken from AMD's optimization guide for the Athlon and Opteron series. I'm not good enough at reading core syntax to be able to see what GHC is doing with it.
optimization for GHC is far away from low-level asm optimization so it is not surprise that this don't work -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
participants (4)
-
Bulat Ziganshin
-
Daniel McAllansmith
-
David F. Place
-
Robert Dockins