Hi Rahul,
The reason I thought that maybe GHC didn't call the CPU instruction was due to my benchmarking, which showed that the built-in popCount was slower than the pure Haskell Broadword implementation:
main = defaultMain
[ env setupEnv (\bv -> bgroup "Rank"
[ bench "PopCnt1 Broadword - Once" (nf (map (\n -> getCount (PC1BW.popCount1 (DVS.take n bv)))) [0, 1000..100000])
, bench "PopCnt1 GHC - Once" (nf (map (\n -> getCount (PC1GHC.popCount1 (DVS.take n bv)))) [0, 1000..100000])
] )
]
Benchmarking results follow:
benchmarking Rank/PopCnt1 Broadword - Once
time 18.49 ms (17.89 ms .. 19.08 ms)
0.994 R² (0.989 R² .. 0.998 R²)
mean 19.62 ms (19.22 ms .. 20.04 ms)
std dev 1.026 ms (846.2 μs .. 1.315 ms)
variance introduced by outliers: 21% (moderately inflated)
benchmarking Rank/PopCnt1 GHC - Once
time 36.80 ms (36.25 ms .. 37.57 ms)
0.998 R² (0.996 R² .. 1.000 R²)
mean 37.14 ms (36.72 ms .. 37.97 ms)
std dev 1.100 ms (650.6 μs .. 1.680 ms)
This makes the built-in nearly two times slower than the pure Haskell. That's how I got into the ghc-prim code.
This is the broadword implementation I was using:
popCount1 x0 = ((x3 * 0x0101010101010101) .>. 56)
where
x1 = x0 - ((x0 .&. 0xaaaaaaaaaaaaaaaa) .>. 1)
x2 = (x1 .&. 0x3333333333333333) + ((x1 .>. 2) .&. 0x3333333333333333)
x3 = (x2 + (x2 .>. 4)) .&. 0x0f0f0f0f0f0f0f0f
Maybe all I need to do is compile with some flags or switch the backend or something? Any ideas?
Cheers,
-John