
Justin Bailey wrote:
On Dec 20, 2007 7:42 PM, Sterling Clover wrote:

I'm curious how much of the unboxing helped performance and how much didn't. In my experience playing with this stuff, GHC's strictness analyzer has consistently been really excellent, given the right hints. Unboxed tuples are generally a win, but GHC was often smarter at unboxing ints than I was.
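For instance, a toy loop like the following (an illustrative sketch, not code from this thread) is usually all it takes: the bang pattern is the hint, and with -O2 GHC's strictness and unboxing analyses should keep the accumulator as a raw Int# in the compiled loop, with no manual unboxing in the source.

    {-# LANGUAGE BangPatterns #-}

    -- Strict accumulator: the bang tells GHC the loop is strict in acc,
    -- so the worker/wrapper transformation can unbox acc and i with -O2.
    sumTo :: Int -> Int
    sumTo n = go 0 1
      where
        go !acc i
          | i > n     = acc
          | otherwise = go (acc + i) (i + 1)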
It really did help. I started with an implementation that used Ints, and switching to unboxed words sped the program up by at least 2x. I think that's because of the bit manipulation I'm doing. For example, Data.Bits defines the bitwise and operation on Ints as:
(I# x#) .&. (I# y#) = I# (word2Int# (int2Word# x# `and#` int2Word# y#))
As you can see, that has three conversions in it.
I believe you're barking up the wrong tree here. Consider

    module Q (f, g) where

    import GHC.Base

    f :: Word# -> Word# -> Word#
    f a b = a `and#` b

    g :: Int# -> Int# -> Int#
    g a b = word2Int# (int2Word# a `and#` int2Word# b)

If you look at the generated machine code, you'll find that f and g are identical functions. The sole purpose of the int2Word# and word2Int# operations is to satisfy the type checker. (This is even true at the Core level. The Core language is typed, so you still need to do the conversions there; no code is generated for them, though.)
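If you want to check that yourself, -ddump-asm shows the generated assembly and -ddump-simpl shows the Core. Something like the following should do it (flags from memory, so double-check them against your GHC version):

    ghc -c -O2 -fglasgow-exts -ddump-asm Q.hs
    ghc -c -O2 -fglasgow-exts -ddump-simpl Q.hs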
The I# deconstruction/construction is also suspicious to me, but I don't know if there are performance costs there or not. Regardless, using Word# directly lets me write (assuming w1# and w2# are already words):
w1# `and#` w2#
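For illustration, a hypothetical helper written entirely at the Word# level might look like this (a sketch, not the actual program code); the and# then needs no conversions around it at all:

    {-# LANGUAGE MagicHash #-}
    module BitTest where

    import GHC.Base

    -- Hypothetical helper: are any of the bits in mask# set in w#?
    hasAny# :: Word# -> Word# -> Bool
    hasAny# w# mask# =
        case w# `and#` mask# of
            0## -> False
            _   -> True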
The I# deconstruction has a cost, but with proper strictness annotations GHC should optimize those away - check the intermediate Core to see whether it actually does.

In fact, most of the speedup you got seems to come from the use of uncheckedShiftL# and uncheckedShiftRL#. Just using

    shiftL, shiftR :: Int -> Int -> Int
    I# a `shiftR` I# b = I# (word2Int# (int2Word# a `uncheckedShiftRL#` b))
    I# a `shiftL` I# b = I# (word2Int# (int2Word# a `uncheckedShiftL#` b))

speeds up the stepWithUArray code by a factor of 2 here, starting with the first program at http://hpaste.org/4151.

The normal shiftL and shiftR implementations include bounds checks so that shifts bigger than the word size give the correct result. (For example, 1 `shiftL` 65 = 0, while 1 `uncheckedShiftL#` 65 = 2 - probably; it depends on your architecture.) Eliminating those checks results in a major speedup.

Apparently those bounds checks aren't eliminated even when the shift width is constant. I wonder why; there's clearly some room for optimization here.

(I used ghc 6.8.1 on an Athlon XP, with -O2 and no fancy options.)

enjoy,

Bertram
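(Addendum, not part of the original mail: a rough sketch of what a checked shift has to do on every call. This is a paraphrase for illustration, not GHC's actual library code; the point is just that each boxed shiftL pays for a comparison against the word size that uncheckedShiftL# skips.)

    {-# LANGUAGE MagicHash #-}
    module CheckedShift where

    import Data.Bits (bitSize)
    import GHC.Base

    -- Sketch of a bounds-checked left shift: widths at or above the word
    -- size are forced to 0, everything else goes to the unchecked primop.
    checkedShiftL :: Int -> Int -> Int
    checkedShiftL x@(I# a) (I# b)
        | I# b >= bitSize x = 0
        | otherwise         = I# (word2Int# (int2Word# a `uncheckedShiftL#` b))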