
Justin Bailey wrote:
On Dec 20, 2007 7:42 PM, Sterling Clover wrote:

I'm curious how much of the unboxing helped performance and how much didn't. In my experience playing with this stuff, GHC's strictness analyzer has consistently been really excellent, given the right hints. Unboxed tuples are generally a win, but GHC was often smarter at unboxing ints than I was.
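For instance, a toy loop like the following (an illustrative sketch, not code from this thread) is usually all it takes: the bang pattern is the hint, and with -O2 GHC's strictness and unboxing analyses should keep the accumulator as a raw Int# in the compiled loop, with no manual unboxing in the source.

    {-# LANGUAGE BangPatterns #-}

    -- Strict accumulator: the bang tells GHC the loop is strict in acc,
    -- so the worker/wrapper transformation can unbox acc and i with -O2.
    sumTo :: Int -> Int
    sumTo n = go 0 1
      where
        go !acc i
          | i > n     = acc
          | otherwise = go (acc + i) (i + 1)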
It really did help. I started with an implementation that used Ints, and switching to unboxed words sped the program up by at least 2x. I think that's because of the bit manipulation I'm doing. For example, Data.Bits defines the bitwise and operation on Ints as:
(I# x#) .&. (I# y#) = I# (word2Int# (int2Word# x# `and#` int2Word# y#))
As you can see, that has three conversions in it.
I believe you're barking up the wrong tree here. Consider

    module Q (f, g) where

    import GHC.Base

    f :: Word# -> Word# -> Word#
    f a b = a `and#` b

    g :: Int# -> Int# -> Int#
    g a b = word2Int# (int2Word# a `and#` int2Word# b)

If you look at the generated machine code, you'll find that f and g are identical functions. The sole purpose of the int2Word# and word2Int# operations is to satisfy the type checker. (This is even true at the Core level. The Core language is typed, so you still need to do the conversions there; no code is generated for them, though.)
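If you want to check that yourself, -ddump-asm shows the generated assembly and -ddump-simpl shows the Core. Something like the following should do it (flags from memory, so double-check them against your GHC version):

    ghc -c -O2 -fglasgow-exts -ddump-asm Q.hs
    ghc -c -O2 -fglasgow-exts -ddump-simpl Q.hs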
The I# deconstruction/construction is also suspicious to me, but I don't know if there are performance costs there or not. Regardless, using Word# directly lets me write (assuming w1# and w2# are already words):
w1# `and#` w2#
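For illustration, a hypothetical helper written entirely at the Word# level might look like this (a sketch, not the actual program code); the and# then needs no conversions around it at all:

    {-# LANGUAGE MagicHash #-}
    module BitTest where

    import GHC.Base

    -- Hypothetical helper: are any of the bits in mask# set in w#?
    hasAny# :: Word# -> Word# -> Bool
    hasAny# w# mask# =
        case w# `and#` mask# of
            0## -> False
            _   -> True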
The I# deconstruction has a cost, but with proper strictness annotations GHC should optimize those away - check the intermediate Core to see whether it actually does.

In fact, most of the speedup you got seems to come from the use of uncheckedShiftL# and uncheckedShiftRL#. Just using

    shiftL, shiftR :: Int -> Int -> Int
    I# a `shiftR` I# b = I# (word2Int# (int2Word# a `uncheckedShiftRL#` b))
    I# a `shiftL` I# b = I# (word2Int# (int2Word# a `uncheckedShiftL#` b))

speeds up the stepWithUArray code by a factor of 2 here, starting with the first program at http://hpaste.org/4151.

The normal shiftL and shiftR implementations include bounds checks so that shifts bigger than the word size give the correct result. (For example, 1 `shiftL` 65 = 0, while 1 `uncheckedShiftL#` 65 = 2 - probably; it depends on your architecture.) Eliminating those checks results in a major speedup.

Apparently those bounds checks aren't eliminated even when the shift width is constant. I wonder why; there's clearly some room for optimization here.

(I used ghc 6.8.1 on an Athlon XP, with -O2 and no fancy options.)

enjoy,

Bertram
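(Addendum, not part of the original mail: a rough sketch of what a checked shift has to do on every call. This is a paraphrase for illustration, not GHC's actual library code; the point is just that each boxed shiftL pays for a comparison against the word size that uncheckedShiftL# skips.)

    {-# LANGUAGE MagicHash #-}
    module CheckedShift where

    import Data.Bits (bitSize)
    import GHC.Base

    -- Sketch of a bounds-checked left shift: widths at or above the word
    -- size are forced to 0, everything else goes to the unchecked primop.
    checkedShiftL :: Int -> Int -> Int
    checkedShiftL x@(I# a) (I# b)
        | I# b >= bitSize x = 0
        | otherwise         = I# (word2Int# (int2Word# a `uncheckedShiftL#` b))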