Optimizing cellular automata & the beauty of unlifted types

I'm back with another version of my cellular automata simulator. Since the
last iteration I discovered GHC's unlifted types and the primitive
operations that go with them. Using these types, rather than Ints, sped my
code up by 2x or more.
http://hpaste.org/4151#a2 -- first half of program
http://hpaste.org/4151#a3 -- remaining (note 3 lines or so from first half
are repeated)
The key observation came from looking at the Core files, which showed a lot
of int2Word# and word2Int# conversions going on. Figuring out how to remove
those led me to the unlifted types. Coding with these types is not easy (for
example, I couldn't see a way to write a Word# literal directly - I had to
write stuff like "int2Word# 1#"), but because I had an existing algorithm to
guide me, combined with type checking, the implementation was fairly
straightforward.
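For readers who haven't met these types, here is a minimal sketch of what coding against the unlifted primops looks like. This is illustrative, not the simulator's actual code, and it assumes a GHC where GHC.Exts re-exports the primops:

```haskell
{-# LANGUAGE MagicHash #-}
-- A minimal sketch of unlifted Word# usage (illustrative, not the
-- simulator's code). MagicHash enables the # in names and literals.
import GHC.Exts (and#, or#, int2Word#, Word (W#))

-- Unlifted values can't be bound at top level, so Word# work happens
-- inside functions on boxed Words. (Note: current GHC also accepts
-- 1## as a Word# literal, sidestepping the int2Word# 1# workaround.)
lowNibble :: Word -> Word
lowNibble (W# w) = W# (w `and#` int2Word# 15#)

combine :: Word -> Word -> Word
combine (W# a) (W# b) = W# (a `or#` b)

main :: IO ()
main = do
  print (lowNibble 0xAB)  -- 11 (0xB)
  print (combine 8 3)     -- 11
```

The boxed wrappers exist only at the edges; inside, the values stay unboxed.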
At first I felt kind of bad about using these operations, but then I noticed
they are used pretty heavily by GHC itself. If it's good enough for GHC,
it's good enough for me. The 2x performance gain didn't hurt either.
Finally, the safety that comes from using the ST monad is just awesome. I've
got low-level bit munging combined with high-level abstraction where I need
it. So cool!
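The ST point deserves a concrete illustration. A tiny sketch of the pattern (a hypothetical helper, not the simulator's code): mutate an unboxed array in place, then freeze it behind a pure interface.

```haskell
-- Hypothetical helper showing the ST pattern: destructive updates on an
-- unboxed array, wrapped in a pure function via runSTUArray.
import Control.Monad (forM_)
import Data.Array.ST (newListArray, readArray, runSTUArray, writeArray)
import Data.Array.Unboxed (elems)
import Data.Bits (xor)
import Data.Word (Word8)

-- Flip every cell of a row in place; no mutation leaks to the caller.
invertRow :: [Word8] -> [Word8]
invertRow xs = elems $ runSTUArray $ do
  let n = length xs
  arr <- newListArray (0, n - 1) xs
  forM_ [0 .. n - 1] $ \i -> do
    c <- readArray arr i
    writeArray arr i (c `xor` 1)
  return arr

main :: IO ()
main = print (invertRow [0, 1, 1, 0])  -- [1,0,0,1]
```

The type system guarantees the mutable array never escapes, which is exactly the "low-level bit munging with high-level abstraction" combination described above.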
I was disappointed to find early on that using higher-order functions in
tight loops was a performance killer. It's unfortunate because I started
with a very elegant, short implementation based on a simple ring buffer and
map. The current version is certainly harder to understand and has some
weird limitations. However, having the simple implementation let me use
QuickCheck to compare their results on random rules and inputs, which gave
me high confidence that my complex implementation is correct.
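This cross-checking technique is worth sketching. The step functions below are hypothetical stand-ins, and to stay dependency-free the sketch enumerates all small inputs rather than using QuickCheck's random ones; a real property test would wrap the same equation in quickCheck:

```haskell
-- Cross-check a simple reference step function against an "optimized"
-- variant on every small input (stand-ins, not the author's code).
import Control.Monad (replicateM)

-- Reference: XOR of the two neighbours on a circular row.
stepSimple :: [Int] -> [Int]
stepSimple xs = zipWith3 (\l _ r -> (l + r) `mod` 2)
                         (last xs : xs) xs (tail xs ++ [head xs])

-- Same rule written with explicit modular indexing.
stepFast :: [Int] -> [Int]
stepFast xs = [ (xs !! ((i - 1) `mod` n) + xs !! ((i + 1) `mod` n)) `mod` 2
              | i <- [0 .. n - 1] ]
  where n = length xs

-- Compare the two on every 0/1 row up to length m.
agreeUpTo :: Int -> Bool
agreeUpTo m = and [ stepSimple xs == stepFast xs
                  | k <- [1 .. m], xs <- replicateM k [0, 1] ]

main :: IO ()
main = print (agreeUpTo 8)  -- True
```

Once the reference version is trusted, any divergence found this way points straight at a bug in the optimized one.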
One thing I absolutely love about this program is its memory performance. It
manages to stay within 1 - 10 MB of memory, depending on how much output is
produced. How cool is that?
On Dec 3, 2007 2:44 AM, Mirko Rahn wrote:
It is interesting that the naive implementation ... is only 3 times slower
than your quite complex, hard to follow and hard to debug implementation.
Now the naive implementation is 100x slower, so I don't feel so bad about this comment any more.
As always, I prefer to write most code in Haskell: quick, easy, nice, reasonably fast... If speed matters, I switch to some lower-level language, as you did while staying inside Haskell.
I have to take exception here - I *want* to write my code in Haskell. If Haskell isn't fast enough to beat the C implementation, I'd rather find a way to make my Haskell program faster than switch to some other language.

Justin

I'm curious how much of the unboxing helped performance and how much didn't. In my experience playing with this stuff, GHC's strictness analyzer has consistently been really excellent, given the right hints. Unboxed tuples are generally a win, but GHC was often smarter at unboxing ints than I was.

Also, higher-order functions can be used fine, assuming they're not awfully recursive, as long as you set GHC's unfolding threshold high enough. You can also manually force their inlining, to really get control. I found there tended to be a "sweet spot" for any given program, where much higher or lower would degrade performance. As far as removing the word2Int# and so forth, you could probably just use words throughout?

Also, as we discovered the other week, you may want to be careful with bang patterns, as forcing strictness on already-evaluated values takes some time. Although, SPJ did point out that this was significantly improved in the new GHC.

But yes, I found that going through and manually unboxing a working implementation with the help of type errors was actually a sort of relaxing and zenlike exercise.

--S
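The inlining point above can be sketched concretely (names illustrative): an INLINE pragma on a higher-order loop makes GHC inline it at each call site regardless of the unfolding threshold (tunable with -funfolding-use-threshold), so the function argument can be specialized away:

```haskell
{-# LANGUAGE BangPatterns #-}
-- Illustrative sketch: force inlining of a higher-order tight loop so
-- GHC can specialize away the function argument at each call site.
foldRange :: (Int -> Int -> Int) -> Int -> Int -> Int -> Int
foldRange f = go
  where
    go !acc i hi
      | i > hi    = acc
      | otherwise = go (f acc i) (i + 1) hi
{-# INLINE foldRange #-}

sumTo :: Int -> Int
sumTo = foldRange (+) 0 1

main :: IO ()
main = print (sumTo 100)  -- 5050
```

After inlining, GHC sees a first-order strict loop over Ints at the call site and can unbox the whole thing.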

On Dec 20, 2007 7:42 PM, Sterling Clover wrote:
I'm curious how much of the unboxing helped performance and how much didn't. In my experience playing with this stuff, GHC's strictness analyzer has consistently been really excellent, given the right hints. Unboxed tuples are generally a win, but GHC was often smarter at unboxing ints than I was.
It really did help. I started with an implementation that used Ints, and this sped the program up by at least 2x. I think that's because of the bit manipulation I'm doing. For example, Data.Bits defines the bitwise and operation on Ints as:

  (I# x#) .&. (I# y#) = I# (word2Int# (int2Word# x# `and#` int2Word# y#))

which you can see has 3 conversions in it. The I# deconstruction/construction is also suspicious to me, but I don't know if there are performance costs there or not. Regardless, using Word# directly lets me write (assuming w1# and w2# are already words):

  w1# `and#` w2#

Maybe better rewrite rules in the Data.Bits library would eliminate unnecessary conversions and this wouldn't be necessary.
Also, higher-order functions can be used fine, assuming they're not awfully recursive, as long as you set GHC's unfolding threshold high enough. You can also manually force their inlining, to really get control. I found there tended to be a "sweet spot" for any given program, where much higher or lower would degrade performance.

I didn't know about unfolding, and I'll have to try that sometime. I found that inlining seemed to have negative effects as often as positive ones. I spent most of my time so far coming up with different algorithms - now that I've got a decent one, I'll have to play with inlining.

As far as removing the word2Int# and so forth, you could probably just use words throughout?
I try to. I just couldn't figure out how to write a Word# literal. For example, eqWord# compares Word# values, but this gives a type error:

  1# `eqWord#` 1#

I have to write:

  int2Word# 1# `eqWord#` int2Word# 1#
Also, as we discovered the other week, you may want to be careful with bang patterns, as forcing strictness on already evaluated values takes some time. Although, SPJ did point out that this was significantly improved in the new GHC.
Good point, thanks.

Justin

On Dec 21, 2007 3:00 PM, Justin Bailey wrote:
It really did help. I started with an implementation that used Ints, and this sped the program up by at least 2x. I think that's because of the bit-manipulation I'm doing. For example, Data.Bits defines the bitwise and operation on Ints as:
(I# x#) .&. (I# y#) = I# (word2Int# (int2Word# x# `and#` int2Word# y#))
Which you can see has 3 conversions in it. The I# deconstruction/construction is also suspicious to me, but I don't know if there are performance costs there or not. Regardless, using Word# directly lets me write (assuming w1# and w2# are already words):
w1# `and#` w2#
Maybe better rewrite rules in the Data.Bits library would eliminate unnecessary conversions and this wouldn't be necessary
Wouldn't it be sufficient to use Data.Word and rely on GHC's unboxing capabilities? Compiling

  import Data.Bits
  import Data.Word

  main = do
    a <- getLine
    b <- getLine
    print (read a .&. read b :: Word)

with GHC 6.6.1 gives me

  (case GHC.Read.read @ GHC.Word.Word GHC.Word.$f40 a_afE of wild1_a1tT
   { GHC.Word.W# x#_a1tV ->
   case GHC.Read.read @ GHC.Word.Word GHC.Word.$f40 a87_a1ub of wild11_a1tW
   { GHC.Word.W# y#_a1tY ->
   GHC.Word.$w$dmshow1 (GHC.Prim.and# x#_a1tV y#_a1tY) } })

which of course does the right thing.

Cheers,

-- Felipe.
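Felipe's point generalizes beyond one expression: a strict loop over ordinary lifted Words gets the same treatment via GHC's worker/wrapper transform. A hypothetical fold (verifiable with -ddump-simpl, where the accumulator shows up as a raw Word#):

```haskell
{-# LANGUAGE BangPatterns #-}
-- Hypothetical strict fold over lifted Words; GHC's worker/wrapper
-- transform unboxes the accumulator to a raw Word# (see -ddump-simpl).
import Data.Bits ((.&.))
import Data.Word (Word)

andAll :: [Word] -> Word
andAll = go maxBound
  where
    go !acc []       = acc
    go !acc (w : ws) = go (acc .&. w) ws

main :: IO ()
main = print (andAll [0xFF, 0x3C, 0x1E])  -- 28
```

The bang patterns are the "right hints" Sterling mentions: once the accumulator is visibly strict, no boxed Word survives in the inner loop.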

Justin Bailey wrote:
On Dec 20, 2007 7:42 PM, Sterling Clover wrote:
I'm curious how much of the unboxing helped performance and how much didn't. In my experience playing with this stuff, GHC's strictness analyzer has consistently been really excellent, given the right hints. Unboxed tuples are generally a win, but GHC was often smarter at unboxing ints than I was.
It really did help. I started with an implementation that used Ints, and this sped the program up by at least 2x. I think that's because of the bit-manipulation I'm doing. For example, Data.Bits defines the bitwise and operation on Ints as:
(I# x#) .&. (I# y#) = I# (word2Int# (int2Word# x# `and#` int2Word# y#))
Which you can see has 3 conversions in it.
I believe you're barking up the wrong tree here. Consider

| module Q (f, g) where
|
| import GHC.Base
|
| f :: Word# -> Word# -> Word#
| f a b = a `and#` b
|
| g :: Int# -> Int# -> Int#
| g a b = word2Int# (int2Word# a `and#` int2Word# b)

If you look at the generated machine code, you'll find that f and g are identical functions. The sole purpose of the int2Word# and word2Int# operations is to satisfy the type checker. (This is even true at the Core level. The Core language is typed, so you still need to do the conversions there; no code is generated for them, though.)
The I# deconstruction/construction is also suspicious to me, but I don't know if there are performance costs there or not. Regardless, using Word# directly lets me write (assuming w1# and w2# are already words):
w1# `and#` w2#
The I# deconstruction has a cost, but with proper strictness annotations GHC should optimize those away - check the intermediate Core to see whether it actually does.

In fact, most of the speedup you got seems to come from the use of uncheckedShiftL# and uncheckedShiftRL# - just using

  shiftL, shiftR :: Int -> Int -> Int
  I# a `shiftR` I# b = I# (word2Int# (int2Word# a `uncheckedShiftRL#` b))
  I# a `shiftL` I# b = I# (word2Int# (int2Word# a `uncheckedShiftL#` b))

speeds up the stepWithUArray code by a factor of 2 here, starting with the first program at http://hpaste.org/4151.

The normal shiftL and shiftR implementations include bounds checks to get the results right for shifts that are bigger than the word size. (For example, 1 `shiftL` 65 = 0, while 1 `uncheckedShiftL#` 65 = 2, probably - it depends on your architecture.) Eliminating those checks results in a major speedup. Apparently those bounds checks aren't eliminated even if the shift width is constant. I wonder why; there's clearly some room for optimization here.

(I used GHC 6.8.1 on an Athlon XP, with ghc -O2 and no fancy options.)

enjoy,

Bertram

On Dec 21, 2007 2:55 PM, Bertram Felgenhauer wrote:
If you look at the generated machine code, you'll find that f and g are identical functions. The sole purpose of the int2Word# and word2Int# operations is to satisfy the type checker. (This is even true at the core level. The core language is typed, so you still need to do the conversions there. No code is generated for them though.)
Good to know. They are scary!
The I# deconstruction has a cost, but with proper strictness annotations ghc should optimize those away - check the intermediate Core to see whether it actually does.
If I see things like GHC.Prim.Intzh, is that a clue it's the "unlifted" type?
In fact most of the speedup you got seems to come from the use of uncheckedShiftL# and uncheckedShiftRL# - just using
  shiftL, shiftR :: Int -> Int -> Int
  I# a `shiftR` I# b = I# (word2Int# (int2Word# a `uncheckedShiftRL#` b))
  I# a `shiftL` I# b = I# (word2Int# (int2Word# a `uncheckedShiftL#` b))
speeds up the stepWithUArray code by a factor of 2 here, starting with the first program at http://hpaste.org/4151.
That is great. I tried your speedup and you are right - just redefining those makes the "lifted" version faster than the unlifted. Too bad there isn't an "unsafe" version of the shifts available.

Justin
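For what it's worth, later versions of base (4.5, shipped with GHC 7.4) added exactly this: Data.Bits.unsafeShiftL and unsafeShiftR skip the bounds check Bertram describes, while the checked shifts stay total:

```haskell
-- Checked vs unchecked shifts in base 4.5+. The checked versions
-- bounds-check oversized shift counts; unsafeShiftL omits the check,
-- like the uncheckedShiftL# primop discussed in the thread.
import Data.Bits (shiftL, unsafeShiftL)

main :: IO ()
main = do
  print ((1 :: Word) `shiftL` 3)        -- 8
  print ((1 :: Word) `shiftL` 200)      -- 0: oversized count is checked
  print ((1 :: Word) `unsafeShiftL` 3)  -- 8: no bounds check
```

With an in-range shift count the two agree; passing an oversized count to unsafeShiftL is undefined, which is the price of dropping the check.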
participants (4)
- Bertram Felgenhauer
- Felipe Lessa
- Justin Bailey
- Sterling Clover