Speed of simple operations with Ptr Word32s

Hi all,
I was under the impression that simple code like the below, which swaps
the endianness of a block of data, ought to be near C speed:
-----8<----------8<----------8<----------8<-----
module Main (main) where
import Word (Word32)
import Foreign.Ptr (Ptr)
import Foreign.Marshal.Array (mallocArray, advancePtr)
import Foreign.Storable (peek, poke)
import Bits ((.|.), (.&.), shiftL, shiftR)
main :: IO ()
main = do p <- mallocArray 104857600
foo p 104857600
foo :: Ptr Word32 -> Int -> IO ()
foo p i | p `seq` i `seq` False = undefined
foo _ 0 = return ()
foo p n
= do x <- peek p
poke p (shiftL x 24 .|. shiftL (x .&. 0xff00) 8
.|. (shiftR x 8 .&. 0xff00)
.|. shiftR x 24)
foo (p `advancePtr` 1) (n - 1)
-----8<----------8<----------8<----------8<-----
However, against this equally simple C code it doesn't fair too well:
-----8<----------8<----------8<----------8<-----
#include

Ian Lynagh wrote:
Hi all,
I was under the impression that simple code like the below, which swaps the endianness of a block of data, ought to be near C speed:
[...] poke p (shiftL x 24 .|. shiftL (x .&. 0xff00) 8 .|. (shiftR x 8 .&. 0xff00) .|. shiftR x 24) [...]
The problem here is that the shiftL and shiftR operations don't get inlined properly. They get replaced by a call to shift, but that doesn't get inlined. The shift function also wastes some more time by checking the sign of the shift amount. A few well-placed INLINE pragmas in the libraries might help.
Is there anything I can do to get better performance in this sort of code without resorting to calling out to C?
You could import some private GHC modules and use the primop directly: import GHC.Prim import GHC.Word main :: IO () main = do p <- mallocArray 104857600 foo p 104857600 shiftL (W32# a) (I# b) = W32# (shiftL# a b) shiftR (W32# a) (I# b) = W32# (shiftRL# a b) Using those instead of the standard ones speeds up the program a lot; be aware however that you shouldn't use negative shift amounts with those (undefined result, no checking). Cheers, Wolfgang

wolfgang.thaller:
Ian Lynagh wrote:
Hi all,
I was under the impression that simple code like the below, which swaps the endianness of a block of data, ought to be near C speed:
[...] poke p (shiftL x 24 .|. shiftL (x .&. 0xff00) 8 .|. (shiftR x 8 .&. 0xff00) .|. shiftR x 24) [...]
The problem here is that the shiftL and shiftR operations don't get inlined properly. They get replaced by a call to shift, but that doesn't get inlined. The shift function also wastes some more time by checking the sign of the shift amount. A few well-placed INLINE pragmas in the libraries might help.
Is there anything I can do to get better performance in this sort of code without resorting to calling out to C?
You could import some private GHC modules and use the primop directly:
import GHC.Prim import GHC.Word
main :: IO () main = do p <- mallocArray 104857600 foo p 104857600
shiftL (W32# a) (I# b) = W32# (shiftL# a b) shiftR (W32# a) (I# b) = W32# (shiftRL# a b)
Using those instead of the standard ones speeds up the program a lot; be aware however that you shouldn't use negative shift amounts with those (undefined result, no checking).
Rewriting the bit twiddling code to use the unboxed primops will generate almost identical C code (in GHC, at least) to that which you would write yourself. You'll actually get the C ops generated if you do all the operations on Word#, instead of Int#: Here's a bit of ridiculously optimised code in yi that I wrote to look at how to get straight C code out of the backend: (I# (word2Int# ((int2Word# i `and#` int2Word# 0xffff#) `remWord#` int2Word# 52#))) This code is literally compiled to (-ddump-realC): ... _s6hk_ = (StgWord)((I_)(R1.p[1])); _s6hn_ = _s6hk_ & 0xffff; _s6hq_ = _s6hn_ % 0x34; _s6ht_ = (StgInt)(_s6hq_); ... -- Don
participants (3)
-
dons@cse.unsw.edu.au
-
Ian Lynagh
-
Wolfgang Thaller