
On Sunday, 22 April 2007 15:06, Dominic Steinitz wrote:
I've been playing around some more, trying to improve the performance of the SHA1 implementation in the crypto library. I've isolated one of the functions and implemented it using
a) unfold
and
b) STUArray
The STUArray implementation is about twice as fast, but I was expecting an order of magnitude improvement, given that I thought I would be allocating 16 x 80 new 32-bit words with unfold but none with the STUArray.
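[For orientation: the unfold version being measured (the v1 mentioned later in the thread) isn't quoted in full. It presumably has roughly the following shape - this is a sketch under that assumption, with rotL taken to be a plain left rotate and a 16-word input expected:]

```haskell
import Data.Bits (xor, rotateL)
import Data.List (unfoldr)
import Data.Word (Word32)

-- Assumed definition: rotate left by n bits.
rotL :: Int -> Word32 -> Word32
rotL n w = rotateL w n

-- Sketch of the unfold-based SHA-1 message-schedule expansion:
-- each step emits the head of a 16-word sliding window and appends
-- w16 = rotL 1 (w0 `xor` w2 `xor` w8 `xor` w13) to the window.
v1 :: [Word32] -> [Word32]
v1 xs = take 80 (unfoldr h xs)
  where
    h ws@(w0:rest) =
        let w16 = rotL 1 (w0 `xor` (ws !! 2) `xor` (ws !! 8) `xor` (ws !! 13))
        in Just (w0, rest ++ [w16])
    h [] = Nothing
```

This expands the 16 input words into the 80-word schedule; each of the 80 steps allocates a fresh list cell and a thunk for the appended word, which is where the expected allocation cost comes from.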
How do you calculate that? I'd think that in the unfolded function (h) you'd refer to 15 Word32s you don't use for each step, so that would be roughly a 20% overhead; the tail of h's argument could be reused, so in each step one Word32 - or rather one thunk to calculate a Word32 - would be appended (if the compiler knows how to do that; otherwise there'd be a lot of copying of Word32s). Warning: I know next to nothing about computers, so I'm quite naive in my expectations.

What I've tested is whether strictifying h gives a performance gain. It does, roughly 10% on my computer:

    h [w0 .. w15] = let w16 = rotL 1 (w0 `xor` ..)
                    in w16 `seq` Just (w0, [w1 .. w16])

A further speedup results from switching from lists to a custom datatype:

    data WW = WW { f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15 :: !Word32 }

    fromL :: [Word32] -> WW
    fromL (w0:w1:w2:w3:w4:w5:w6:w7:w8:w9:w10:w11:w12:w13:w14:w15:_) =
        WW w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15

    next :: WW -> WW
    next (WW w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15) =
        WW w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15
           (rotL 1 $ w0 `xor` w2 `xor` w8 `xor` w13)

    v4 :: a -> [Word32] -> [Word32]
    v4 ss xs = take 80 $ unfoldr (\ww -> Just (f0 ww, next ww)) $ fromL xs

which is about 16% faster than your v1 here. Very odd: now this is slower than just strictifying, though I haven't touched that code. I still have the binary where v4 as above is faster than v1 (with or without added strictness), but in all more recently compiled versions it's slower than strictified v1. How can that happen?
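[The WW pieces assemble into a compilable module as follows - a sketch, not from the thread: the unused `ss` argument is dropped, `rotL` is assumed to be a plain left rotate, and a plain strictified list unfold is included so the two can be checked against each other:]

```haskell
import Data.Bits (xor, rotateL)
import Data.List (unfoldr)
import Data.Word (Word32)

-- Assumed definition: rotate left by n bits.
rotL :: Int -> Word32 -> Word32
rotL n w = rotateL w n

-- Custom 16-word window with strict fields (forced on construction).
data WW = WW { f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15 :: !Word32 }

fromL :: [Word32] -> WW
fromL (w0:w1:w2:w3:w4:w5:w6:w7:w8:w9:w10:w11:w12:w13:w14:w15:_) =
    WW w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15
fromL _ = error "fromL: need at least 16 words"

-- Shift the window left by one, appending the new schedule word.
next :: WW -> WW
next (WW w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15) =
    WW w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15
       (rotL 1 $ w0 `xor` w2 `xor` w8 `xor` w13)

v4 :: [Word32] -> [Word32]
v4 xs = take 80 $ unfoldr (\ww -> Just (f0 ww, next ww)) $ fromL xs

-- Strictified list-based unfold, for comparison.
vList :: [Word32] -> [Word32]
vList xs = take 80 (unfoldr h xs)
  where
    h ws@(w0:rest) =
        let w16 = rotL 1 (w0 `xor` (ws !! 2) `xor` (ws !! 8) `xor` (ws !! 13))
        in w16 `seq` Just (w0, rest ++ [w16])
    h [] = Nothing
```

The strict fields mean each `next` forces the new word immediately instead of building a thunk chain, which is the point of the strictification being discussed.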
The STUArray version is about 9% faster if you use unsafeRead/Write instead of read/writeArray (we know it's safe here). A further 2% I gained by a slightly different loop (we needn't use list indexing, since we copy the input in order):

    v3 ss xs = ws
      where
        ws = runST us
        us = do
            w <- newArray (0, 79) 0 :: ST s (STUArray s Int Word32)
            let put  = unsafeWrite w
                look = unsafeRead w
                loop 80 = getElems w
                loop n  = do
                    wm16 <- look (n - 16)
                    wm14 <- look (n - 14)
                    wm8  <- look (n - 8)
                    wm3  <- look (n - 3)
                    put n (rotL 1 $ wm3 `xor` wm8 `xor` wm14 `xor` wm16)
                    loop (n + 1)
                ini n []     = loop n
                ini n (x:xs) = do
                    put n x
                    ini (n + 1) xs
            ini 0 xs
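[As a sanity check - my addition, not from the thread - the STUArray loop can be compared against a straightforward self-referential list definition of the schedule. Again a sketch: `rotL` is assumed to be a plain left rotate and the unused `ss` parameter is dropped:]

```haskell
import Control.Monad.ST (ST, runST)
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST (STUArray, getElems, newArray)
import Data.Bits (xor, rotateL)
import Data.List (zipWith4)
import Data.Word (Word32)

-- Assumed definition: rotate left by n bits.
rotL :: Int -> Word32 -> Word32
rotL n w = rotateL w n

-- STUArray version: fill slots 0..15 from the input in order,
-- then compute slots 16..79 from earlier slots via unsafe reads.
v3 :: [Word32] -> [Word32]
v3 xs = runST us
  where
    us = do
        w <- newArray (0, 79) 0 :: ST s (STUArray s Int Word32)
        let put  = unsafeWrite w
            look = unsafeRead w
            loop 80 = getElems w
            loop n  = do
                wm16 <- look (n - 16)
                wm14 <- look (n - 14)
                wm8  <- look (n - 8)
                wm3  <- look (n - 3)
                put n (rotL 1 $ wm3 `xor` wm8 `xor` wm14 `xor` wm16)
                loop (n + 1)
            ini n []       = loop n
            ini n (x:rest) = do
                put n x
                ini (n + 1) rest
        ini 0 xs

-- Lazy self-referential reference: w[n] = rotL 1 (w[n-3] ^ w[n-8]
-- ^ w[n-14] ^ w[n-16]) for n >= 16.
ref :: [Word32] -> [Word32]
ref xs = take 80 ws
  where
    ws = take 16 xs ++ zipWith4 mix (drop 13 ws) (drop 8 ws) (drop 2 ws) ws
    mix a b c d = rotL 1 (a `xor` b `xor` c `xor` d)
```

The inline `ST s (STUArray s Int Word32)` annotation is the usual trick for pinning down the array type under `runST` without a scoped type variable.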
Should I have been disappointed?
Don't know, I'm somewhat disappointed that a C-version is almost three times as fast as the STUArray version. And I'm really astonished that compiled with -O0 the Unfold version is more than twice as fast as the STUArray. Phrased differently, optimising (-O2) doesn't help the Unfold version much, but speeds the STUArray version up more than four times! Cheers, Daniel
Dominic.