
I have re-written the sha1 code so that it is (hopefully) easy to see that it faithfully implements the algorithm (see http://www.itl.nist.gov/fipspubs/fip180-1.htm). Having got rid of the space leak, I have been trying to improve performance. Currently, the haskell code is 2 orders of magnitude slower than the sha1sum that ships with my linux.
dom@heisenberg:~/sha1/testdist/sha1> time ./perfTest perfTest c7eae62ddabb653bb9ce4eb18fa8b94264f92a76 Success
real 0m2.152s user 0m2.112s sys 0m0.028s dom@heisenberg:~/sha1/testdist/sha1> time sha1sum perfTest c7eae62ddabb653bb9ce4eb18fa8b94264f92a76 perfTest
real 0m0.057s user 0m0.008s sys 0m0.004s
I've played around with profiling and doubled the performance of the haskell code but I'm nowhere near the C performance.
Sun Mar 11 19:32 2007 Time and Allocation Profiling Report (Final)
perfTest +RTS -p -RTS eg
total time = 6.75 secs (135 ticks @ 50 ms) total alloc = 1,483,413,752 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
oneBlock Data.Digest.SHA1 39.3 40.1 $& Data.Digest.SHA1 20.7 21.6 f Data.Digest.SHA1 13.3 6.2 getWord32s Data.Digest.SHA1 7.4 6.6 test2 Main 5.9 8.7 blockWord8sIn32 Data.Digest.SHA1 5.2 5.3 blockWord8sIn512 Data.Digest.SHA1 3.0 4.4 pad Data.Digest.SHA1 1.5 3.5 k Data.Digest.SHA1 1.5 0.0 fromBytes Data.Digest.SHA1 0.0 3.5
Here's the code that is taking the majority of the time.
($&) :: [Word32] -> [Word32] -> [Word32] a $& b = zipWith (+) a b
-- Word128 -> Word512 -> Word128 oneBlock ss xs = Word128 (as!!80) (bs!!80) (cs!!80) (ds!!80) (es!!80) where ws = xs ++ map (rotL 1) (zipWith4 xxxor wm3s wm8s wm14s ws) where xxxor a b c d = a `xor` b `xor` c `xor` d wm3s = drop (16-3) ws wm8s = drop (16-8) ws wm14s = drop (16-14) ws as = ai:ts bs = bi:as cs = ci:(map (rotL 30) bs) ds = di:cs es = ei:ds ts = (map (rotL 5) as) $& (zipWith4 f [0..] bs cs ds) $& es $& (map k [0..]) $& ws Word128 ai bi ci di ei = ss
Any help would be appreciated. I've put a copy of a working system here if anyone wants to experiment (http://www.haskell.org/crypto/downloads/sha1.tar.gz). Thanks, Dominic.

On Sunday 11 March 2007 20:46, Stefan O'Rear wrote:
On Sun, Mar 11, 2007 at 08:18:44PM +0000, Dominic Steinitz wrote:
Word128 ai bi ci di ei = ss
128 is not divisible by 5. You should probably rename that type :)
Stefan I must have been thinking of MD5. Yes Word160 would be better. Dominic.

Dominic Steinitz wrote:
Any help would be appreciated.
I notice that there's not much user-accessible documentation of what you can expect GHC (or some other Haskell implementation) to do and not do with a given piece of code. For example, you have a lot of little definitions that clearly traverse the same lists many times. I've no idea where I would look, except for the compiler source, to get a sense for when, if ever, the compiler might apply CSE, fusion, or any other techniques that come to mind. So transmitting folk wisdom on what "the compiler" might do with any given piece of code counts as another half chapter in the "Practical Haskell" book that ought to get written :-)

Hi
I notice that there's not much user-accessible documentation of what you can expect GHC (or some other Haskell implementation) to do and not do with a given piece of code.
Yhc/nhc/Hugs - nothing GHC - inlining, simplification, fusion if you use the build in functions in a specified way, strictness analysis, maybe some unboxing, let movement, some other tricks JHC - everything The problem is that things like strictness are a static analysis which is undecidable in general. Static analysis is probably too complex for the user to figure out what is going on, its much better to look at the Core generated and see what actually happened. Thanks Neil

Dominic Steinitz wrote:
I have re-written the sha1 code so that it is (hopefully) easy to see that it faithfully implements the algorithm (see http://www.itl.nist.gov/fipspubs/fip180-1.htm). Having got rid of the space leak, I have been trying to improve performance.
Currently, the haskell code is 2 orders of magnitude slower than the sha1sum that ships with my linux.
Here's the code that is taking the majority of the time.
($&) :: [Word32] -> [Word32] -> [Word32] a $& b = zipWith (+) a b
-- Word128 -> Word512 -> Word128 oneBlock ss xs = Word128 (as!!80) (bs!!80) (cs!!80) (ds!!80) (es!!80) where ws = xs ++ map (rotL 1) (zipWith4 xxxor wm3s wm8s wm14s ws) where xxxor a b c d = a `xor` b `xor` c `xor` d wm3s = drop (16-3) ws wm8s = drop (16-8) ws wm14s = drop (16-14) ws as = ai:ts bs = bi:as cs = ci:(map (rotL 30) bs) ds = di:cs es = ei:ds ts = (map (rotL 5) as) $& (zipWith4 f [0..] bs cs ds) $& es $& (map k [0..]) $& ws Word128 ai bi ci di ei = ss
Any help would be appreciated.
This code is clean Haskell without algorithmic flaws, optimizing it means to scrape the constant factor off. Of course, de- and constructing those lazy lists is quite expensive and the canonical answer is: deforestation, also known as "fusion". The goal is to avoid building intermediate lists like if they get consumed at some point, here by (!! 80). This is like transforming the factorial or the fibonacci numbers fac n = facs !! n where facs = 1 : zipWith (*) facs [1..] fib n = fibs !! n where fibs = 1:1: zipwWith (+) fibs (tail fibs) to their accumulating cousins fac n = fac' n 1 where fac' 0 x = x fac' !n !x = fac' (n-1) (x*n) fib n = fib' n (1,1) where fib' 0 ( x, y) = x fib' !n (!x,!y) = fib' (n-1) (y,x+y) The algorithm splits in two parts: calculating ws and accumulating the quintuple a,b,c,d,e over it. Fusing the quintuple is straightforward and already suggested by the imperative specification on the website you mentioned (use with -fbang-patterns): oneBlock ss xs = foldl' g (0,ss) ws where ws = ... g (!n,!Word160 a b c d e) w = (n+1, Word160 (rotL 5 a + f n b c d + e + w + k n) a (rotL 30 b) c d)) Together with -funbox-strict-fields, GHC should be able to generate decent assembly from that. Fusing the ws is trickier. Directly appealing to the fibonacci-number example is not recommended because this would mean to keep the last 16 ws in memory and shifting them right to left by hand. But as the "Alternate method of computation" on the website suggests, you can delegate the shifting to an index that shifts around mod 16. Having a mutable array is helpful, then. Of course, you can also fill a large static (boxed!) array of 80 Word8s via ws :: Data.Array.IArray.Array Int Word8 ws = accumArray 0 (0,80) (curry snd) $ zip [0..15] xs ++ [(i, xxor i) | i<-[16..80]] where xxor i = ws ! (i-16) `xor` ws ! (i-3) `xor` ws ! (i-8) `xor` ws ! (i-14) or something like that (I didn't care about correct indices and bounds). GHC can fuse such array accumulations. In general, keeping stuff in lists is not wrong, but ByteStrings are more adapted to current CPU and RAM architecture. Regards, apfelmus

On Sun, Mar 11, 2007 at 08:18:44PM +0000, Dominic Steinitz wrote:
I have re-written the sha1 code so that it is (hopefully) easy to see that it faithfully implements the algorithm (see http://www.itl.nist.gov/fipspubs/fip180-1.htm). Having got rid of the space leak, I have been trying to improve performance.
Currently, the haskell code is 2 orders of magnitude slower than the sha1sum that ships with my linux.
I don't know if this is useful to you, but darcs has some SHA1 code that IIRC is much closer to C's performance. It currently uses darcs' own FastPackedString library, but porting it to ByteString should be fairly easy. See SHA1.lhs in http://www.abridgegame.org/repos/darcs-unstable It might even be able to be made faster still by calling lower-level functions than {shift,rotate}{L,R} directly. Thanks Ian

On Sun, Mar 11, 2007 at 08:18:44PM +0000, Dominic Steinitz wrote:
I have re-written the sha1 code so that it is (hopefully) easy to see
Ian Lynagh wrote: that it
faithfully implements the algorithm (see http://www.itl.nist.gov/fipspubs/fip180-1.htm). Having got rid of the space leak, I have been trying to improve performance.
Currently, the haskell code is 2 orders of magnitude slower than the sha1sum that ships with my linux.
I don't know if this is useful to you, but darcs has some SHA1 code that IIRC is much closer to C's performance. It currently uses darcs' own FastPackedString library, but porting it to ByteString should be fairly easy.
See SHA1.lhs in http://www.abridgegame.org/repos/darcs-unstable
It might even be able to be made faster still by calling lower-level functions than {shift,rotate}{L,R} directly.
I ended up deciding to call SHA1 routines out of openssl. For applications where this is possible, it does very well, I got about 2.5 times the speed out of it, compared to ordinary C implementations. But, since harchive spends most of its CPU computing SHA1 hashes (and almost all of the rest in zlib), it is worth a complex binding there. Dave

On Friday 16 March 2007 21:29, David Brown wrote:
Ian Lynagh wrote:
On Sun, Mar 11, 2007 at 08:18:44PM +0000, Dominic Steinitz wrote:
I have re-written the sha1 code so that it is (hopefully) easy to see
that it
faithfully implements the algorithm (see http://www.itl.nist.gov/fipspubs/fip180-1.htm). Having got rid of the
space
leak, I have been trying to improve performance.
Currently, the haskell code is 2 orders of magnitude slower than the
sha1sum
that ships with my linux.
I don't know if this is useful to you, but darcs has some SHA1 code that IIRC is much closer to C's performance. It currently uses darcs' own FastPackedString library, but porting it to ByteString should be fairly easy.
See SHA1.lhs in http://www.abridgegame.org/repos/darcs-unstable
It might even be able to be made faster still by calling lower-level functions than {shift,rotate}{L,R} directly.
I ended up deciding to call SHA1 routines out of openssl. For applications where this is possible, it does very well, I got about 2.5 times the speed out of it, compared to ordinary C implementations.
But, since harchive spends most of its CPU computing SHA1 hashes (and almost all of the rest in zlib), it is worth a complex binding there.
Dave Ian, Dave,
Thanks. My goal is to have natural haskell code that's reasonably efficient. I don't have a problem to solve that needs an efficient implementation of SHA1. I'm going to try apfelmus' suggestions next and then (if I ever get yhc to build) start looking at what gets generated. Dominic.

On Friday 16 March 2007 21:24, Ian Lynagh wrote:
On Sun, Mar 11, 2007 at 08:18:44PM +0000, Dominic Steinitz wrote:
I have re-written the sha1 code so that it is (hopefully) easy to see that it faithfully implements the algorithm (see http://www.itl.nist.gov/fipspubs/fip180-1.htm). Having got rid of the space leak, I have been trying to improve performance.
Currently, the haskell code is 2 orders of magnitude slower than the sha1sum that ships with my linux.
I don't know if this is useful to you, but darcs has some SHA1 code that IIRC is much closer to C's performance. It currently uses darcs' own FastPackedString library, but porting it to ByteString should be fairly easy.
See SHA1.lhs in http://www.abridgegame.org/repos/darcs-unstable
It might even be able to be made faster still by calling lower-level functions than {shift,rotate}{L,R} directly.
Thanks Ian Ian,
Thanks. I'm trying to build just SHA1 but I am getting the following linker errors. Do you know what option I should be adding? Dominic. dom@heisenberg:~/sha11> ghc -o perfTest perfTest.hs -iIgloo/darcs-unstable/src --make -lz Linking perfTest ... Igloo/darcs-unstable/src/FastPackedString.o: In function `r4Lk_info': (.text+0x34c): undefined reference to `utf8_to_ints' Igloo/darcs-unstable/src/FastPackedString.o: In function `r4Lm_info': (.text+0x370): undefined reference to `first_nonwhite' Igloo/darcs-unstable/src/FastPackedString.o: In function `r4Lo_info': (.text+0x430): undefined reference to `first_white' Igloo/darcs-unstable/src/FastPackedString.o: In function `r4Lq_info': (.text+0x4f0): undefined reference to `has_funky_char' Igloo/darcs-unstable/src/FastPackedString.o: In function `r4LE_info': (.text+0x848): undefined reference to `my_mmap' Igloo/darcs-unstable/src/FastPackedString.o: In function `r4LM_info': (.text+0x8dc): undefined reference to `conv_to_hex' Igloo/darcs-unstable/src/FastPackedString.o: In function `r4LO_info': (.text+0x904): undefined reference to `conv_from_hex' collect2: ld returned 1 exit status

On Sat, Mar 24, 2007 at 01:46:33PM +0000, Dominic Steinitz wrote:
Thanks. I'm trying to build just SHA1 but I am getting the following linker errors. Do you know what option I should be adding?
dom@heisenberg:~/sha11> ghc -o perfTest perfTest.hs -iIgloo/darcs-unstable/src --make -lz Linking perfTest ... Igloo/darcs-unstable/src/FastPackedString.o: In function `r4Lk_info': (.text+0x34c): undefined reference to `utf8_to_ints'
You need to link with fpstring.o (which in turn you get by compiling fpstring.c). Thanks Ian
participants (7)
-
apfelmus@quantentunnel.de
-
Bryan O'Sullivan
-
David Brown
-
Dominic Steinitz
-
Ian Lynagh
-
Neil Mitchell
-
Stefan O'Rear