SHA1 implementation that's "only" 12 times slower than C

So I tried implementing a more efficient SHA1 in Haskell, and I got to about 12 times slower than C. The darcs implementation is also around 10 to 12 times slower, and the crypto one is about 450 times slower. I haven't yet unrolled the loop like the darcs implementation does, so I can still get some improvement from that, but I want that to be the last thing I do.

I think I've been getting speed improvements by minimizing unnecessary allocations. I went from 40 times slower to 12 times slower by converting a foldM to a mapM that modifies a mutable array.

Does anyone have any pointers on how to get hashElem and updateElem to run faster, or any insight into what exactly they are allocating? To me it seems that those functions should be able to do everything they need to without a malloc.

These are the profiling statistics generated from my implementation:

COST CENTRE     MODULE  %time  %alloc
hashElem        SHA1     42.9    66.2
updateElem      SHA1     12.7    16.7
unboxW          SHA1     10.6     0.0
hashA80         SHA1      5.2     0.3
temp            SHA1      4.6     0.0
sRotateL        SHA1      4.6     0.0
ffkk            SHA1      3.3     2.6
hashA16IntoA80  SHA1      3.1     0.1
sXor            SHA1      2.9     0.0
do60            SHA1      2.9     2.6
sAdd            SHA1      2.3     0.0
do20            SHA1      1.3     2.6
splitByN        SHA1      1.2     2.3
do80            SHA1      0.8     2.6
do40            SHA1      0.4     2.6

Thanks,
Anatoly
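The foldM-to-mutable-array change Anatoly describes can be sketched on a toy example; the names sumFold and sumMut below are illustrative, not from his code:

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newArray, readArray, writeArray)
import Data.Word (Word32)

-- Pure fold: every step threads a fresh (possibly boxed) accumulator.
sumFold :: [Word32] -> Word32
sumFold = foldl (+) 0

-- Mutable version: a single unboxed cell updated in place, in the
-- spirit of the foldM -> mapM-over-a-mutable-array rewrite above.
sumMut :: [Word32] -> Word32
sumMut xs = runST $ do
  acc <- newArray (0, 0) 0 :: ST s (STUArray s Int Word32)
  forM_ xs $ \x -> do
    a <- readArray acc 0
    writeArray acc 0 (a + x)
  readArray acc 0

main :: IO ()
main = print (sumMut [1 .. 10])  -- 55, same as sumFold [1 .. 10]
```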

aeyakovenko:
[snip]
Try inlining key small functions, and check the core:

  -O2 -ddump-simpl | less

-- Don
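Don's advice, concretely: mark the small leaf functions with INLINE pragmas, then read the simplifier output. A minimal sketch, where this sRotateL merely stands in for the profiled cost centre of the same name:

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Bits (rotateL)
import Data.Word (Word32)

-- A tiny leaf function: a prime candidate for inlining.
{-# INLINE sRotateL #-}
sRotateL :: Word32 -> Int -> Word32
sRotateL !w !n = w `rotateL` n

main :: IO ()
main = print (sRotateL 0x80000000 2)  -- the high bit wraps around: 2
```

Compiling with -O2 -ddump-simpl should then show the rotate operation appearing directly at call sites rather than a call to sRotateL.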

Inlining some of the functions definitely gave me a boost, so I am now
about 8.5 times slower than openssl sha1sum. I don't really understand
the core output, but after inlining I got a completely different
profile output; I am guessing it's because the cost of the inlined
functions is spread to the callers.
COST CENTRE     MODULE  %time  %alloc
updateElem      SHA1     13.4     0.0
sRotateL        SHA1     13.4     0.0
hashElem        SHA1     12.5     0.0
sXor            SHA1     10.9     0.0
unboxW          SHA1     10.0     0.0
temp            SHA1      8.1     0.0
sAdd            SHA1      7.8     0.0
sAnd            SHA1      5.0     0.0
do20            SHA1      4.1    18.0
hashA16IntoA80  SHA1      2.8     0.9
do60            SHA1      2.5    18.0
splitByN        SHA1      2.2    15.6
ffkk            SHA1      2.2     0.0
sOr             SHA1      1.6     0.0
do40            SHA1      0.9    18.0
hashPtrIntoA80  SHA1      0.6     2.7
hashA80         SHA1      0.6     1.8
do80            SHA1      0.6    18.0
joinTail        SHA1      0.0     2.1
main            Main      0.0     4.8
On 6/30/07, Donald Bruce Stewart wrote:
Try inlining key small functions, and check the core.
-O2 -ddump-simpl | less
-- Don

On 03/07/07, Anatoly Yakovenko wrote:
[snip]
Are you using -auto, or -auto-all? Because it makes a difference to the generated core, and the extent to which inlining takes place. I've noticed that -auto permits more inlining than -auto-all, so try -auto if you can.

Also, follow the advice in the GHC manual, and only export the functions you need to. This will aid both the inliner and specialiser enormously.

As for reading core (well, actually simplifier output; core has less "punctuation"), these links might help:

4.16.3. How to read Core syntax
http://www.haskell.org/ghc/docs/latest/html/users_guide/options-debugging.ht...
(and the Encoding module has the actual rules for the Unique names:
http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs)

6.2. Faster: producing a program that runs quicker
http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html
(see "How do I find out a function's strictness?")

Alistair
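Alistair's export-list point in miniature: export only the entry point, and every helper becomes local to the module. A hedged sketch; the module name, sha1, and the helper bodies here are hypothetical, modelled loosely on the cost centres in the profile:

```haskell
-- Only the entry point is exported; the helpers stay local, so GHC is
-- free to inline and specialise them across the whole module.
module SHA1 (sha1) where

import Data.Bits (rotateL, xor)
import Data.Word (Word32)

-- unexported helpers: the inliner can see all of their call sites
sRotateL :: Word32 -> Int -> Word32
sRotateL = rotateL

sXor :: Word32 -> Word32 -> Word32
sXor = xor

-- hypothetical entry point, standing in for the real hash
sha1 :: [Word32] -> Word32
sha1 = foldr (\w acc -> sRotateL (w `sXor` acc) 5) 0
```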

Are you using -auto, or -auto-all? Because it makes a difference to the generated core, and the extent to which inlining takes place. I've noticed that -auto permits more inlining than -auto-all, so try -auto
-auto doesn't generate any meaningful profiling info for me.
if you can. Also, follow the advice in the GHC manual, and only export the functions you need to. This will aid both the inliner and specialiser enormously.
Cool, this actually helped quite a bit; now only 7.5 times slower :)
As for reading core (well, actually simplifier output; core has less "punctuation"), these links might help:
4.16.3. How to read Core syntax http://www.haskell.org/ghc/docs/latest/html/users_guide/options-debugging.ht...
(and the Encoding module has the actual rules for the Unique names) http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs
6.2. Faster: producing a program that runs quicker http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html (see "How do I find out a function's strictness?")
Thanks for the tip, I'll take a look at those.

aeyakovenko:
inlining some of the functions definitely gave me a boost, so i am about 8.5 times slower then openssl sha1sum. I dont really understand the core output, but after inlining i got a completely different profile output, i am guessing its because the cost of the inlined functions is spread to the callers.
COST CENTRE  MODULE  %time  %alloc
updateElem   SHA1     13.4     0.0
sRotateL     SHA1     13.4     0.0
hashElem     SHA1     12.5     0.0
sXor         SHA1     10.9     0.0
unboxW       SHA1     10.0     0.0
So I'd now dive in and seriously look at the Core for these guys. Work out what they're doing, and how they differ from the C version.

-- Don

Hello Anatoly, Sunday, July 1, 2007, 3:58:24 AM, you wrote:
Anyone have any pointers on how to get hashElem and updateElem to run faster, or any insight on what exactly they are allocating. To me it seems that those functions should be able to do everything they need to without a malloc.
Haskell allocation isn't a malloc; it's just a pointer increment, so it's very fast. Any temporary data created in Haskell code needs to be allocated, so the only case in which you have no allocations is a loop over unboxed values.

In your particular case you should try the following trick:

  aa <- unsafeRead a5 0
  return $! aa
  bb <- unsafeRead a5 1
  return $! bb

Currently, your code implies that unsafeRead may return a boxed value. 'let' by itself doesn't enforce unboxing; the compiler assumes that a value bound in 'let' may actually not be used. You can use either 'case' or the above-mentioned trick with '$!' (or seq) to avoid boxing.

--
Best regards,
Bulat    mailto:Bulat.Ziganshin@gmail.com
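Bulat's trick, spelled out in runnable form. The array a5 and its contents are placeholders, and the safe readArray is used here in place of unsafeRead (same shape, with bounds checking); the point is that each value is forced with ($!) the moment it is bound, so it cannot survive as a boxed thunk:

```haskell
import Data.Array.IO (IOUArray, newListArray, readArray)
import Data.Word (Word32)

-- Read two elements, forcing each one immediately after the read.
readTwo :: IO Word32
readTwo = do
  a5 <- newListArray (0, 4) [10, 20, 30, 40, 50] :: IO (IOUArray Int Word32)
  aa  <- readArray a5 0
  aa' <- return $! aa   -- force before further use
  bb  <- readArray a5 1
  bb' <- return $! bb
  return (aa' + bb')

main :: IO ()
main = print =<< readTwo  -- 30
```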

Just an outsider's reaction, and for all I know unsafeRead is actually safe, but... if the point of using Haskell (and I'm still trying to discover what that is ;-) ) is either to be able to rigorously mathematically prove that your program will work perfectly (target usage 1), or to carry out threading easily and automatically (what I'm specifically interested in), then why do we have to throw unsafe functions around the place to get any decent speed? Ok, I'll go back to my hole; just a reaction. I know everything has to start somewhere and build up; just be aware that having to use unsafe stuff to get decent speed is not good PR ;-)

So using mseq didn't seem to make any difference; I still had the same
performance.
On 7/1/07, Benja Fallenstein wrote:
Hi,
2007/7/1, Bulat Ziganshin
  aa <- unsafeRead a5 0
  return $! aa
  bb <- unsafeRead a5 1
  return $! bb
If this is a useful pattern, would it make sense to have a function to encapsulate it?
mseq :: Monad m => m a -> m a
mseq m = m >>= (return $!)
- Benja
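Benja's mseq typechecks as written; here it is in a minimal runnable form (the computation being forced is purely illustrative):

```haskell
-- Force the result of a monadic action to WHNF before binding it.
mseq :: Monad m => m a -> m a
mseq m = m >>= (return $!)

main :: IO ()
main = do
  x <- mseq (return (2 + 3 :: Int))  -- the sum is forced before x is bound
  print x  -- 5
```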

Anatoly Yakovenko wrote:
[snip]
I've been meaning to reply to this for some time but work and domestic duties have intervened.

1. Very good.

2. It has type

   hash :: BS.ByteString -> IO [Word]

   but hash is a pure function. Can this be changed?

3. I haven't tried, but I assume it only runs with ghc and not hugs?

I guess if point 1 could be addressed then we could put it in the crypto library (assuming you are happy with this) with some sort of conditional flag to use your code if the library is being built for ghc, but to use the slow version for other compilers / interpreters.

On a more discursive note, I still don't think we have found the holy grail here: idiomatic functional programming (probably not using StorableArray and unsafeRead and unsafeWrite) and lightning fast speed.

Dominic.

PS I noticed you have:

  splitByN :: Int -> BS.ByteString -> [BS.ByteString]
  splitByN nn ll = st : (if BS.null en then [] else splitByN nn en)
    where (st, en) = BS.splitAt nn ll

It's a function I often use:

  splitByN n = unfoldr k
    where k [] = Nothing
          k p  = Just (splitAt n p)

Maybe it should be a standard part of List and ByteString?
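Dominic's unfoldr version runs as-is on lists; a self-contained sketch (the ByteString variant would test BS.null instead of matching on []):

```haskell
import Data.List (unfoldr)

-- Split a list into chunks of at most n elements.
splitByN :: Int -> [a] -> [[a]]
splitByN n = unfoldr k
  where
    k [] = Nothing          -- no input left: stop unfolding
    k p  = Just (splitAt n p)  -- emit one chunk, keep the rest

main :: IO ()
main = print (splitByN 3 [1 .. 8 :: Int])  -- [[1,2,3],[4,5,6],[7,8]]
```

Note one small behavioural difference: on empty input this version yields [], whereas the recursive version above yields a single empty chunk.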
participants (7)
- Alistair Bayley
- Anatoly Yakovenko
- Benja Fallenstein
- Bulat Ziganshin
- Dominic Steinitz
- dons@cse.unsw.edu.au
- Hugh Perkins