
Andrew Coppin wrote:
If anybody has any interesting insights as to why my version is so damned slow, I'd like to hear them.
OK, so I am now officially talking to myself. Profiling indicates that 60% of my program's time is spent inside one function:

    make_block_bs :: ByteString -> Block

In other words, the program spends 60% of its running time converting an array of Word8 into an array of Word32. [Note that the MD5 algorithm incorrectly specifies that little-endian ordering should be used, so I do the byte-swapping in here too.]

The code for breaking the file into correctly-sized pieces and adding padding is a full 3% of the runtime, so it's not even the copying overhead. It's *all* in the array conversions. How much do you want to bet that the C version just reads a chunk of data into RAM and then does a simple type-cast? (If I'm not very much mistaken, in C a type-cast is an O(0) operation.)

Anyway, wading through the 3-page Core output for this function, I'm seeing a lot of stuff of the form

    case GHC.Prim.<=# 0 whuhrjhsd of wild3784_hguhse {
      GHC.Base.False ->
        (GHC.Err.error @ GHC.Base.Int MD5.Types.lvl `cast` (CoUnsafe ...

I'd be lying if I said I had any real idea what that does. But it occurs to me that the code is probably performing runtime checks that every single array access is within bounds, and that every single bit-shift operation is similarly within bounds. This is obviously very *safe*, but possibly not very performant.

Isn't there some function kicking around somewhere for accessing arrays without the bounds check? Similarly, I'm sure I remember the Data.Bits bounds check being mentioned before, and somebody saying they had [or were going to?] make an unsafe variant available. I've looked around the library documentation and I'm not seeing anything... any hints? (Obviously I could attempt to dig through GHC.Prim and find what I'm after there, but... it looks pretty scary in there! I'm hoping for something slightly higher-level if possible.)

The alternative, obviously, is to just perform an evil type cast. Trouble is, that'll work just fine on any architecture with a little-endian byte order [conspicuously IA32 and AMD64], and malfunction horribly on any architecture that works correctly. (At which point I start wondering how C manages to overcome this problem...)
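
For illustration, here is a minimal sketch of the kind of conversion in question -- not the original make_block_bs, whose body isn't shown above, and the name word32LE is made up here -- assembling a little-endian Word32 from four bytes of a strict ByteString. It uses unsafeIndex from Data.ByteString.Unsafe, which skips the per-byte bounds check:

    import Data.Bits (shiftL, (.|.))
    import Data.Word (Word32)
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Unsafe as BU

    -- Read the Word32 starting at byte offset n, least significant
    -- byte first. No bounds check is performed, so the caller must
    -- guarantee that n+3 is within the ByteString.
    word32LE :: B.ByteString -> Int -> Word32
    word32LE bs n =
          fromIntegral (BU.unsafeIndex bs  n     )
      .|. fromIntegral (BU.unsafeIndex bs (n + 1)) `shiftL` 8
      .|. fromIntegral (BU.unsafeIndex bs (n + 2)) `shiftL` 16
      .|. fromIntegral (BU.unsafeIndex bs (n + 3)) `shiftL` 24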
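
On the bounds-check question: the array package does export unchecked accessors, though they live in the internal Data.Array.Base module rather than the documented ones. A small sketch (the array contents here are made up for illustration):

    import Data.Array.Base (unsafeAt)
    import Data.Array.Unboxed (UArray, listArray)
    import Data.Word (Word32)

    demo :: Word32
    demo = arr `unsafeAt` 2   -- raw Int offset, no bounds check; yields 30
      where
        arr :: UArray Int Word32
        arr = listArray (0, 3) [10, 20, 30, 40]

Data.Array.Base also provides unsafeRead and unsafeWrite for mutable arrays such as IOUArray and STUArray. As for the shift check, newer versions of base add unsafeShiftL and unsafeShiftR to Data.Bits, which skip the test on the shift amount.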
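
As for how C overcomes the endianness problem: portable C implementations typically read the word straight out of memory and byte-swap only when the host is big-endian, with the decision usually made at compile time from the target's byte order. Below is a rough Haskell sketch of the same idea, detecting endianness at runtime instead. The names littleEndian, word32At and swap32 are made up here, and note that the unaligned peek is fine on IA32/AMD64 but can fault on stricter architectures:

    import Data.Bits ((.&.), (.|.), shiftL, shiftR)
    import Data.Word (Word8, Word32)
    import Foreign.Marshal.Alloc (alloca)
    import Foreign.Ptr (castPtr)
    import Foreign.Storable (peek, poke, peekByteOff)
    import System.IO.Unsafe (unsafePerformIO)
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Unsafe as BU

    -- True on little-endian hosts: store a known Word32 and look at
    -- its lowest-addressed byte. The unsafePerformIO is benign here,
    -- since the answer is a constant for a given machine.
    littleEndian :: Bool
    littleEndian = unsafePerformIO $ alloca $ \p -> do
        poke p (0x01020304 :: Word32)
        b <- peekByteOff p 0 :: IO Word8
        return (b == 0x04)

    -- Reinterpret the first four bytes of a ByteString as a Word32
    -- (the "evil type cast"), then normalise to the little-endian
    -- order that MD5 expects.
    word32At :: B.ByteString -> IO Word32
    word32At bs = BU.unsafeUseAsCString bs $ \cp -> do
        w <- peek (castPtr cp)
        return (if littleEndian then w else swap32 w)

    -- Shift-and-mask byte reversal for the big-endian case.
    swap32 :: Word32 -> Word32
    swap32 w =  (w `shiftR` 24)
            .|. ((w `shiftR` 8) .&. 0x0000FF00)
            .|. ((w `shiftL` 8) .&. 0x00FF0000)
            .|. (w `shiftL` 24)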