
Andrew Coppin wrote:
If anybody has any interesting insights as to why my version is so damned slow, I'd like to hear them.
OK, so I am now officially talking to myself. Profiling indicates that 60% of my program's time is spent inside one function:

    make_block_bs :: ByteString -> Block

In other words, the program spends 60% of its running time converting an array of Word8 into an array of Word32. [Note that the MD5 algorithm incorrectly specifies that little-endian ordering should be used, so I do the byte-swapping in here too.]

The code for breaking the file into correctly-sized pieces and adding padding is a full 3% of the runtime, so it's not even the copying overhead. It's *all* in the array conversions. How much do you want to bet that the C version just reads a chunk of data into RAM and then does a simple type-cast? (If I'm not very much mistaken, in C a type-cast is an O(0) operation.)

Anyway, wading through the 3-page Core output for this function, I'm seeing a lot of stuff of the form

    case GHC.Prim.<=# 0 whuhrjhsd of wild3784_hguhse {
      GHC.Base.False ->
        (GHC.Err.error @ GHC.Base.Int MD5.Types.lvl `cast` (CoUnsafe ...

I'd be lying if I said I had any real idea what that does. But it occurs to me that the code is probably performing runtime checks that every single array access is within bounds, and that every single bit-shift operation is similarly within bounds. This is obviously very *safe*, but possibly not very performant.

Isn't there some function kicking around somewhere for accessing arrays without the bounds check? Similarly, I'm sure I remember the Data.Bits bounds check being mentioned before, and somebody saying they had [or were going to?] make an unsafe variant available. I've looked around the library documentation and I'm not seeing anything... any hints? (Obviously I could attempt to dig through GHC.Prim and find what I'm after there, but... it looks pretty scary in there! I'm hoping for something slightly higher-level if possible.)

The alternative, obviously, is to just perform an evil type cast. Trouble is, that'll work just fine on any architecture with a little-endian byte order [conspicuously IA32 and AMD64], and malfunction horribly on any architecture that works correctly. (At which point I start wondering how C manages to overcome this problem...)
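
For illustration, here is a minimal sketch of the kind of conversion in question -- not the original make_block_bs, whose body isn't shown above, and the name word32LE is made up here -- assembling a little-endian Word32 from four bytes of a strict ByteString. It uses unsafeIndex from Data.ByteString.Unsafe, which skips the per-byte bounds check:

    import Data.Bits (shiftL, (.|.))
    import Data.Word (Word32)
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Unsafe as BU

    -- Read the Word32 starting at byte offset n, least significant
    -- byte first. No bounds check is performed, so the caller must
    -- guarantee that n+3 is within the ByteString.
    word32LE :: B.ByteString -> Int -> Word32
    word32LE bs n =
          fromIntegral (BU.unsafeIndex bs  n     )
      .|. fromIntegral (BU.unsafeIndex bs (n + 1)) `shiftL` 8
      .|. fromIntegral (BU.unsafeIndex bs (n + 2)) `shiftL` 16
      .|. fromIntegral (BU.unsafeIndex bs (n + 3)) `shiftL` 24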
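
On the bounds-check question: the array package does export unchecked accessors, though they live in the internal Data.Array.Base module rather than the documented ones. A small sketch (the array contents here are made up for illustration):

    import Data.Array.Base (unsafeAt)
    import Data.Array.Unboxed (UArray, listArray)
    import Data.Word (Word32)

    demo :: Word32
    demo = arr `unsafeAt` 2   -- raw Int offset, no bounds check; yields 30
      where
        arr :: UArray Int Word32
        arr = listArray (0, 3) [10, 20, 30, 40]

Data.Array.Base also provides unsafeRead and unsafeWrite for mutable arrays such as IOUArray and STUArray. As for the shift check, newer versions of base add unsafeShiftL and unsafeShiftR to Data.Bits, which skip the test on the shift amount.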
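
As for how C overcomes the endianness problem: portable C implementations typically read the word straight out of memory and byte-swap only when the host is big-endian, with the decision usually made at compile time from the target's byte order. Below is a rough Haskell sketch of the same idea, detecting endianness at runtime instead. The names littleEndian, word32At and swap32 are made up here, and note that the unaligned peek is fine on IA32/AMD64 but can fault on stricter architectures:

    import Data.Bits ((.&.), (.|.), shiftL, shiftR)
    import Data.Word (Word8, Word32)
    import Foreign.Marshal.Alloc (alloca)
    import Foreign.Ptr (castPtr)
    import Foreign.Storable (peek, poke, peekByteOff)
    import System.IO.Unsafe (unsafePerformIO)
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Unsafe as BU

    -- True on little-endian hosts: store a known Word32 and look at
    -- its lowest-addressed byte. The unsafePerformIO is benign here,
    -- since the answer is a constant for a given machine.
    littleEndian :: Bool
    littleEndian = unsafePerformIO $ alloca $ \p -> do
        poke p (0x01020304 :: Word32)
        b <- peekByteOff p 0 :: IO Word8
        return (b == 0x04)

    -- Reinterpret the first four bytes of a ByteString as a Word32
    -- (the "evil type cast"), then normalise to the little-endian
    -- order that MD5 expects.
    word32At :: B.ByteString -> IO Word32
    word32At bs = BU.unsafeUseAsCString bs $ \cp -> do
        w <- peek (castPtr cp)
        return (if littleEndian then w else swap32 w)

    -- Shift-and-mask byte reversal for the big-endian case.
    swap32 :: Word32 -> Word32
    swap32 w =  (w `shiftR` 24)
            .|. ((w `shiftR` 8) .&. 0x0000FF00)
            .|. ((w `shiftL` 8) .&. 0x00FF0000)
            .|. (w `shiftL` 24)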