MD5 performance optimizations, and GHC -via-C producing segfaulting binary

2008/5/17 Andrew Coppin
Hi folks.
OK, try this:
darcs get http://darcs.orphi.me.uk/MyMD5 cd MyMD5 ghc -O2 --make md5sum md5sum <some large filename>
I've got some time to take a look at the code. It's very nice, readable and declarative, but obviously not optimized for "raw speed". There're a few things to do to have a speed-up of 2x, without going "low-level". 1) There's a lazyness leak in Pad.hs. The sum in: else make_block_bs bs0 : work (n + block_size_bytes) bs1 is not strict. With a very large file (e.g. 100 mb) it means stack overflow. [roxas:~/Desktop/test2/MyMD5] kirby% ghc -O2 --make md5sum.hs [roxas:~/Desktop/test2/MyMD5] kirby% time ./md5sum ../../jboss-4.2.2.GA.zip Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize' to increase it. To solve it just add a bang (!) before the n parameter: work !n bs = You've got to add {-# LANGUAGE BangPatterns #-} at the top of the file too. There're solutions that don't imply BangPatterns, and use only seq, but I like bang patterns! Now it works: [roxas:~/Desktop/test2/MyMD5] kirby% time ./md5sum ../../jboss-4.2.2.GA.zip E6542028D538BAEC180A96A5D1E6EC3A 11.958u 0.210s 0:12.19 99.7% 0+0k 0+2io 0pf+0w 2) make_block_bs is sub-optimal, and very critical to performance. I decided to use Data.Binary for it (it's also more readable, and you get rid of the unsafeWrite): import Data.Binary import Data.Binary.Get import Control.Monad // ... instance Binary Block where put _ = undefined get = do xs <- replicateM 16 getWord32le return $ Block $ listArray (0, 15) xs make_block_bs :: B.ByteString -> Block make_block_bs = decode Now we are getting faster: [roxas:~/Desktop/test2/MyMD5] kirby% time ./md5sum ../../jboss-4.2.2.GA.zip E6542028D538BAEC180A96A5D1E6EC3A 8.063u 0.174s 0:08.23 100.0% 0+0k 0+0io 0pf+0w 3) You are doing a lot of access to fields of a strict data type (State). You can at least ask the compiler to help you a bit with -funbox-strict-fields. Now we have: [roxas:~/Desktop/test2/MyMD5] kirby% time ./md5sum ../../jboss-4.2.2.GA.zip E6542028D538BAEC180A96A5D1E6EC3A 6.780u 0.133s 0:06.91 100.0% 0+0k 0+1io 0pf+0w We have got a good, nearly 2x, speed-up with very few optimizations, and we run in a very small constant amount of memory. Probably compiling with -fvia-C could help even more, but strangely: [roxas:~/Desktop/test2/MyMD5] kirby% ghc -fvia-C -funbox-strict-fields -O2 --make md5sum.hs [roxas:~/Desktop/test2/MyMD5] kirby% time ./md5sum ../../jboss-4.2.2.GA.zip Segmentation fault Is this supposed to happen? There're no "unsafe" functions or imports used in the program. Maybe a bug in GHC? Salvatore

Thanks for taking a look at this. kirby81:
2008/5/17 Andrew Coppin
: Hi folks.
OK, try this:
darcs get http://darcs.orphi.me.uk/MyMD5 cd MyMD5 ghc -O2 --make md5sum md5sum <some large filename>
I've got some time to take a look at the code. It's very nice, readable and declarative, but obviously not optimized for "raw speed". There're a few things to do to have a speed-up of 2x, without going "low-level".
1) There's a lazyness leak in Pad.hs. The sum in: else make_block_bs bs0 : work (n + block_size_bytes) bs1 is not strict. With a very large file (e.g. 100 mb) it means stack overflow.
[roxas:~/Desktop/test2/MyMD5] kirby% ghc -O2 --make md5sum.hs
[roxas:~/Desktop/test2/MyMD5] kirby% time ./md5sum ../../jboss-4.2.2.GA.zip Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize' to increase it.
To solve it just add a bang (!) before the n parameter:
work !n bs =
You've got to add {-# LANGUAGE BangPatterns #-} at the top of the file too. There're solutions that don't imply BangPatterns, and use only seq, but I like bang patterns!
Now it works: [roxas:~/Desktop/test2/MyMD5] kirby% time ./md5sum ../../jboss-4.2.2.GA.zip E6542028D538BAEC180A96A5D1E6EC3A 11.958u 0.210s 0:12.19 99.7% 0+0k 0+2io 0pf+0w
Good idea. I saw some lazy left folds in there too, which look suspicious.
2) make_block_bs is sub-optimal, and very critical to performance. I decided to use Data.Binary for it (it's also more readable, and you get rid of the unsafeWrite): import Data.Binary import Data.Binary.Get import Control.Monad
// ...
instance Binary Block where put _ = undefined get = do xs <- replicateM 16 getWord32le return $ Block $ listArray (0, 15) xs
make_block_bs :: B.ByteString -> Block make_block_bs = decode
Now we are getting faster: [roxas:~/Desktop/test2/MyMD5] kirby% time ./md5sum ../../jboss-4.2.2.GA.zip E6542028D538BAEC180A96A5D1E6EC3A 8.063u 0.174s 0:08.23 100.0% 0+0k 0+0io 0pf+0w
Definitely a good idea.
3) You are doing a lot of access to fields of a strict data type (State). You can at least ask the compiler to help you a bit with -funbox-strict-fields.
Now we have: [roxas:~/Desktop/test2/MyMD5] kirby% time ./md5sum ../../jboss-4.2.2.GA.zip E6542028D538BAEC180A96A5D1E6EC3A 6.780u 0.133s 0:06.91 100.0% 0+0k 0+1io 0pf+0w
We have got a good, nearly 2x, speed-up with very few optimizations, and we run in a very small constant amount of memory.
Great.
Probably compiling with -fvia-C could help even more, but strangely:
[roxas:~/Desktop/test2/MyMD5] kirby% ghc -fvia-C -funbox-strict-fields -O2 --make md5sum.hs [roxas:~/Desktop/test2/MyMD5] kirby% time ./md5sum ../../jboss-4.2.2.GA.zip Segmentation fault
Is this supposed to happen? There're no "unsafe" functions or imports used in the program. Maybe a bug in GHC?
That could be a gcc bug, in fact. Can you provide the final source? Does changing the gcc optimisation level help? (i.e. -fvia-C -optc-O2 ) Andrew, I hope you look carefully at the suggestions here, and use them to improve the code you were working on. -- Don

Don Stewart wrote:
Andrew, I hope you look carefully at the suggestions here, and use them to improve the code you were working on.
Indeed, this is just the sort of stuff I was hoping to get... [Top marks for spotting the laziness leak in "pad" BTW! I totally missed that.] While I'm here, can I just clarify *exactly* what -funbox-strict-fields actually does? I was under the impression that if you have a single-constructor type with all strict fields, GHC would automatically try to avoid boxing and unboxing it. However, the existence of this flag seems to indicate that it only performs this particular optimisation if you ask for it. But on the third hand, turning this flag on or off makes no measurable performance difference. (At least, in the tests I did. Maybe that's because it was swamped by all the other inefficiencies mentioned?) Some clarification?

andrewcoppin:
Don Stewart wrote:
Andrew, I hope you look carefully at the suggestions here, and use them to improve the code you were working on.
Indeed, this is just the sort of stuff I was hoping to get... [Top marks for spotting the laziness leak in "pad" BTW! I totally missed that.]
While I'm here, can I just clarify *exactly* what -funbox-strict-fields actually does?
If in doubt, ask the user's guide: http://www.haskell.org/ghc/docs/latest/html/users_guide/pragmas.html#unpack-... -- Don

Don Stewart wrote:
andrewcoppin:
While I'm here, can I just clarify *exactly* what -funbox-strict-fields actually does?
If in doubt, ask the user's guide:
http://www.haskell.org/ghc/docs/latest/html/users_guide/pragmas.html#unpack-...
...so making a field strict causes it to be automatically forced when the constructor is forced, but doesn't actually unbox it? So there's still a level of indirection, and no register allocation?

2008/5/20 Don Stewart
Probably compiling with -fvia-C could help even more, but strangely:
[roxas:~/Desktop/test2/MyMD5] kirby% ghc -fvia-C -funbox-strict-fields -O2 --make md5sum.hs [roxas:~/Desktop/test2/MyMD5] kirby% time ./md5sum ../../jboss-4.2.2.GA.zip Segmentation fault
Is this supposed to happen? There're no "unsafe" functions or imports used in the program. Maybe a bug in GHC?
That could be a gcc bug, in fact. Can you provide the final source? Does changing the gcc optimisation level help? (i.e. -fvia-C -optc-O2 )
I tried with and without gcc optimizations, but it always segfaults. It happened to me even before my optimizations. I've got GHC 6.8.2 on Mac OS X 10.5.2, GCC 4.0.1. -fvia-C works well for other libraries and applications, so I don't think it's a problem of my specific setting. Salvatore

Woo! Salvatore kindly sent me a Darcs patch, and applying it does indeed make it run faster. Yay! [Note that -fvia-c works just fine for me. It doesn't appear to produce a huge speed difference, but it compiles just fine.] Thanks for the tips, guys! :-D The changes are in the online Darcs repo.

2008/5/21 Andrew Coppin
Woo!
Salvatore kindly sent me a Darcs patch, and applying it does indeed make it run faster. Yay!
Hi Andrew, I'm glad that -fvia-c works for you: maybe it's a Mac OS X specific bug? Anyway, did you compile with -fvia-c -optc-O3? I expect register-intensive code like this to be at least 20% faster with gcc. Salvatore

Salvatore Insalaco wrote:
I've got some time to take a look at the code. It's very nice, readable and declarative, but obviously not optimized for "raw speed". There're a few things to do to have a speed-up of 2x, without going "low-level".
You have my individed attention...
1) There's a lazyness leak in Pad.hs. The sum in: else make_block_bs bs0 : work (n + block_size_bytes) bs1 is not strict. With a very large file (e.g. 100 mb) it means stack overflow.
Oops! Well spotted...
To solve it just add a bang (!) before the n parameter:
work !n bs =
You've got to add {-# LANGUAGE BangPatterns #-} at the top of the file too. There're solutions that don't imply BangPatterns, and use only seq, but I like bang patterns!
Ah, so *that's* what bang patterns do?
2) make_block_bs is sub-optimal, and very critical to performance.
Yeah, so I see. (60% time spent here... ouch!) I'll bet that's where C is beating me...
I decided to use Data.Binary for it
I'm not familiar with that library (i.e. what it does or how you use it).
import Data.Binary import Data.Binary.Get import Control.Monad
// ...
instance Binary Block where put _ = undefined get = do xs <- replicateM 16 getWord32le return $ Block $ listArray (0, 15) xs
make_block_bs :: B.ByteString -> Block make_block_bs = decode
Mmm, OK. Doesn't look too bad...
3) You are doing a lot of access to fields of a strict data type (State). You can at least ask the compiler to help you a bit with -funbox-strict-fields.
I did try that, but it didn't seem to make any difference for me. [Maybe it does now because of your other improvements? Which version of GHC and which platform are you running on? I'm GHC 6.8.2 on Windows...]
We have got a good, nearly 2x, speed-up with very few optimizations, and we run in a very small constant amount of memory.
I'm liking the sound of that. :-D
Probably compiling with -fvia-C could help even more
...oh yeah, that's no longer default for -O2. I forgot about that!
but strangely:
Segmentation fault
Is this supposed to happen? There're no "unsafe" functions or imports used in the program. Maybe a bug in GHC?
GAH! o_O That's not good(tm)... Any chance you could email me a Darcs patch (or 2 seperate ones?) I'll add it to the repo on my website, and test to see what numbers I get on my PC.

Hello Andrew, Tuesday, May 20, 2008, 11:05:52 PM, you wrote:
-funbox-strict-fields.
I did try that, but it didn't seem to make any difference for me. [Maybe
it may be that ghc just not recompiled program when you supplied this switch. as i wrote, this switch by itself made your original program 1.5x faster on my box. try to delete .o/.exe before rebuilding and, without this switch representation for !Int32 is the same as for Int32 - only difference is that when data is assigned to such field they are evaluated first (and then boxed) it is not enabled by default, because for *non-primitive* datatypes such as B below automatic unboxing of strict fields of this type may decrease sharing and thus memory/performance. imagine for example: data A = A !B !B data B = B !Int !Int !Int !Int !Int b = B 1 1 1 1 1 a = A b b jhc automatically unboxes strict fields of primitive datatypes, which is guaranteed to provide only positive effects. may be somesay the same will be added to ghc -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Andrew,
Tuesday, May 20, 2008, 11:05:52 PM, you wrote:
-funbox-strict-fields.
I did try that, but it didn't seem to make any difference for me. [Maybe
it may be that ghc just not recompiled program when you supplied this switch.
Possibly. I notice that adding *some* switches makes GHC recompile everything, while others don't. [It's probably documented in the manual somewhere...]
and, without this switch representation for !Int32 is the same as for Int32 - only difference is that when data is assigned to such field they are evaluated first (and then boxed)
Right, OK - so strict evaluation, but some bitpattern in RAM.
it is not enabled by default, because for *non-primitive* datatypes such as B below automatic unboxing of strict fields of this type may decrease sharing and thus memory/performance.
Yeah, I can imagine...

Hello Andrew, Wednesday, May 21, 2008, 12:09:39 AM, you wrote:
Right, OK - so strict evaluation, but some bitpattern in RAM.
yes. in particular this allows to designate fields that should de initialized: data A = A {a :: Int, b :: !Int} main = print A{}
it is not enabled by default, because for *non-primitive* datatypes such as B below automatic unboxing of strict fields of this type may decrease sharing and thus memory/performance.
Yeah, I can imagine...
sometime i've seriously optimized my program by removing this switch and adding manual unboxing directives to the sources. it will be great if ghc will catch jhc here -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
Hello Andrew,
Wednesday, May 21, 2008, 12:09:39 AM, you wrote:
Right, OK - so strict evaluation, but some bitpattern in RAM.
yes. in particular this allows to designate fields that should de initialized:
data A = A {a :: Int, b :: !Int} main = print A{}
it is not enabled by default, because for *non-primitive* datatypes such as B below automatic unboxing of strict fields of this type may decrease sharing and thus memory/performance.
Yeah, I can imagine...
sometime i've seriously optimized my program by removing this switch and adding manual unboxing directives to the sources. it will be great if ghc will catch jhc here
Would you like to file a feature request? http://hackage.haskell.org/trac/ghc/newticket?type=bug Particularly if you have an example where JHC has done a better job optimising some simple example. So far I've seen no ticke Unless tickets are opened, progress will not be made. -- Don

bulat.ziganshin:
Hello Andrew,
Tuesday, May 20, 2008, 11:05:52 PM, you wrote:
-funbox-strict-fields.
I did try that, but it didn't seem to make any difference for me. [Maybe
it may be that ghc just not recompiled program when you supplied this switch. as i wrote, this switch by itself made your original program 1.5x faster on my box. try to delete .o/.exe before rebuilding
and, without this switch representation for !Int32 is the same as for Int32 - only difference is that when data is assigned to such field they are evaluated first (and then boxed)
it is not enabled by default, because for *non-primitive* datatypes such as B below automatic unboxing of strict fields of this type may decrease sharing and thus memory/performance. imagine for example:
data A = A !B !B data B = B !Int !Int !Int !Int !Int
b = B 1 1 1 1 1 a = A b b
jhc automatically unboxes strict fields of primitive datatypes, which is guaranteed to provide only positive effects. may be somesay the same will be added to ghc
I'm confused. GHC of course unboxes strict fields of primitive data types. {-# OPTIONS -O2 -fvia-C -optc-O2 -funbox-strict-fields #-} data A = A !B !B data B = B !Int !Int !Int !Int !Int b = B 1 2 3 4 5 a = A b b go :: Int -> A go 0 = a go n = go (n-1) main = print $ case go 10 of A (B a b c d e) (B a' b' c' d' e') -> a + b + c + d + e + a' + b' + c' + d' + e' For example, go compiles to: $wgo :: Int# -> (# Int#, Int#, Int#, Int#, Int#, Int#, Int#, Int#, Int#, Int# #) $wgo = \ (ww_svf :: Int#) -> case ww_svf of ds_Xmw { __DEFAULT -> $wgo (-# ds_Xmw 1); 0 -> (# 1, 2, 3, 4, 5, 1, 2, 3, 4, 5 #) Values are passed in registers, the structures are entirely unpacked into registers or stack for return. Were you not aware of this Bulat? -- Don

On May 20, 2008, at 16:13 , Don Stewart wrote:
bulat.ziganshin:
Hello Andrew,
Tuesday, May 20, 2008, 11:05:52 PM, you wrote:
-funbox-strict-fields.
I did try that, but it didn't seem to make any difference for me. [Maybe
jhc automatically unboxes strict fields of primitive datatypes, which is guaranteed to provide only positive effects. may be somesay the same will be added to ghc
I'm confused. GHC of course unboxes strict fields of primitive data types.
{-# OPTIONS -O2 -fvia-C -optc-O2 -funbox-strict-fields #-}
I think "automatically" means "without -funbox-strict-fields".... -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Don Stewart wrote:
I'm confused. GHC of course unboxes strict fields of primitive data types.
{-# OPTIONS -O2 -fvia-C -optc-O2 -funbox-strict-fields #-}
... but only when you give -funbox-strict-fields, as there, or UNPACK. The point is that it never loses sharing to unbox a strict Int field [1], so it should do that with -O, even without either of the above overrides. [1] I'm not sure if this is true... if it has to rebox the Int, you get another copy floating around, not the original, right? -Isaac

isaacdupree:
Don Stewart wrote:
I'm confused. GHC of course unboxes strict fields of primitive data types.
{-# OPTIONS -O2 -fvia-C -optc-O2 -funbox-strict-fields #-}
... but only when you give -funbox-strict-fields, as there, or UNPACK. The point is that it never loses sharing to unbox a strict Int field [1], so it should do that with -O, even without either of the above overrides.
[1] I'm not sure if this is true... if it has to rebox the Int, you get another copy floating around, not the original, right?
See section 8.12.10. UNPACK pragma: "if the T constructor is scrutinised and the floats passed to a non-strict function for example, they will have to be reboxed" That said, the strictness analyser is beefy enough, and strict data structures commone enough, that this seems to almost ubiquitously be a win. Enabling it at -O or -O2 makes perfect sense. -- Don

| > I'm confused. GHC of course unboxes strict fields of primitive data types. | > | > {-# OPTIONS -O2 -fvia-C -optc-O2 -funbox-strict-fields #-} | | ... but only when you give -funbox-strict-fields, as there, or UNPACK. | The point is that it never loses sharing to unbox a strict Int field | [1], so it should do that with -O, even without either of the above | overrides. | | [1] I'm not sure if this is true... if it has to rebox the Int, you get | another copy floating around, not the original, right? Correct. If you have data T = MkT {-# UNPACK #-} !Int then given case x of { MkT y -> h y } then GHC must re-box the 'y' before passing it to h. That's why unpacking strict fields isn't an unambiguous win. Perhaps it should be the default, though, which can be switched off, rather than the reverse. Simon

Simon Peyton-Jones wrote:
| [1] I'm not sure if this is true... if it has to rebox the Int, you get | another copy floating around, not the original, right?
Correct. If you have
data T = MkT {-# UNPACK #-} !Int then given case x of { MkT y -> h y } then GHC must re-box the 'y' before passing it to h. That's why unpacking strict fields isn't an unambiguous win. Perhaps it should be the default, though, which can be switched off, rather than the reverse.
perhaps there are certain types that the compiler almost always manages to pass around unboxed -- Ints, for example? (Or at least that the extra copies aren't likely to persist for a long time, wasting some memory. Especially not if they're always unboxed in strict fields -- except for strict polymorphic fields, unfortunately.) And other types that aren't usually passed around unboxed? If this is clear, then maybe it's useful to do it by default for those types. Perhaps there should be a {-# NOUNPACK #-} pragma just in case someone wants to make sure a strict field isn't represented unboxed (not even if you give -funbox-strict-fields). -Isaac

Hello Don, Wednesday, May 21, 2008, 12:13:14 AM, you wrote:
it is not enabled by default, because for *non-primitive* datatypes
I'm confused. GHC of course unboxes strict fields of primitive data types.
{-# OPTIONS -O2 -fvia-C -optc-O2 -funbox-strict-fields #-}
it is not enabled by default... -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

2008/5/20 Andrew Coppin
Ah, so *that's* what bang patterns do?
Exactly. If you have a pattern (even a simple one like 'n'), and you prefix it with a bang (!), the function will force its evaluation.
2) make_block_bs is sub-optimal, and very critical to performance.
Yeah, so I see. (60% time spent here... ouch!) I'll bet that's where C is beating me...
Maybe that's a bit optimistic, but it's a start :).
I decided to use Data.Binary for it
I'm not familiar with that library (i.e. what it does or how you use it).
It's a library to serialize / deserialize haskell values. It uses ByteStrings, and it's very fast. I suggest to take a look at it (it's on Hackage).
I did try that, but it didn't seem to make any difference for me. [Maybe it does now because of your other improvements? Which version of GHC and which platform are you running on? I'm GHC 6.8.2 on Windows...]
If you recompiled the executable correctly, maybe you were doing profiling? Profiling slows some simple operations so much, that the benefits of a low-level optimization like this can be lost. Salvatore
participants (7)
-
Andrew Coppin
-
Brandon S. Allbery KF8NH
-
Bulat Ziganshin
-
Don Stewart
-
Isaac Dupree
-
Salvatore Insalaco
-
Simon Peyton-Jones