[GHC] #14644: Improve cmm/assembly for pattern matches with two constants.

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, | Operating System: Unknown/Multiple Patterns, Pattern Matching | Architecture: | Type of failure: None/Unknown Unknown/Multiple | Test Case: | Blocked By: Blocking: | Related Tickets: Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- For a pattern like: {{{ f_test :: Int#->Int# f_test a = case a of 1# -> 33# 7# -> 34# _ -> -1# }}} GHC currently generates code that works best if the default branch is taken most often. Pseudo-Code {{{ if (a >= 8) return -1; else { if (a < 7) { if(a != 1) return -1; else return 33; } else return 34; } }}} CMM: {{{ c1cr: // global if (%MO_S_Ge_W64(R2, 8)) goto c1co; else goto u1cu; u1cu: // global if (%MO_S_Lt_W64(R2, 7)) goto u1cv; else goto c1cq; u1cv: // global if (R2 != 1) goto c1co; else goto c1cp; c1co: // global R1 = (-1); call (P64[Sp])(R1) args: 8, res: 0, upd: 8; c1cp: // global R1 = 33; call (P64[Sp])(R1) args: 8, res: 0, upd: 8; c1cq: // global R1 = 34; call (P64[Sp])(R1) args: 8, res: 0, upd: 8; } }}} Wouldn't the following be better? {{{ if(a == 1) return 33 else if(a == 7) return 34 else return -1 }}} I would expect that to * Improve the cases: * a = 1 * a < 7 * Be the same for: * a = 7 * Be worse for * a > 8 This would be especially nice for the cases where the default branch is raising a pattern match exception. Which is a code path I wouldn't expect to be taken often nor be very performance sensitive. Even if we keep the current logic there is room for improvement. GHC currently creates the following assembly: {{{ _c1cr: cmpq $8,%r14 jge _c1co _u1cu: cmpq $7,%r14 jl _u1cv _c1cq: movl $34,%ebx jmp *(%rbp) _u1cv: cmpq $1,%r14 jne _c1co _c1cp: movl $33,%ebx jmp *(%rbp) _c1co: movq $-1,%rbx jmp *(%rbp) }}} It would be nice if we could remove one comparison at least. {{{ _c1cr: cmpq $7,%r14 jg _c1co _u1cu: ;No longer neccesary cmpq $7,%r14 jl _u1cv _c1cq: movl $34,%ebx jmp *(%rbp) _u1cv: cmpq $1,%r14 jne _c1co _c1cp: movl $33,%ebx jmp *(%rbp) _c1co: movq $-1,%rbx jmp *(%rbp) }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Looks plausible to me. I wonder how we'd measure any potential benefit? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by AndreasK): It should also be possible to see the difference using criterion or instruction counters when running affected code in a tight loop. I will take a stab at changing the logic to `if ... else if ... else` for these cases. I think that's a good starting point to learn how Cmm works inside GHC. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by svenpanne): I am not sure if things are that easy: To do these kind of switches in a really good way, one would need a probability distribution how often the individual cases actually happen. If we e.g. assume that all Ints happen equally often in the example above, it would be best to check for <1 and >7 first, so we would get roughly 1.5 comparisons on average. Depending on the number of registers available, you can even get away with 1 subtraction and 1 unsigned comparison for this range check, a classic C hack (ab)using wrap-around for unsigned ints. If we have some hints, e.g. raising a pattern matching exception, we could do better, e.g. assume 0% probability for this case. If we have more detailed (estimated) probabilities, we could do a Huffman-like decision tree. This is where profile-guided optimization shines. Additional things to consider: Performance in tight loops is often vastly different, because branch prediction/caching will most likely kick in visibly. Correctly predicted branches will cost you almost nothing, while unknown/incorrectly predicted branches will be much more costly. In the absence of more information from their branch predictor, quite a few processors assume that backward branches are taken and forward branches are assumed to be not taken. So code layout has a non-trivial performance impact. Instruction counts are quite misleading nowadays... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by AndreasK): Replying to [comment:3 svenpanne]:
I am not sure if things are that easy: To do these kind of switches in a really good way, one would need a probability distribution how often the individual cases actually happen.
Additional things to consider: Performance in tight loops is often vastly different, because branch prediction/caching will most likely kick in visibly. Correctly predicted branches will cost you almost nothing, while unknown/incorrectly predicted branches will be much more costly. In
This is by no means meant as a "this is optimal" idea. As far as I'm aware there is currently no way to propagate probabilities through GHC's various stages. So even in cases where we can make a judgement on the probability we can't really use it at the Cmm level. In the end I suggested it because it: * I think it has a good chance of being faster * Leads to smaller code * Seems to be what gcc/clang do (using conditional moves instead of jumps but still). the absence of more information from their branch predictor, quite a few processors assume that backward branches are taken and forward branches are assumed to be not taken. So code layout has a non-trivial performance impact.
Instruction counts are quite misleading nowadays...
Indeed :( -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4294 Wiki Page: | -------------------------------------+------------------------------------- Changes (by AndreasK): * owner: (none) => AndreasK * differential: => Phab:D4294 Comment: I ran a simple Benchmark and the results seem almost too good with >10% gains in almost all cases. If it really is that good then, well, I guess good for us. Assuming I didn't introduce an error somewhere I assume the combination of smaller code and less instructions on some paths is just better in general.
If we e.g. assume that all Ints happen equally often in the example above, it would be best to check for <1 and >7 first, so we would get roughly 1.5 comparisons on average. Depending on the number of registers available, you can even get away with 1 subtraction and 1 unsigned comparison for this range check, a classic C hack (ab)using wrap-around for unsigned ints.
I thought about this a bit. While true it also means the first comparison has a hard time with prediction if the numbers are a random distribution over the range. So two 99.9% correctly predicted comparisons might still be better even without considering code size. `range check, jg, je, cmp, je, j <default>` could still be better. But not sure if I will look into that and it might warrant it's own ticket as GHC also uses this for range checks on jump tables I think. Code used for the test: {{{ f_test :: Int -> Int f_test 111 = 1111 f_test 222 = 2222 f_test _ = -1 run = print . sum . map f_test --main = print . sum . map f_test $ ([(-1500000000::Int) .. 0]) main = do args <- getArgs :: IO [String] if null args then putStrLn "Usage: pos|neg|both" else case (head args) of "pos" -> run [0 .. (1500000000::Int)] "neg" -> run [-1500000000::Int .. 0] "both" -> run [-750000000::Int .. 750000000::Int] "negrep" -> run . concat . replicate 100000000 $ [-311, -322, -333, -444] "posrep" -> run . concat . replicate 100000000 $ [311, 322, 333, 444] "unpre" -> run . concat . replicate 100000000 $ [111, 322, -333, 222] }}} I get these results for the current approach (range) and mine (if) on an 6700K === With inlining: {{{ benchmarking execute: range pos mean 1.196 s (1.196 s .. 1.198 s) benchmarking execute: if pos mean 806.0 ms (803.5 ms .. 807.3 ms) benchmarking execute: range neg mean 2.384 s (2.383 s .. 2.385 s) benchmarking execute: if neg mean 1.841 s (1.841 s .. 1.842 s) benchmarking execute: range negrep mean 840.1 ms (838.1 ms .. 841.3 ms) benchmarking execute: if negrep mean 728.6 ms (728.3 ms .. 728.8 ms) benchmarking execute: range unpre mean 852.2 ms (851.1 ms .. 853.1 ms) benchmarking execute: if unpre mean 789.7 ms (789.3 ms .. 789.9 ms) }}} === With inlining disabled on f_test {{{ benchmarking execute: range pos mean 2.385 s (2.383 s .. 2.386 s) benchmarking execute: if pos mean 2.383 s (2.382 s .. 2.384 s) benchmarking execute: range neg mean 2.845 s (2.839 s .. 2.848 s) benchmarking execute: if neg mean 2.047 s (2.041 s .. 2.053 s) benchmarking execute: range negrep mean 1.204 s (1.201 s .. 1.205 s) benchmarking execute: if negrep mean 829.4 ms (828.4 ms .. 830.1 ms) benchmarking execute: range unpre mean 1.165 s (1.164 s .. 1.166 s) benchmarking execute: if unpre mean 1.118 s (1.117 s .. 1.119 s) }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4294 Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): A 10% improvement is amazing. If you commit this, please include a Note explaining the strategy, and pointing to the ticket. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4294 Wiki Page: | -------------------------------------+------------------------------------- Comment (by AndreasK): Replying to [comment:6 simonpj]:
A 10% improvement is amazing.
Indeed! Nothing in nofib gets up to 10% but wheel-sieve1 comes close. * 17 programs are changed (contain that kind of case statement) * 15 Show no significant runtime difference. (It's fair to assume it's an improvement but not mensurable with the noise on my machine) * k-nucleotide improves by ~1.2% * wheel-sieve1 has the code in a inner loop. It improves by ~4.3% in mode=normal. Increasing the problem size makes the difference even bigger up to somewhere around 8-9%
If you commit this, please include a Note explaining the strategy, and
pointing to the ticket. Will do. **** The patch is up on phab. If someone can run this on an older intel (or even better amd) processor and check the result that would be great to make sure it's not just a pecularity of my i7-6700K. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4294 Wiki Page: | -------------------------------------+------------------------------------- Changes (by AndreasK): * status: new => patch -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4294 Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Didn't you say "I ran a simple Benchmark and the results seem almost too good with >10% gains in almost all cases.". But now you say "mostly no change; one gets 5-9%". Were the earlier measurements wrong? Anyway the patch looks good. Thanks -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4294 Wiki Page: | -------------------------------------+------------------------------------- Comment (by AndreasK): Replying to [comment:9 simonpj]:
Didn't you say "I ran a simple Benchmark and the results seem almost too good with >10% gains in almost all cases.". But now you say "mostly no change; one gets 5-9%". Were the earlier measurements wrong?
No these are right. But the simple benchmark was running the code I posted above in [https://ghc.haskell.org/trac/ghc/ticket/14644?replyto=9#comment:5 comment 5]. All cases referred to the pos/neg/.. branches in that case. Nofib seems to be in a permanent state of breakdown on windows so I stopped using it in my initial benchmarks. And it was the [https://ghc.haskell.org/trac/ghc/ticket/14656 right] [https://ghc.haskell.org/trac/ghc/ticket/14655 thing] [https://ghc.haskell.org/trac/ghc/ticket/14654 to do ] as it took some work to get it running. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Additional things to consider: Performance in tight loops is often vastly different, because branch prediction/caching will most likely kick in visibly. Correctly predicted branches will cost you almost nothing, while unknown/incorrectly predicted branches will be much more costly. In
#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4294 Wiki Page: | -------------------------------------+------------------------------------- Comment (by AndreasK): Replying to [comment:3 svenpanne]: the absence of more information from their branch predictor, quite a few processors assume that backward branches are taken and forward branches are assumed to be not taken. So code layout has a non-trivial performance impact. I went over Agners guide and it seems like this is only for Netburst CPU's, the last of which was released in 2001 so I'm not too worried about these. And even if you have on of these according to Agner:
It is rarely worth the effort to take static prediction into account. Almost any branch that is executed sufficiently often for its timing to have any significant effect is likely to stay in the BTB so that only the dynamic prediction counts.
All other architectures he lists default to not taken if they use static prediction at all. ---- What might help explain the difference is that jumps not taken should be faster than taken jumps on both modern Intel and AMD CPU's. If someone wants to dig deeper Agner probably has enough info in the guides to explain the change completely based on the assembly generated. But I don't think that is necessary. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4294 Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): It would be great to ensure that there is a Note to explain the thinking before you tie the bow on this. Thanks! Simon -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4294 Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Are we good to go on this? I.e. commit? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: patch Priority: normal | Milestone: Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4294 Wiki Page: | -------------------------------------+------------------------------------- Comment (by AndreasK): Replying to [comment:13 simonpj]:
Are we good to go on this? I.e. commit?
From my side the patch is good to go as it is. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#14644: Improve cmm/assembly for pattern matches with two constants.
-------------------------------------+-------------------------------------
Reporter: AndreasK | Owner: AndreasK
Type: task | Status: patch
Priority: normal | Milestone:
Component: Compiler | Version: 8.2.2
(CodeGen) | Keywords: Codegen, CMM,
Resolution: | Patterns, Pattern Matching
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D4294
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari

#14644: Improve cmm/assembly for pattern matches with two constants. -------------------------------------+------------------------------------- Reporter: AndreasK | Owner: AndreasK Type: task | Status: closed Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 8.2.2 (CodeGen) | Keywords: Codegen, CMM, Resolution: fixed | Patterns, Pattern Matching Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4294 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * status: patch => closed * resolution: => fixed * milestone: => 8.6.1 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14644#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC