Branchless implementation for literal case – is it worth it?

Dear devs, in https://ghc.haskell.org/trac/ghc/ticket/10124 bgamari suggested that code like f :: Int -> Bool f a = case a of 1 -> True 5 -> True 8 -> True 9 -> True 11 -> True 19 -> True _ -> False {-# NOINLINE f #-} should not compile to a series of conditional jumps, but rather a branchless code akin to let !p = a ==# 1 `orI#` a ==# 5 `orI#` a ==# 8 `orI#` a ==# 9 `orI#` a ==# 11 `orI#` a ==# 19 case p of 1 -> do_something 0 -> do_something_else Subsequently, I refactored the way we produce Cmm code from STG, opening the possibility to implement this optimization at that stage¹. But when I then added this implementation, I could not measure any runtime improvement, see https://ghc.haskell.org/trac/ghc/ticket/10124#comment:15 So my question to the general audience: Is such branchless code really better than the current, branching code? Can someone provide us with an example that shows that it is better? Do I need to produce different branchless assembly? If it is not better, I can again refactor the switch generation and simplify it a bit again. Hmm, only now I see that rwbarton has replied there. Not sure why I missed that. Anyways, more voices are welcome :-) Greetings, Joachim ¹ This should not preclude an implementation on the Core level, which SPJ preferred. But I improved other aspects of the code generation as well, so this is worthwhile on its own. -- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

On Apr 19, 2015 09:44, "Joachim Breitner"
Dear devs,
in https://ghc.haskell.org/trac/ghc/ticket/10124 bgamari suggested that code like
f :: Int -> Bool f a = case a of 1 -> True 5 -> True 8 -> True 9 -> True 11 -> True 19 -> True _ -> False {-# NOINLINE f #-}
should not compile to a series of conditional jumps, but rather a branchless code akin to
let !p = a ==# 1 `orI#` a ==# 5 `orI#` a ==# 8 `orI#` a ==# 9 `orI#` a ==# 11 `orI#` a ==# 19 case p of 1 -> do_something 0 -> do_something_else
Subsequently, I refactored the way we produce Cmm code from STG, opening the possibility to implement this optimization at that stage¹.
But when I then added this implementation, I could not measure any runtime improvement, see https://ghc.haskell.org/trac/ghc/ticket/10124#comment:15
So my question to the general audience: Is such branchless code really better than the current, branching code? Can someone provide us with an example that shows that it is better? Do I need to produce different branchless assembly?
If it is not better, I can again refactor the switch generation and simplify it a bit again.
Hmm, only now I see that rwbarton has replied there. Not sure why I missed that. Anyways, more voices are welcome :-)
Greetings, Joachim
¹ This should not preclude an implementation on the Core level, which SPJ preferred. But I improved other aspects of the code generation as well, so this is worthwhile on its own.
-- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Em domingo, 19 de abril de 2015, Joachim Breitner
Dear devs,
in https://ghc.haskell.org/trac/ghc/ticket/10124 bgamari suggested that code like
f :: Int -> Bool f a = case a of 1 -> True 5 -> True 8 -> True 9 -> True 11 -> True 19 -> True _ -> False {-# NOINLINE f #-}
should not compile to a series of conditional jumps, but rather a branchless code akin to
let !p = a ==# 1 `orI#` a ==# 5 `orI#` a ==# 8 `orI#` a ==# 9 `orI#` a ==# 11 `orI#` a ==# 19 case p of 1 -> do_something 0 -> do_something_else
I'd suggest to compile this particular case as a bittest in a 32-bit word. Gabor
Subsequently, I refactored the way we produce Cmm code from STG, opening the possibility to implement this optimization at that stage¹.
But when I then added this implementation, I could not measure any runtime improvement, see https://ghc.haskell.org/trac/ghc/ticket/10124#comment:15
So my question to the general audience: Is such branchless code really better than the current, branching code? Can someone provide us with an example that shows that it is better? Do I need to produce different branchless assembly?
If it is not better, I can again refactor the switch generation and simplify it a bit again.
Hmm, only now I see that rwbarton has replied there. Not sure why I missed that. Anyways, more voices are welcome :-)
Greetings, Joachim
¹ This should not preclude an implementation on the Core level, which SPJ preferred. But I improved other aspects of the code generation as well, so this is worthwhile on its own.
-- Joachim “nomeata” Breitner mail@joachim-breitner.de javascript:; • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de javascript:; • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org javascript:;

Joachim Breitner
Dear devs,
in https://ghc.haskell.org/trac/ghc/ticket/10124 bgamari suggested that code like
f :: Int -> Bool f a = case a of 1 -> True 5 -> True 8 -> True 9 -> True 11 -> True 19 -> True _ -> False {-# NOINLINE f #-}
should not compile to a series of conditional jumps, but rather a branchless code akin to
let !p = a ==# 1 `orI#` a ==# 5 `orI#` a ==# 8 `orI#` a ==# 9 `orI#` a ==# 11 `orI#` a ==# 19 case p of 1 -> do_something 0 -> do_something_else
Hi Joachim, Thanks for trying this and I'm sorry to hear that the optimization hasn't yielded the improvement I would have expected. I've been a bit silent on the issue as I've been working on finishing up my PhD. Hopefully in a couple of months after I'm settled in Germany I'll have time to delve into this more deeply.
Subsequently, I refactored the way we produce Cmm code from STG, opening the possibility to implement this optimization at that stage¹.
But when I then added this implementation, I could not measure any runtime improvement, see https://ghc.haskell.org/trac/ghc/ticket/10124#comment:15
So my question to the general audience: Is such branchless code really better than the current, branching code? Can someone provide us with an example that shows that it is better? Do I need to produce different branchless assembly?
Have you had a look at the Intel Optimization manual? I'll admit that I haven't looked myself but I'd imagine they'd probably have a few things to say about this. Cheers, - Ben

the optimization manual ben mentions is
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-...
it does a very good job laying out the scope of both *generality* and
*impact* for a huge range of systemsy tricks.
there are certain conditions when that transformation can be a win, but
they generally correspond with *inner loop* scenarios where bad branch
prediction heavily impacts performance. One class of algorithms that might
provide some grist for the optimization mill would be those in the hacker's
delight book http://www.hackersdelight.org/
I have written some code that has that bit fiddling vs bad branch
prediction encoding tradeoff before, i'll see if i can dig any of it up if
you'd like!
On Sun, Apr 19, 2015 at 10:41 AM, Ben Gamari
Joachim Breitner
writes: Dear devs,
in https://ghc.haskell.org/trac/ghc/ticket/10124 bgamari suggested that code like
f :: Int -> Bool f a = case a of 1 -> True 5 -> True 8 -> True 9 -> True 11 -> True 19 -> True _ -> False {-# NOINLINE f #-}
should not compile to a series of conditional jumps, but rather a branchless code akin to
let !p = a ==# 1 `orI#` a ==# 5 `orI#` a ==# 8 `orI#` a ==# 9 `orI#` a ==# 11 `orI#` a ==# 19 case p of 1 -> do_something 0 -> do_something_else
Hi Joachim,
Thanks for trying this and I'm sorry to hear that the optimization hasn't yielded the improvement I would have expected. I've been a bit silent on the issue as I've been working on finishing up my PhD. Hopefully in a couple of months after I'm settled in Germany I'll have time to delve into this more deeply.
Subsequently, I refactored the way we produce Cmm code from STG, opening the possibility to implement this optimization at that stage¹.
But when I then added this implementation, I could not measure any runtime improvement, see https://ghc.haskell.org/trac/ghc/ticket/10124#comment:15
So my question to the general audience: Is such branchless code really better than the current, branching code? Can someone provide us with an example that shows that it is better? Do I need to produce different branchless assembly?
Have you had a look at the Intel Optimization manual? I'll admit that I haven't looked myself but I'd imagine they'd probably have a few things to say about this.
Cheers,
- Ben
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

2015-04-19 9:44 GMT+02:00 Joachim Breitner
[...] So my question to the general audience: Is such branchless code really better than the current, branching code? Can someone provide us with an example that shows that it is better? Do I need to produce different branchless assembly? [...]
Just a few war stories regarding this from the trenches (= Chrome's JavaScript JIT): * Branchless code in itself is a non-goal. What you care about is performance and/or code size, but both don't have a direct relationship to using branches or not. * "Hacker's Delight" is a cool book, but don't use the bit fiddling tricks in there blindly. We actually reverted to straightforward code with branches from some seemingly "better" branchless code, because even with branches the performance was better. * Even within a processor family like x64, performance characteristics vary vastly. What can be a performance improvement for the modern beefy Xeon machine you're benchmarking on, can make things worse for a low-end Atom. The same holds in the other direction. * The same holds for different architectures, i.e. an "optimization" which makes things fast on most Intel cores could make things worse on e.g. ARM cores (and vice versa). * On more powerful cores with heavy out-of-order execution, it's hard to beat a well-predicted branch. * On Linux, the perf tool is your best friend. Without it you don't have a clue what's making your code slower than expected, it could be bad branch prediction, stalled units with the CPU, bad luck with caches, etc. * Micro-benchmarks can be highly misleading, e.g. due to totally different branching patterns, cache usage, etc. In a nutshell: If you don't know the details of the architecture you're compiling for, you simply don't know if the "optimization" you have in mind actually makes things better or worse. Therefore these kind of decision have to be pushed very far towards the end of the compiler pipeline. Having some kind of feedback about previous runs of the same code is very helpful, too, but this is a bit complicated in a batch compiler (but doable).

Hi, Am Montag, den 20.04.2015, 16:29 +0200 schrieb Sven Panne:
2015-04-19 9:44 GMT+02:00 Joachim Breitner
: [...] So my question to the general audience: Is such branchless code really better than the current, branching code? Can someone provide us with an example that shows that it is better? Do I need to produce different branchless assembly? [...]
Just a few war stories regarding this from the trenches (= Chrome's JavaScript JIT):
thanks a lot. The conclusion I draw from your mail, at last for our case, is: Don’t bother (and keep the compiler code simple). Is that a correct reading? Greetings, Joachim -- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

2015-04-20 16:41 GMT+02:00 Joachim Breitner
The conclusion I draw from your mail, at last for our case, is: Don’t bother (and keep the compiler code simple). Is that a correct reading?
Yes, I think that's the right approach. Do simple things like e.g. a distinction between sparse cases (=> "if" cascade), lots of sparse cases (=> some kind of decision tree) and a non-trivial number of dense cases (jump table). For the jump table, consider computed jumps (e.g. to a table of relative jumps, which is a common implementation for PIC) and indirect jumps (the usual address table) in the backend. The former could be faster than the latter (for some ARM cores IIRC), which is not exactly intuitive. Your mileage may vary. :-) As I said: This is our experience in a JIT, so I'd be interested in other people's experience, too, especially for real-world programs, not micro-benchmarks.

On Mon, Apr 20, 2015 at 10:29 AM, Sven Panne
* On more powerful cores with heavy out-of-order execution, it's hard to beat a well-predicted branch.
With regard to this, I was wondering if we have a way of arranging for good branch prediction in GHC. For instance (and I think I've discussed this with Carter before), we've had issues filed against vector (I think) saying that we do branches for certain things that could be branchless. Bounds checks are the first thing that comes to mind, for instance (I.E. we don't use orI# for the lower and upper bound checks). However, so long as branch prediction decides that the bounds checks should succeed, I'd expect them to not be much of a cost anyway---and ideally it will always decide that, because we're inclined not to care about the performance when the checks fail and the program is bombing. However, from what I've read, there's basically no easy hinting that can ensure this. The heuristics require moving generated code around if you want to try and influence the branch predictor one way. And we also have no way of indicating to GHC that branches of a case should be preferred anyway. Unless writing things in a certain order accomplishes this? -- Dan

I've gotten substantial speedups (>30%) in the past (e.g. in the hashtables
package) by replacing search loops with branchless code, but the technique
works best in situations where you have a long run of straight-line integer
code for the CPU to chew on; for branchless to be a win you need to feed
enough instructions into the pipeline such that the savings of getting 3
instructions per cycle and avoiding paying for branch mispredictions
outweighs the savings you would get from not doing the work. I don't think
the use case you posted falls into that category.
On Sun, Apr 19, 2015 at 12:44 AM, Joachim Breitner wrote: Dear devs, in https://ghc.haskell.org/trac/ghc/ticket/10124 bgamari suggested that
code like f :: Int -> Bool
f a = case a of
1 -> True
5 -> True
8 -> True
9 -> True
11 -> True
19 -> True
_ -> False
{-# NOINLINE f #-} should not compile to a series of conditional jumps, but rather a
branchless code akin to let !p = a ==# 1 `orI#` a ==# 5 `orI#` a ==# 8 `orI#` a ==# 9
`orI#` a ==# 11 `orI#` a ==# 19
case p of
1 -> do_something
0 -> do_something_else Subsequently, I refactored the way we produce Cmm code from STG, opening
the possibility to implement this optimization at that stage¹. But when I then added this implementation, I could not measure any
runtime improvement, see
https://ghc.haskell.org/trac/ghc/ticket/10124#comment:15 So my question to the general audience: Is such branchless code really
better than the current, branching code? Can someone provide us with an
example that shows that it is better? Do I need to produce different
branchless assembly? If it is not better, I can again refactor the switch generation and
simplify it a bit again. Hmm, only now I see that rwbarton has replied there. Not sure why I
missed that. Anyways, more voices are welcome :-) Greetings,
Joachim ¹ This should not preclude an implementation on the Core level, which
SPJ preferred. But I improved other aspects of the code generation as
well, so this is worthwhile on its own. --
Joachim “nomeata” Breitner
mail@joachim-breitner.de • http://www.joachim-breitner.de/
Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F
Debian Developer: nomeata@debian.org _______________________________________________
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs --
Gregory Collins
participants (8)
-
Ben Gamari
-
Carter Schonwald
-
Dan Doel
-
Gabor Greif
-
Gregory Collins
-
Joachim Breitner
-
Johan Tibell
-
Sven Panne