
2015-04-19 9:44 GMT+02:00 Joachim Breitner
[...] So my question to the general audience: Is such branchless code really better than the current, branching code? Can someone provide us with an example that shows that it is better? Do I need to produce different branchless assembly? [...]
Just a few war stories regarding this from the trenches (= Chrome's JavaScript JIT):

  * Branchless code in itself is a non-goal. What you care about is performance and/or code size, and neither has a direct relationship to whether you use branches.

  * "Hacker's Delight" is a cool book, but don't use the bit-fiddling tricks in there blindly. We actually reverted from some seemingly "better" branchless code to straightforward code with branches, because the version with branches performed better.

  * Even within a processor family like x64, performance characteristics vary vastly. What is a performance improvement on the modern beefy Xeon machine you're benchmarking on can make things worse on a low-end Atom. The same holds in the other direction.

  * The same holds for different architectures, i.e. an "optimization" which makes things fast on most Intel cores could make things worse on e.g. ARM cores (and vice versa).

  * On more powerful cores with heavy out-of-order execution, it's hard to beat a well-predicted branch.

  * On Linux, the perf tool is your best friend. Without it you don't have a clue what's making your code slower than expected: it could be bad branch prediction, stalled units within the CPU, bad luck with caches, etc.

  * Micro-benchmarks can be highly misleading, e.g. due to totally different branching patterns, cache usage, etc.

In a nutshell: If you don't know the details of the architecture you're compiling for, you simply don't know whether the "optimization" you have in mind actually makes things better or worse. Therefore this kind of decision has to be pushed very far towards the end of the compiler pipeline. Having some kind of feedback about previous runs of the same code is very helpful, too, although this is a bit complicated (but doable) in a batch compiler.
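
To make the branching vs. branchless comparison a bit more concrete, here is a minimal sketch in C. It is not code from V8 or GHC; the function names (max_branching, max_branchless, check_same_results) are made up for illustration, and unsigned 32-bit maximum is just a convenient stand-in for the kind of operation under discussion. Both functions compute the same result, and nothing in the source tells you which one is faster on a given core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Straightforward version. Depending on target and flags, a compiler
   may turn this into a compare-and-branch or into a conditional move. */
uint32_t max_branching(uint32_t a, uint32_t b) {
    if (a > b) {
        return a;
    }
    return b;
}

/* Bit-fiddling version: mask is all ones when a > b and all zeros
   otherwise, so the XOR either swaps b for a or leaves b alone. */
uint32_t max_branchless(uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)0 - (uint32_t)(a > b);
    return b ^ ((a ^ b) & mask);
}

/* Hypothetical sanity check: the two variants must agree before it
   makes any sense to compare their speed. */
void check_same_results(void) {
    static const uint32_t samples[] = {
        0u, 1u, 2u, 41u, 42u, UINT32_MAX - 1u, UINT32_MAX
    };
    const size_t n = sizeof samples / sizeof samples[0];
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            assert(max_branching(samples[i], samples[j]) ==
                   max_branchless(samples[i], samples[j]));
        }
    }
}
```

Which of the two wins depends on the core and on how predictable the comparison is in the real workload, so the only way to find out is to measure on the actual target, for example by running the real program (not just a micro-benchmark) under something like `perf stat -e branches,branch-misses`.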