Switch/Case optimisations

I thought Andreas Klebinger in particular might find this interesting. The talk is here: https://youtu.be/IAdLwUXRUvg?t=1867 (timestamp is 31:07).

Basically, if you have an interpreter consisting of a switch statement inside an infinite loop, there's a mechanical transformation using GOTOs that gives roughly a 15% speedup. In the switch setup there are n + 1 basic blocks: each of the n opcode blocks ends in a jmp back to the dispatch block with probability 1, and the dispatch block ends in a jmp to one of the other n blocks with probability 1/n. In the goto setup there are n basic blocks, each of which ends in a jmp to one of the other n - 1 blocks with probability 1/(n - 1). So if your sequence of instructions is produced by a Markov process, the branch predictor will predict efficiently in the latter situation but not necessarily in the former. In the goto setup the control-flow graph looks like a complete graph on n vertices, whereas in the switch setup it looks like a bipartite graph on (1, n) vertices (where the edges represent jumps in either direction), and because of all those extra edges (n^2 rather than 2n) there are more spots for the branch predictor to build up data.

At 39:13 he explains that you get per-address jmp predictions. If every opcode shares a single dispatch jmp, the CPU only learns how common each opcode is overall. But if each opcode ends in its own jmp to the next one, the CPU can learn how often an opcode A is followed by an opcode B (see the sketch at the end of this mail).

He references Godbolt (https://godbolt.org/), which seems like a useful tool, and mentions that Ruby, the JVM (on Android only?), and some ARM compilers implement this. There are some trade-offs - code size does increase. According to a GCC developer in the audience, GCC is not currently able to do this. I have no idea about Clang.

Thanks
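
P.S. For anyone who hasn't seen the two dispatch styles side by side, here is a minimal C sketch of what I understand the transformation to be. It is not taken from the talk: the opcode names, the tiny instruction set, and the use of the GCC/Clang "labels as values" extension (&&label, goto *) are my own illustration of the idea.

/* A minimal sketch of the two dispatch styles (not from the talk).
   Requires GCC or Clang for the computed-goto (labels-as-values) extension. */

#include <stddef.h>
#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

/* Style 1: switch inside an infinite loop.
   Dispatch happens through a single indirect jump (the switch), so the
   branch predictor sees one jump site shared by every opcode and can only
   learn overall opcode frequencies. */
static long run_switch(const unsigned char *code)
{
    long acc = 0;
    for (size_t pc = 0; ; pc++) {
        switch (code[pc]) {
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        }
    }
}

/* Style 2: "threaded" dispatch via computed goto.
   Each opcode body ends in its own indirect jump, so the predictor can
   learn per-site statistics: which opcode tends to follow which. */
static long run_goto(const unsigned char *code)
{
    static void *dispatch[] = { &&do_inc, &&do_dec, &&do_halt };
    long acc = 0;
    size_t pc = 0;

    goto *dispatch[code[pc]];

do_inc:  acc++; pc++; goto *dispatch[code[pc]];
do_dec:  acc--; pc++; goto *dispatch[code[pc]];
do_halt: return acc;
}

int main(void)
{
    const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    printf("switch: %ld\n", run_switch(prog));
    printf("goto:   %ld\n", run_goto(prog));
    return 0;
}

Both functions compute the same thing; the only difference is where the indirect jumps live, which is exactly the property the branch predictor cares about.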