(timestamp is 31:07)
basically, if you have an interpreter consisting of a switch statement inside an infinite loop, there's a mechanical transformation using gotos that he says yields roughly a 15% speedup
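to make the shape concrete, here's a minimal sketch of the switch-in-a-loop style. the tiny instruction set (OP_INC etc.) is my invention, not from the talk:

    #include <stddef.h>
    #include <stdint.h>

    /* toy instruction set -- placeholder names, not from the talk */
    enum { OP_INC, OP_DEC, OP_HALT };

    /* classic switch dispatch: every handler jumps back to the shared
       top of the loop, and (with enough opcodes) the switch compiles
       to a single indirect jmp through a jump table */
    int64_t run_switch(const uint8_t *code) {
        int64_t acc = 0;
        size_t pc = 0;
        for (;;) {
            switch (code[pc++]) {
            case OP_INC:  acc++; break;
            case OP_DEC:  acc--; break;
            case OP_HALT: return acc;
            }
        }
    }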
in the switch-statement setup there are n + 1 basic blocks: a dispatch block at the top plus n handler blocks, each of which ends with a jmp back to the dispatch block with probability 1, while the dispatch block ends with an indirect jmp that goes to one of the n handler blocks with probability 1/n
whereas in the goto setup there are n basic blocks, each of which ends with an indirect jmp that goes to one of the other n - 1 basic blocks with probability 1/(n - 1)
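and here's what the mechanical transformation looks like on the same toy instruction set, using the labels-as-values extension (GNU C, supported by GCC and Clang). the dispatch is duplicated at the end of every handler, so each opcode gets its own indirect jmp instruction:

    /* computed-goto dispatch: handlers jump directly to each other
       instead of going back through a shared dispatch block */
    int64_t run_goto(const uint8_t *code) {
        static void *dispatch[] = { &&do_inc, &&do_dec, &&do_halt };
        int64_t acc = 0;
        size_t pc = 0;
        goto *dispatch[code[pc++]];
    do_inc:  acc++; goto *dispatch[code[pc++]];
    do_dec:  acc--; goto *dispatch[code[pc++]];
    do_halt: return acc;
    }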
so if your sequence of instructions is produced by a Markov process, the branch predictor can predict efficiently in the latter setup but not necessarily in the former
in the goto setup the control-flow graph looks like a complete graph on n vertices
whereas in the switch setup it looks like a bipartite graph on (1, n) vertices (with edges in both directions: the dispatch block jumps to each handler, and each handler jumps back)
and because of all those extra edges (roughly n^2 rather than 2n) there are more distinct branch sites where the predictor can build up history
At 39:13 he explains that you get per-address jmp predictions. If all opcodes share a single indirect jmp, the CPU can only learn how common each opcode is overall. But if each opcode's handler has its own jmp to the next handler, the CPU can learn how often an opcode A is followed by an opcode B.
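one way to see this for yourself is to feed both versions above an instruction stream with Markov structure. this driver is hypothetical (the 90/10 transition probabilities are made up), but it produces exactly the kind of correlated stream the per-address predictor can exploit and the shared jmp can't:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        /* bytecode stream where OP_INC is usually followed by OP_DEC
           and vice versa -- a simple Markov chain over opcodes */
        size_t n = 1 << 24;
        uint8_t *code = malloc(n + 1);
        if (!code) return 1;
        uint8_t cur = OP_INC;
        for (size_t i = 0; i < n; i++) {
            code[i] = cur;
            /* 90% of the time take the "typical" successor, else repeat */
            if (rand() % 10 != 0)
                cur = (cur == OP_INC) ? OP_DEC : OP_INC;
        }
        code[n] = OP_HALT;
        printf("switch: %lld\n", (long long)run_switch(code));
        printf("goto:   %lld\n", (long long)run_goto(code));
        free(code);
        return 0;
    }

timing the two calls, or running each under perf stat -e branch-misses, should show the goto version mispredicting less on a stream like this.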
He references Compiler Explorer (https://godbolt.org/), which seems like a useful tool, and mentions that Ruby, the JVM (on Android only?), and some ARM compilers implement this.
There are some tradeoffs: code size does increase. According to a GCC developer in the audience, GCC is not currently able to do this transformation automatically. I have no idea about Clang.
Thanks