Switch/Case optimisations

I thought Andreas Klebinger in particular might find this interesting. The talk is here: https://youtu.be/IAdLwUXRUvg?t=1867 (timestamp is 31:07).

Basically, if you have an interpreter consisting of a switch statement inside an infinite loop, there's a mechanical transformation using GOTOs that gives roughly a 15% speedup. In the switch setup there are n + 1 basic blocks: each of the n opcode blocks ends in a jmp back to the dispatch block with probability 1, and the dispatch block ends in a jmp to one of the other n blocks with probability 1/n. In the goto setup there are n basic blocks, each of which ends in a jmp to one of the other n - 1 blocks with probability 1/(n - 1). So if your sequence of instructions is produced by a Markov process, the branch predictor will predict efficiently in the latter situation but not necessarily in the former. In the goto setup the control-flow graph looks like a complete graph on n vertices, whereas in the switch setup it looks like a bipartite graph on (1, n) vertices (where the edges represent jumps in either direction), and because of all those extra edges (n^2 rather than 2n) there are more spots for the branch predictor to build up data.

At 39:13 he explains that you get per-address jmp predictions. If every opcode shares a single dispatch jmp, the CPU only learns how common each opcode is overall. But if each opcode ends in its own jmp to the next one, the CPU can learn how often an opcode A is followed by an opcode B (see the sketch at the end of this mail).

He references Godbolt (https://godbolt.org/), which seems like a useful tool, and mentions that Ruby, the JVM (on Android only?), and some ARM compilers implement this. There are some trade-offs - code size does increase. According to a GCC developer in the audience, GCC is not currently able to do this. I have no idea about Clang.

Thanks
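
P.S. For anyone who hasn't seen the two dispatch styles side by side, here is a minimal C sketch of what I understand the transformation to be. It is not taken from the talk: the opcode names, the tiny instruction set, and the use of the GCC/Clang "labels as values" extension (&&label, goto *) are my own illustration of the idea.

/* A minimal sketch of the two dispatch styles (not from the talk).
   Requires GCC or Clang for the computed-goto (labels-as-values) extension. */

#include <stddef.h>
#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

/* Style 1: switch inside an infinite loop.
   Dispatch happens through a single indirect jump (the switch), so the
   branch predictor sees one jump site shared by every opcode and can only
   learn overall opcode frequencies. */
static long run_switch(const unsigned char *code)
{
    long acc = 0;
    for (size_t pc = 0; ; pc++) {
        switch (code[pc]) {
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        }
    }
}

/* Style 2: "threaded" dispatch via computed goto.
   Each opcode body ends in its own indirect jump, so the predictor can
   learn per-site statistics: which opcode tends to follow which. */
static long run_goto(const unsigned char *code)
{
    static void *dispatch[] = { &&do_inc, &&do_dec, &&do_halt };
    long acc = 0;
    size_t pc = 0;

    goto *dispatch[code[pc]];

do_inc:  acc++; pc++; goto *dispatch[code[pc]];
do_dec:  acc--; pc++; goto *dispatch[code[pc]];
do_halt: return acc;
}

int main(void)
{
    const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    printf("switch: %ld\n", run_switch(prog));
    printf("goto:   %ld\n", run_goto(prog));
    return 0;
}

Both functions compute the same thing; the only difference is where the indirect jumps live, which is exactly the property the branch predictor cares about.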