Hi,
I'm interested in execution performance.
Modern hardware (implementations of Intel 64 or ARMv8, for example) may be able to predict a long chain of jumps [1].
But prediction accuracy for indirect jumps tends to be low,
especially for indirect jumps whose target address is only known dynamically.
By the way, Ryan's example code could be made faster with the following optimization:
(If c3aX is the hot path, it is reached without any taken branch.)
// Skip to the chase if it's already evaluated:
start:
// if (R2 & 7 != 0) goto fastpath; else goto slowpath;
if (R2 & 7 == 0) goto slowpath; // *** (1) remove branch for fastpath
fastpath: // Formerly c3aO // *** (1) move fastpath here
// if (R1 & 7 >= 2) goto c3aW; else goto c3aX;
if (R1 & 7 >= 2) goto c3aW; // *** (2) remove branch for the prioritized path (c3aX)
c3aX: // *** (2) move the else path here (reached without a branch)
R1 = PicBaseReg + lvl_r39S_closure;
call (I64[R1])(R1) args: 8, res: 0, upd: 8; // *** indirect jump, but fixed address (100% hit)
c3aW:
R1 = P64[R1 + 6] & (-8);
call (I64[R1])(R1) args: 8, res: 0, upd: 8; // *** indirect jump, dynamic address (hit or miss)
//c3aX:
// R1 = PicBaseReg + lvl_r39S_closure;
// call (I64[R1])(R1) args: 8, res: 0, upd: 8;
slowpath: // Formerly c3aY
if ((Sp + -8) < SpLim) goto c3aZ; else goto c3b0;
c3aZ:
// nop
R1 = PicBaseReg + foo_closure;
call (I64[BaseReg - 8])(R2, R1) args: 8, res: 0, upd: 8;
c3b0:
I64[Sp - 8] = PicBaseReg + block_c3aO_info;
R1 = R2;
Sp = Sp - 8;
call (I64[R1])(R1) returns to fastpath, args: 8, res: 8, upd: 8;
// Sp bump moved to here so it's separate from "fastpath"
Sp = Sp + 8;
goto fastpath; // ***
//fastpath: // Formerly c3aO
// if (R1 & 7 >= 2) goto c3aW; else goto c3aX;
//c3aW:
// R1 = P64[R1 + 6] & (-8);
// call (I64[R1])(R1) args: 8, res: 0, upd: 8;
//c3aX:
// R1 = PicBaseReg + lvl_r39S_closure;
// call (I64[R1])(R1) args: 8, res: 0, upd: 8;
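
The same layout idea can be sketched in C. This is only a rough model, not GHC output: the Value and Closure types, force(), and the enter_* functions are hypothetical stand-ins, and __builtin_expect is a GCC/Clang builtin that only hints at the layout; the compiler still decides the final block order. The point it illustrates is that the unevaluated case and the "tag >= 2" case are marked unlikely, so the c3aX-like path can be reached by falling through.

```c
#include <assert.h>
#include <stddef.h>

/* GCC/Clang builtin: hint that the condition is usually false. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

typedef struct Closure Closure;
struct Closure {
    int (*entry)(Closure *self);  /* entry code, like I64[R1] */
    int payload;
};

/* Hypothetical value: tag 0 = unevaluated, 1 = first constructor,
   >= 2 = a constructor that carries a closure in `field`. */
typedef struct {
    unsigned tag;
    Closure *field;
} Value;

static int enter_payload(Closure *self) { return self->payload; }
static int enter_lvl(Closure *self) { (void)self; return -1; }

/* Plays the role of lvl_r39S_closure: a statically known closure. */
static Closure lvl_closure = { enter_lvl, 0 };

/* Stand-in for the slow path: "evaluates" the value to tag 1. */
static void force(Value *v) { v->tag = 1; }

int foo(Value *v) {
    assert(v != NULL);
    if (UNLIKELY((v->tag & 7u) == 0)) {
        force(v);                          /* slowpath, hinted cold */
    }
    /* fastpath */
    if (UNLIKELY((v->tag & 7u) >= 2)) {
        /* c3aW-like: the call target is loaded from the heap (dynamic) */
        assert(v->field != NULL);
        return v->field->entry(v->field);
    }
    /* c3aX-like: the call target is always lvl_closure's entry (fixed) */
    return lvl_closure.entry(&lvl_closure);
}
```

Calling foo on a Value with tag 1 goes straight down the fall-through path; a Value with tag 0 takes the force() detour first and then rejoins the same path.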
[1]: Intel 64 and IA-32 Architectures Optimization Reference Manual
     3.4 OPTIMIZING THE FRONT END
     2.3.2.3 Branch Prediction
I'm just studying lazy evaluation and drawing illustrations of it.
This thread is helpful to me :)
Regards,
Takenobu