1. Small tweaks: The Cmm code above seems to be betting that the thunk is unevaluated, because it does the stack check and the stack write before the predicate test that checks whether the thunk is evaluated (if (R1 & 7 != 0) goto c3aO; else goto c3aP;).  With a bang-pattern function, couldn't it make the opposite bet?  That is, branch first on whether the thunk is evaluated, so that in the already-evaluated case the only wasted work is a single correctly predicted branch (and a read of a tag that we need to read anyway).
Oh, one further change would be needed for this tweak.  In the generated code above, "Sp = Sp + 8;" happens late, but I think it could happen right after the call to the thunk.  In general, does it seem feasible to separate the slow path from the fast path, as in the following tweak of the example Cmm?


  // Skip to the chase if it's already evaluated:
  start:
      R1 = R2;   // fastpath reads R1, so copy the argument before branching
      if (R1 & 7 != 0) goto fastpath; else goto slowpath;

  slowpath:   // Formerly c3aY
      if ((Sp + -8) < SpLim) goto c3aZ; else goto c3b0;
  c3aZ:
      // nop
      R1 = PicBaseReg + foo_closure;
      call (I64[BaseReg - 8])(R2, R1) args: 8, res: 0, upd: 8;
  c3b0:
      I64[Sp - 8] = PicBaseReg + block_c3aO_info;
      R1 = R2;
      Sp = Sp - 8;

      call (I64[R1])(R1) returns to fastpath, args: 8, res: 8, upd: 8;
      // Sp bump moved to here so it's separate from "fastpath"
      Sp = Sp + 8;

  fastpath: // Formerly c3aO
      if (R1 & 7 >= 2) goto c3aW; else goto c3aX;
  c3aW:
      R1 = P64[R1 + 6] & (-8);
      call (I64[R1])(R1) args: 8, res: 0, upd: 8;
  c3aX:
      R1 = PicBaseReg + lvl_r39S_closure;
      call (I64[R1])(R1) args: 8, res: 0, upd: 8;
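
To make the control flow I'm asking about concrete outside of Cmm, here is a rough C model of the "tag first" ordering. It is only a sketch under my own assumptions: the Closure layout, the names force_tag_first, tag_ptr and eval_to_con1, and the slow_path_entries counter are all hypothetical and are not GHC's actual RTS representation. It models only the low-three-bits pointer tag and the decision to enter the closure; the stack check and the return-frame push are represented by a comment.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_MASK ((uintptr_t)7)   /* low 3 bits of an 8-byte-aligned pointer */

typedef struct Closure Closure;
typedef uintptr_t (*EntryFn)(Closure *self);

/* Hypothetical closure layout: entry code plus one payload word. */
struct Closure {
    _Alignas(8) EntryFn entry;    /* alignment keeps the tag bits free */
    int payload;
};

/* Counts how often the slow path (entering the closure) is taken. */
static int slow_path_entries = 0;

/* Attach a constructor tag (1..7) to an aligned, untagged closure pointer. */
static uintptr_t tag_ptr(Closure *c, unsigned tag) {
    assert(tag >= 1 && tag <= TAG_MASK);
    assert(((uintptr_t)c & TAG_MASK) == 0);
    return (uintptr_t)c | tag;
}

/* Hypothetical entry code: "evaluates" and returns itself tagged as constructor 1. */
static uintptr_t eval_to_con1(Closure *self) {
    return tag_ptr(self, 1);
}

/* Tag-first forcing: test the tag before doing any slow-path work. */
static unsigned force_tag_first(uintptr_t p) {
    if ((p & TAG_MASK) != 0) {
        /* Fast path: already evaluated, only the tag read and one branch. */
        return (unsigned)(p & TAG_MASK);
    }
    /* Slow path: the stack check and return-frame push would live here. */
    slow_path_entries++;
    Closure *c = (Closure *)p;
    p = c->entry(c);
    return (unsigned)(p & TAG_MASK);
}
```

Passing an untagged pointer takes the slow path once and comes back with the constructor tag; passing an already tagged pointer never touches the slow-path code at all, which is the behaviour I'd hope the bang-pattern case could get.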