Better calling conventions for strict functions (bang patterns)?

Hi all,

With module-level Strict and StrictData pragmas coming soon, one obvious question is what kind of code quality GHC can achieve for strict programs.

When it came up in discussion in our research group we realized we didn't actually know whether bang patterns, `f !x`, on function arguments were enforced by the caller or the callee.

Here's a Gist that shows the compilation of a trivial function:

foo :: Maybe Int -> Int
foo !x =
  case x of
    Just y -> y

https://gist.github.com/rrnewton/1ac722189c65f26fe9ac

If that function is compiled to *assume* its input is in WHNF, it should be just as efficient as the isomorphic MLton/OCaml code, right? It only needs to branch on the tag, do a field dereference, and return.

But as you can see from the STG and CMM generated, foo *does indeed* enter the thunk, adding an extra indirect jump. Here's the body:

c3aY:
    if ((Sp + -8) < SpLim) goto c3aZ; else goto c3b0;
c3aZ:
    // nop
    R1 = PicBaseReg + foo_closure;
    call (I64[BaseReg - 8])(R2, R1) args: 8, res: 0, upd: 8;
c3b0:
    I64[Sp - 8] = PicBaseReg + block_c3aO_info;
    R1 = R2;
    Sp = Sp - 8;
    if (R1 & 7 != 0) goto c3aO; else goto c3aP;
c3aP:
    call (I64[R1])(R1) returns to c3aO, args: 8, res: 8, upd: 8;
c3aO:
    if (R1 & 7 >= 2) goto c3aW; else goto c3aX;
c3aW:
    R1 = P64[R1 + 6] & (-8);
    Sp = Sp + 8;
    call (I64[R1])(R1) args: 8, res: 0, upd: 8;
c3aX:
    R1 = PicBaseReg + lvl_r39S_closure;
    Sp = Sp + 8;
    call (I64[R1])(R1) args: 8, res: 0, upd: 8;

The call inside c3aP is entering "x" as a thunk, which also incurs all of the stack limit check code. I believe that IF the input could be assumed to be in WHNF, everything above the label "c3aO" could be omitted.

So... if GHC is going to be a fabulous pure *and* imperative language, and a fabulous lazy *and* strict compiler/runtime... is there some work we can do here to improve this situation? Would the following make sense:

- Put together a benchmark suite of all-strict programs with Strict/StrictData (compare a few benchmarks' generated code to MLton, if time allows)
- Modify GHC to change calling conventions for bang patterns -- the caller enforces WHNF rather than the callee. Existing strictness/demand/cardinality analysis would stay the same.

Unless there's something I'm really missing here, the result should be that you can have a whole chain of strict function calls, each of which knows its arguments and the arguments it passes to its callees are all in WHNF, without ever generating thunk-entry sequences.

Thanks for your time,
-Ryan
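For reference, a minimal sketch of what the upcoming module-level pragmas amount to (my understanding of the planned GHC 8.0 semantics; the module and names below are made up for illustration):

    {-# LANGUAGE Strict, StrictData #-}
    module Demo where                 -- module name is hypothetical

    -- Under Strict, the argument binding is strict by default,
    -- exactly as if it had been written:  get !x = ...
    get :: Maybe Int -> Int
    get x = case x of
              Just y  -> y
              Nothing -> 0

    -- Under StrictData, both fields behave as if declared:  data P = P !Int !Int
    data P = P Int Int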

It’s absolutely the case that bang patterns etc. tell the caller what to do, but the function CANNOT ASSUME that its argument is evaluated. Reason: higher-order functions.

I think that the way to allow functions that can assume their arg is evaluated is through types: see "Types are calling conventions", http://research.microsoft.com/~simonpj/papers/strict-core/tacc-hs09.pdf. But it’d be a fairly big deal to implement.

Simon
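For intuition, a rough source-level sketch of the "evaluatedness in the type" idea (not from the paper; the Evaluated wrapper below is made up, and current GHC would not exploit the invariant during code generation):

    -- Hypothetical wrapper: the smart constructor guarantees the payload
    -- is in WHNF before it is wrapped.
    newtype Evaluated a = Evaluated a

    evaluated :: a -> Evaluated a
    evaluated x = x `seq` Evaluated x

    -- A function over Evaluated (Maybe Int) can rely, by convention, on the
    -- caller having forced the Maybe already.  (Today's GHC still emits the
    -- usual eval check here; the point of the paper is to let the compiler
    -- exploit such a type-level promise.)
    fooE :: Evaluated (Maybe Int) -> Int
    fooE (Evaluated (Just y)) = y
    fooE (Evaluated Nothing)  = 0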

Ah, yes, so just to give a concrete example in this thread, if we take the
`foo` function above and say `map foo ls`, we may well get unevaluated
arguments to foo. (And this is almost precisely the same as the first
example in that Strict-Core paper!)
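A minimal sketch of that situation (the Nothing case is added here only to keep the example total):

    {-# LANGUAGE BangPatterns #-}

    foo :: Maybe Int -> Int
    foo !x = case x of
               Just y  -> y
               Nothing -> 0

    -- 'map' is an ordinary higher-order function: it knows nothing about
    -- foo's bang pattern, so the list elements it hands to foo may still be
    -- unevaluated thunks, and foo itself must be prepared to enter them.
    callFoo :: [Maybe Int] -> [Int]
    callFoo ls = map foo ls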
Thanks for the paper reference. I read it and it's great -- just what I
was looking for: an approach that eliminates any jealousy of ML/Scheme
compiler techniques vis-à-vis calling conventions ;-). I'm also wondering
if there are some incremental steps that can be taken, short of what is
proposed in the paper.
1. Small tweaks: The CMM code above seems to be *betting* that the thunk
is unevaluated, because it does the stack check and stack write *before* the
predicate test that checks whether the thunk is evaluated (if (R1 & 7 != 0)
goto c3aO; else goto c3aP;). With a bang-pattern function, couldn't it
make the opposite bet? That is, branch on whether the thunk is evaluated
first, and then the wasted computation is only a single correctly predicted
branch (and a read of a tag that we need to read anyway).
2. The option of multiple entrypoints, which is considered and discarded
as fragile at the beginning of the paper (for direct vs. indirect calls /
first-order vs. higher-order). That fragile option is along the lines of what I
wanted to discuss on this thread. It does seem like a tricky phase-ordering
concern, but how bad is it exactly? The conflict with the case-expr rewrite
is illustrated clearly in the paper, but that just means that such
optimizations must happen *before* the final choice of which function
entrypoint to call, doesn't it? I'm not 100% sure where it could go in the
current compiler pipeline, but couldn't the adjustment of the call target
from "foo" to "foo_with_known_whnf_args" happen quite late? (A rough
source-level sketch of the two-entrypoint idea follows below.)
Cheers,
-Ryan
P.S. One of the students CC'd, Ryan Scott, is currently on an internship at
Intel Labs and is working to (hopefully) liberate the Intel Haskell
Research Compiler as open source. Like the 2009 paper, it also uses a
strict IR, and I think it will be interesting to see exactly how it handles
the conversion from Core to its IR. (Probably the same as Fig. 10 in the
paper.)

Oh, a small further addition would be needed for tweak (1) above. In the
generated code above "Sp = Sp + 8;" happens *late*, but I think it could
happen right after the call to the thunk. In general, does it seem feasible
to separate the slowpath from the fastpath as in the following tweak of the
example CMM?

    // Skip to the chase if it's already evaluated:
    start:
        if (R2 & 7 != 0) goto fastpath; else goto slowpath;

    slowpath: // Formerly c3aY
        if ((Sp + -8) < SpLim) goto c3aZ; else goto c3b0;
    c3aZ:
        // nop
        R1 = PicBaseReg + foo_closure;
        call (I64[BaseReg - 8])(R2, R1) args: 8, res: 0, upd: 8;
    c3b0:
        I64[Sp - 8] = PicBaseReg + block_c3aO_info;
        R1 = R2;
        Sp = Sp - 8;
        call (I64[R1])(R1) returns to fastpath, args: 8, res: 8, upd: 8;
        // Sp bump moved to here so it's separate from "fastpath"
        Sp = Sp + 8;

    fastpath: // Formerly c3aO
        if (R1 & 7 >= 2) goto c3aW; else goto c3aX;
    c3aW:
        R1 = P64[R1 + 6] & (-8);
        call (I64[R1])(R1) args: 8, res: 0, upd: 8;
    c3aX:
        R1 = PicBaseReg + lvl_r39S_closure;
        call (I64[R1])(R1) args: 8, res: 0, upd: 8;

Doesn't modern hardware have pretty good branch prediction? In that case
the order of the branches may not matter, unless it's a long chain of calls
versus, say, an inner loop that hasn't been inlined?
Either way, I'd love to stay in the loop on this topic; for work I'm
building a strongly normalizing language that supports both strict and
call-by-need evaluation strategies.

Hi,
I'm interested in execution performance.
Modern hardware (implementing Intel 64, ARMv8, etc.) may be able to predict
a long chain of jumps [1]. But prediction accuracy for indirect jumps is
low, especially for dynamically addressed indirect jumps.
By the way, Ryan's example code could be made faster with the following
optimization (if c3aX is the most common fast path, it is reached without
any taken branch):
    // Skip to the chase if it's already evaluated:
    start:
        // if (R2 & 7 != 0) goto fastpath; else goto slowpath;
        if (R2 & 7 == 0) goto slowpath;     // *** (1) remove branch for fastpath

    fastpath: // Formerly c3aO              // *** (1) move fastpath here
        // if (R1 & 7 >= 2) goto c3aW; else goto c3aX;
        if (R1 & 7 >= 2) goto c3aW;         // *** (2) remove branch for prior path (c3aX)

    c3aX:                                   // *** (2) move else path to here (without branch)
        R1 = PicBaseReg + lvl_r39S_closure;
        call (I64[R1])(R1) args: 8, res: 0, upd: 8;   // *** indirect jump, but fixed address (100% hit)

    c3aW:
        R1 = P64[R1 + 6] & (-8);
        call (I64[R1])(R1) args: 8, res: 0, upd: 8;   // *** indirect jump, dynamic address (hit or miss)

    //c3aX:
    //    R1 = PicBaseReg + lvl_r39S_closure;
    //    call (I64[R1])(R1) args: 8, res: 0, upd: 8;

    slowpath: // Formerly c3aY
        if ((Sp + -8) < SpLim) goto c3aZ; else goto c3b0;
    c3aZ:
        // nop
        R1 = PicBaseReg + foo_closure;
        call (I64[BaseReg - 8])(R2, R1) args: 8, res: 0, upd: 8;
    c3b0:
        I64[Sp - 8] = PicBaseReg + block_c3aO_info;
        R1 = R2;
        Sp = Sp - 8;
        call (I64[R1])(R1) returns to fastpath, args: 8, res: 8, upd: 8;
        // Sp bump moved to here so it's separate from "fastpath"
        Sp = Sp + 8;
        goto fastpath; // ***

    //fastpath: // Formerly c3aO
    //    if (R1 & 7 >= 2) goto c3aW; else goto c3aX;
    //c3aW:
    //    R1 = P64[R1 + 6] & (-8);
    //    call (I64[R1])(R1) args: 8, res: 0, upd: 8;
    //c3aX:
    //    R1 = PicBaseReg + lvl_r39S_closure;
    //    call (I64[R1])(R1) args: 8, res: 0, upd: 8;
[1]: Intel 64 and IA-32 Architectures Optimization Reference Manual
     http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32...
     3.4 Optimizing the Front End
     2.3.2.3 Branch Prediction
I'm just studying lazy evaluation and drawing diagrams about it.
This thread is helpful to me :)
Regards,
Takenobu

Yes, I think we can probably do a better job of compiling case expressions. The current compilation strategy optimises for code size, but at the expense of performance in the fast path. We can tweak this tradeoff, perhaps under the control of a flag.

Ideally the sequence should start by assuming that the closure is already evaluated, e.g.

    loop:
      tag = R2 & 7;
      if (tag == 1) then // code for []
      else if (tag == 2) then // code for (:)
      else evaluate; jump back to loop

The nice thing is that now that we don't need proc points, "loop" is just a label that we can directly jump to. Unfortunately this only works when using the NCG, not with LLVM, because LLVM requires proc points, so we might need to fall back to a different strategy for LLVM.

Similar topics came up here: https://ghc.haskell.org/trac/ghc/ticket/8905 and I think there was another ticket but I can't find it now.

Cheers
Simon
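For concreteness, a made-up Haskell example (not from the thread) of the kind of case expression this compilation scheme targets:

    -- A list consumer: the scrutinee is tag-checked on every iteration, so
    -- putting the "already evaluated" alternatives first pays off inside
    -- the loop.
    len :: [a] -> Int
    len xs = case xs of
               []      -> 0
               (_ : t) -> 1 + len t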

Thanks Simon for linking that issue! Does the patch linked there
https://ghc.haskell.org/trac/ghc/attachment/ticket/8905/0001-EXPERIMENTAL-sa...
already cover what you suggested in your last mail? I think not -- it looks
like a more limited change -- but I'm having trouble understanding exactly
how it differs.
I've also got one really basic question -- was it decided long ago that all
these stack limit checks are cheaper than having a guard page at the end of
each stack and faulting to grow the stack? (I couldn't find a place where
this rationale was described on the wiki.)
Best,
-Ryan

Stacks would be immovable, and you need to arrange that you have enough address space to grow the stack if necessary, just like OS threads. Thus it's not really feasible on 32-bit platforms, and we would have to keep the existing stack-chunk mechanism for those platforms.

Right now stacks start at 1k and are really cheap; they'd be somewhat more expensive under this scheme: at least 4K and a couple of mmap() calls.

Stack chunks have another benefit: we can track whether each chunk is dirty separately, and only traverse dirty chunks in the GC. I don't think you'd be able to do this with contiguous stacks and no stack checks.

Cheers
Simon

Sorry, forgot to respond to your other point. On 02/11/2015 06:37, Ryan Newton wrote:
Does the patch linked there (https://ghc.haskell.org/trac/ghc/attachment/ticket/8905/0001-EXPERIMENTAL-sa...) already cover what you suggested in your last mail?
Correct, it's a more limited change. I don't have a working implementation of the scheme I described; it needs some reorganisation in the codeGen if I remember correctly.

Something I think we should do is to have a gcc-like -Os flag that we could use as a hint in cases where we have a code-size vs. speed tradeoff like this; there are a handful of places we could use that.

Cheers
Simon

Ryan
Yes, I’m sure there is room for improvement here! Good observations.
In particular, while I’d like data to support this, I believe that most ‘eval’s find that the thing being evaluated is already evaluated; so the fast path should be that case. We should bet on "evaluated".
NB, however, that the stack-overflow check applies to the entire function body. Suppose we do

    f x = case x of (p,q) -> case p of True -> False; False -> True

Then we can eliminate the stack check only if both the eval of ‘x’ and the eval of ‘p’ take the fast path.
Another avenue you might like to think about is this: if function f calls g, then g’s stack-overflow test could perhaps be absorbed into f’s. So for

    f x = case (g x) of True -> False; False -> True

maybe f could call a super-fast entry point for g that didn’t have a stack-overflow test!
(Watch out: loops, recursion, etc)
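A made-up example of that caveat: if f and g each assumed the other had already performed the stack check, this mutually recursive pair could grow the stack arbitrarily without ever testing the limit.

    -- Hypothetical illustration (not from the thread): neither function is
    -- a safe place to drop the stack-overflow test, because the non-tail
    -- recursive calls grow the stack on every step.
    f :: Int -> Bool
    f 0 = True
    f n = not (g (n - 1))

    g :: Int -> Bool
    g 0 = False
    g n = not (f (n - 1))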
I’m happy to advise.
Simon

Simon,
I really enjoyed reading this paper... I was wondering if you could comment
on the implementation of Strict Core? Was it ever implemented in GHC (even
as a proof of concept)? If not, was it just due to a lack of time, or some
fundamental limitation or problem discovered after the paper? If it was
implemented, was any benefit actually measured? Can you speculate on
whether some of the more recent changes/additions to Core (such as
coercions and roles) might fit into this? (I don't see any obvious reason
they couldn't, but that's just me.)
Thanks!
Andrew

I’m glad you liked it!
No, I don’t think it was ever implemented. I don’t think coercions will be a problem.
Simon
participants (6)
- Andrew Farmer
- Carter Schonwald
- Ryan Newton
- Simon Marlow
- Simon Peyton Jones
- Takenobu Tani