
Alexis King
Hi all,
As some of you are likely aware, I have an open GHC proposal[1] to add native support for delimited continuations to the RTS. I also have an incomplete implementation,[2] and the only major remaining obstacle concerns async exceptions. The issue is subtle, so I’ve tried to provide all the necessary context in this email. If you’re already familiar with the ideas at play, you can skip the context about how delimited continuations work.
...
This tiny optimization actually allows not one but two savings:
1. The immediate benefit is that we save a stack frame.
2. The hidden benefit is that we never need to push the old exception masking state onto the stack.
If we had multiple mask_ frames on the stack simultaneously, we wouldn’t know what to do when returning: should we unmask them, or should they stay masked? We’d need to push that information onto the stack in the same way we must push a return address when calling a function.
By skipping these redundant stack frames, we can always be certain the right thing to do on return is to unmask async exceptions. No need to store anything else.
Indeed. However, I don't think it would be that expensive to store the masking state with a suitable implementation strategy. Specifically, don't add a "masking state" field to the return frame. Rather, I think you want to have `mask` push one of two possible return frame types: * in the case that we are already masked push an "already masked" frame. The entry code for this would be a no-op. * in the case that we aren't currently masked push the usual stg_maskAsyncExceptionszh_ret_info return frame. This incurs minimal overhead while preserving the information that you need. We avoid adding a word to the stack frame (likely saving an instruction) and in cases where today the optimsation fires the user would only pay for a return to a trivial entry frame. It seems to me that this is as close to free as we will get. Under this scheme `control` can simply look for "already masked" frames when copying the stack.
(This explanation is a slight simplification, especially once maskUninterruptible comes into play, but it’s close enough.)
Now you may see the looming problem: this strategy completely breaks down in the presence of delimited continuations. With delimited continuations, we might have a program like this:
mask_ $ prompt $ mask_ $ ...
If we capture a continuation up to this prompt, we expect the inner mask_ frame to be captured along with it. But that won’t happen if we never pushed a mask_ frame at all due to the aforementioned optimization.
So now I can finally state my question: what is the right solution for this? I see three obvious possible ways forward:
1. Keep track of whether or not we’re inside a prompt and skip the optimization in that case.
2. Keep some bookkeeping that tracks the modifications to the async exception masking state since the most recently pushed prompt.
3. Just don’t bother with the optimization at all.
Option 3 certainly seems the most appealing from a simplicity point of view, and I suspect the optimization doesn’t matter much in practice. Why? Because the real `mask` implementation from Control.Exception already avoids re-masking exceptions if they’re masked! (And that’s okay, because prompt# and control0# are not intended to be used directly in IO, so code that uses them can provide its own version of `mask`.) However, it is admittedly possible for the restore action passed to the argument of `mask` to create redundant calls, as the check is only performed in `mask` itself.
Is eliminating this optimization an acceptable compromise? Or is there reason to believe this is important for performance of real programs?
It never hurts to measure. Perhaps establish the worst-case performance impact by looking at a simple program which just masks repeatedly. For what it's worth, I rather doubt that this optimisation is going to make or break any program. Exception operations in general are somewhat rare in my experience and the cost of pushing a stack frame (or, perhaps more importantly on modern hardware, returning through it) should be reasonably cheap. By the way, when you go to implement this do add a few comments to the masking primitives. It took a bit of staring to work out what was going on in there. Cheers, - Ben