Transparently hooking into the STG stack to validate an escape analysis

Hi Devs, my master's student Sebastian and I (also Sebastian :)) are working on an escape analysis in STG, see https://gitlab.haskell.org/ghc/ghc/-/issues/16891#note_347903. We have a prototype for the escape analysis that we want to validate/exploit now. The original plan was to write the transformation that allocates non-escaping objects on the STG stack. But that is quite tricky for many reasons, one of them being treatment of the GC. This mail is rather lengthy, so I suggest you skip to "What we hope could work" and see if you can answer it without the context I provide below. If you can't, I would be very grateful if you were willing to suffer through the exposition. # Instrumentation So instead we thought about doing a (easily changed and thus versatile) instrumentation-based approach: Assign a sequence number to every instantiation (a term I will use to mean "allocation of g's closure") of things that we our analysis determines as escaping or non-escaping, such as STG's let bindings (focusing on non-let-no-escape functions for now). (One sequence number *per allocation* of the let binding's closure, not based on its syntactic identity.) Then, we set a bit in a (dynamically growing) global bit vector whenever the let "RHS is entered" and then unset it when we "leave the let-body". Example: f = \x y -> let { g = [y] \z -> y + z; } in g x Here, our analysis could see that no instantiation (which I say instead of "allocation of g's closure") of g will ever escape its scope within f. Our validation would give a fresh sequence number to the instantiation of g whenever f is called and store it in g's closure (which we arrange by piggy-backing on -prof and adding an additional field to the profiling header). Then, when g's RHS is entered, we set the bit in the global bit vector, indicating "this instantiation of g might escape". After leaving the RHS of g, we also leave the body of the defining let, which means we unset the bit in the bit vector, meaning "every use so far wasn't in an escaping scenario". So far so good. Modifying the code upon entering g takes a bit of tinkering but can be done by building on TickyTicky in StgToCmm. But what is not done so easily is inserting the continuation upon entering the let that will unset the bit! # What doesn't work: Modifying the Sequel At first, we tried to modify the sequel https://gitlab.haskell.org/ghc/ghc/-/blob/4c6985cc91b9cf89b393f69c9a603e8df0... of the let-body to an `AssignTo`. That requires us to know the registers in which the let-body will return its results, which in turn means we have to know the representation of those results, so we have to write a function `stgExprPrimRep :: GenStgExpr p -> [PrimRep]`. Urgh! We were very surprised that there was no such function. And while we tested our function, we very soon knew why. Consider the following pattern synonym matcher: GHC.Natural.$mNatJ# :: forall {rep :: GHC.Types.RuntimeRep} {r :: TYPE rep}. GHC.Num.Natural.Natural -> (GHC.Num.BigNat.BigNat -> r) -> ((# #) -> r) -> r = {} \r [scrut_sBE cont_sBF fail_sBG] case scrut_sBE of { GHC.Num.Natural.NS _ -> fail_sBG GHC.Prim.(##); GHC.Num.Natural.NB ds_sBJ -> let { sat_sBK :: GHC.Num.BigNat.BigNat = CCCS GHC.Num.BigNat.BN#! [ds_sBJ]; } in cont_sBF sat_sBK; }; Note how its result is representation-polymorphic! It only works because our machine implementation allows tail-calls. It's obvious in hindsight that we could never write `stgExprPrimRep` in such a way that it will work on the expression `cont_sBF sat_sBK`. So the sequel approach seems infeasible. # What we hope could work: A special stack frame The other alternative would be to insert a special continuation frame on the stack when we enter the let-body (inspired by stg_restore_cccs). This continuation frame would simply push all registers (FP regs, GP regs, Vec regs, ...) to the C stack, do its work (unsetting the bit), then pop all registers again and jump to the topmost continuation on the STG stack. Example: f :: forall rep (r :: TYPE rep). Int# -> (Int# -> r) -> r f = \x g -> let { h = [x] \a -> x + a; } in case h x of b { __DEFAULT -> g b } We are only interested in unsetting the bit for h here. Consider the stack when entering the body of h. caller_of_f_cont_info <- Sp Now push our special continuation frame: caller_of_f_cont_info seq_h unset_bit_stk_info <- Sp E.g., the stack frame contains the info pointer and the sequence number. (Btw., I hope I got the stack layout about right and this is even possible) Then, after we entered the continuation of the __DEFAULT alt, we do a jump to g. Plot twist: g returns an unboxed 8-tuple of `Int#`s (as caller_of_f_cont_info knows, but f certainly doesn't!), so before it returns it will push two args on the stack (correct?): caller_of_f_cont_info seq_h unset_bit_stk_info unboxed tuple component 7 unboxed tuple component 8 <- Sp And then `g` jumps directly to the entry code for `unset_bit_stk_info` (which does the register saving I mentioned), which absolutely can't figure out from Sp alone where seq_h is. Darn! I think Luite's recent work on the StgToByteCode had to work around similar issues, I found this wiki page https://gitlab.haskell.org/ghc/ghc-wiki-mirror/-/blob/master/Unboxed-Tuples-.... But we aren't in a position where we know the representation of `r` *at all*! So our idea was to scan the stack, beginning from `Sp`, until I find `unset_bit_stk_info`, giving us the following situation: caller_of_f_cont_info seq_h unset_bit_stk_info <- Bp unboxed tuple component 7 unboxed tuple component 8 <- Sp I suggestively named the register in which we store the result Bp in analogy to the traditional base pointer. This information would be enough to unset the bit at index seq_h and then copy the unboxed tuple components 7 and 8 up by two words: caller_of_f_cont_info unboxed tuple component 7 unboxed tuple component 8 <- Sp Then jump to caller_of_f_cont_info, which knows what to make of the stack and the register state. The stack scanning is horrible and too brittle and slow for a production setting, as any of the unboxed tuple components could have the same bit pattern as `unset_bit_stk_info`. We simply hope that is never the case and it's fine for the purposes of a quick-and-dirty instrumentation. QUESTION 1: What do you think? Could this work? Can you anticipate pit falls of this approach? # What about Thunks and StgRhsCon? QUESTION 2: The instrumentation as described won't work for Thunks (which are only entered once) and constructor applications (like sat_sBK in the BigNat matcher above). Not sure how to change that without incurring huge performance hits (by deactivating memoisation). Ideas are welcome here. Thanks for reading this far. I hope you could follow and are now fizzing with excitement because you have a much better idea of how to do this and will let me know :) Sebastian

Hi,
IMO the Cmm STG machine implementation is just too complex for student
projects. It's not fun to work with at all.
Why did you choose this approach?
IMO the escape analysis development and validation would be much smoother
and fun when you'd use the external STG interpreter.
When you have a solid and working design of your analysis and
transformations then you could implement it in GHC's native backend if it
needs any changes at all.
What do you think?
Do you disagree?
Have you seen my presentation about the stg interpreter?
https://www.youtube.com/watch?v=Ey5OFPkxF_w
Cheers,
Csaba
On Wed, Dec 8, 2021 at 11:20 AM Sebastian Graf
Hi Devs,
my master's student Sebastian and I (also Sebastian :)) are working on an escape analysis in STG, see https://gitlab.haskell.org/ghc/ghc/-/issues/16891#note_347903.
We have a prototype for the escape analysis that we want to validate/exploit now. The original plan was to write the transformation that allocates non-escaping objects on the STG stack. But that is quite tricky for many reasons, one of them being treatment of the GC.
This mail is rather lengthy, so I suggest you skip to "What we hope could work" and see if you can answer it without the context I provide below. If you can't, I would be very grateful if you were willing to suffer through the exposition.
# Instrumentation
So instead we thought about doing a (easily changed and thus versatile) instrumentation-based approach: Assign a sequence number to every instantiation (a term I will use to mean "allocation of g's closure") of things that we our analysis determines as escaping or non-escaping, such as STG's let bindings (focusing on non-let-no-escape functions for now). (One sequence number *per allocation* of the let binding's closure, not based on its syntactic identity.) Then, we set a bit in a (dynamically growing) global bit vector whenever the let "RHS is entered" and then unset it when we "leave the let-body". Example:
f = \x y -> let { g = [y] \z -> y + z; } in g x
Here, our analysis could see that no instantiation (which I say instead of "allocation of g's closure") of g will ever escape its scope within f. Our validation would give a fresh sequence number to the instantiation of g whenever f is called and store it in g's closure (which we arrange by piggy-backing on -prof and adding an additional field to the profiling header). Then, when g's RHS is entered, we set the bit in the global bit vector, indicating "this instantiation of g might escape". After leaving the RHS of g, we also leave the body of the defining let, which means we unset the bit in the bit vector, meaning "every use so far wasn't in an escaping scenario".
So far so good. Modifying the code upon entering g takes a bit of tinkering but can be done by building on TickyTicky in StgToCmm. But what is not done so easily is inserting the continuation upon entering the let that will unset the bit!
# What doesn't work: Modifying the Sequel
At first, we tried to modify the sequel https://gitlab.haskell.org/ghc/ghc/-/blob/4c6985cc91b9cf89b393f69c9a603e8df0... of the let-body to an `AssignTo`. That requires us to know the registers in which the let-body will return its results, which in turn means we have to know the representation of those results, so we have to write a function `stgExprPrimRep :: GenStgExpr p -> [PrimRep]`. Urgh! We were very surprised that there was no such function. And while we tested our function, we very soon knew why. Consider the following pattern synonym matcher:
GHC.Natural.$mNatJ# :: forall {rep :: GHC.Types.RuntimeRep} {r :: TYPE rep}. GHC.Num.Natural.Natural -> (GHC.Num.BigNat.BigNat -> r) -> ((# #) -> r) -> r = {} \r [scrut_sBE cont_sBF fail_sBG] case scrut_sBE of { GHC.Num.Natural.NS _ -> fail_sBG GHC.Prim.(##); GHC.Num.Natural.NB ds_sBJ -> let { sat_sBK :: GHC.Num.BigNat.BigNat = CCCS GHC.Num.BigNat.BN#! [ds_sBJ]; } in cont_sBF sat_sBK; };
Note how its result is representation-polymorphic! It only works because our machine implementation allows tail-calls. It's obvious in hindsight that we could never write `stgExprPrimRep` in such a way that it will work on the expression `cont_sBF sat_sBK`. So the sequel approach seems infeasible.
# What we hope could work: A special stack frame
The other alternative would be to insert a special continuation frame on the stack when we enter the let-body (inspired by stg_restore_cccs). This continuation frame would simply push all registers (FP regs, GP regs, Vec regs, ...) to the C stack, do its work (unsetting the bit), then pop all registers again and jump to the topmost continuation on the STG stack. Example:
f :: forall rep (r :: TYPE rep). Int# -> (Int# -> r) -> r f = \x g -> let { h = [x] \a -> x + a; } in case h x of b { __DEFAULT -> g b }
We are only interested in unsetting the bit for h here. Consider the stack when entering the body of h.
caller_of_f_cont_info <- Sp
Now push our special continuation frame:
caller_of_f_cont_info seq_h unset_bit_stk_info <- Sp
E.g., the stack frame contains the info pointer and the sequence number. (Btw., I hope I got the stack layout about right and this is even possible) Then, after we entered the continuation of the __DEFAULT alt, we do a jump to g. Plot twist: g returns an unboxed 8-tuple of `Int#`s (as caller_of_f_cont_info knows, but f certainly doesn't!), so before it returns it will push two args on the stack (correct?):
caller_of_f_cont_info seq_h unset_bit_stk_info unboxed tuple component 7 unboxed tuple component 8 <- Sp
And then `g` jumps directly to the entry code for `unset_bit_stk_info` (which does the register saving I mentioned), which absolutely can't figure out from Sp alone where seq_h is. Darn! I think Luite's recent work on the StgToByteCode had to work around similar issues, I found this wiki page https://gitlab.haskell.org/ghc/ghc-wiki-mirror/-/blob/master/Unboxed-Tuples-.... But we aren't in a position where we know the representation of `r` *at all*!
So our idea was to scan the stack, beginning from `Sp`, until I find `unset_bit_stk_info`, giving us the following situation:
caller_of_f_cont_info seq_h unset_bit_stk_info <- Bp unboxed tuple component 7 unboxed tuple component 8 <- Sp
I suggestively named the register in which we store the result Bp in analogy to the traditional base pointer. This information would be enough to unset the bit at index seq_h and then copy the unboxed tuple components 7 and 8 up by two words:
caller_of_f_cont_info unboxed tuple component 7 unboxed tuple component 8 <- Sp
Then jump to caller_of_f_cont_info, which knows what to make of the stack and the register state.
The stack scanning is horrible and too brittle and slow for a production setting, as any of the unboxed tuple components could have the same bit pattern as `unset_bit_stk_info`. We simply hope that is never the case and it's fine for the purposes of a quick-and-dirty instrumentation.
QUESTION 1: What do you think? Could this work? Can you anticipate pit falls of this approach?
# What about Thunks and StgRhsCon?
QUESTION 2: The instrumentation as described won't work for Thunks (which are only entered once) and constructor applications (like sat_sBK in the BigNat matcher above). Not sure how to change that without incurring huge performance hits (by deactivating memoisation). Ideas are welcome here.
Thanks for reading this far. I hope you could follow and are now fizzing with excitement because you have a much better idea of how to do this and will let me know :)
Sebastian _______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Hey Csaba,
After catching up on your talk and reflecting about it a bit, it seems
quite obvious that your tool is the right way to collect the data and
validate our analysis!
Even if meanwhile we decided that a "transparent stack frame" (which I
believe is something similar to what you are doing here
https://github.com/grin-compiler/ghc-whole-program-compiler-project/blob/d3e...,
with an explicit `argCount` which we do not know) is not the ideal
solution we've been looking for (for different reasons).
Essentially, we have two things
An escape analysis, implemented as an STG-to-STG pass that attaches a
boolean flag to each Id whether it "escapes its scope" (for a suitable
definition of that).
We'd like to implement it in a way that it would be reusable within GHC
with moderate effort (e.g., renaming Binder to Id or accounting for
different fields), operating on a module at a time rather than whole
program.The instrumentation that tries to measure how many heap objects
could be allocated on the stack. E.g., set a closure-specific flag
whenever the closure is entered, unset that bit (once) when we "leave
the scope" that defines the closure.
If my understanding is right, we could just implement this
"instrumentation" as a simple extra field to Closure
https://github.com/grin-compiler/ghc-whole-program-compiler-project/blob/d3e...,
right? Neat!
A bit tangential: I see that your interpreter currently allocates a
fresh closure for let-no-escapes
https://github.com/grin-compiler/ghc-whole-program-compiler-project/blob/d3e...
when it could just re-use the closure of its defining scope. That would
skew our numbers somewhat compared to instrumenting GHC-compiled
programs, but I think we'd be able to work around that. I also wonder if
the semantics of your let-no-escapes are actually as typically specified
(operationally), in that a jump to a let-no-escape should also reset the
stack pointer. It should hardly matter for the programs that GHC
generates, though.
I would also be interested in knowing whether the +RTS -s "bytes
allocated in the heap" metric (very nearly) coincides with a similar
metric you could produce. It would be fantastic if that was the case!
Theoretically, that should be possible, right?
I think your interpreter work is very valuable to collect data we
otherwise would only be able to measure with a TickyTicky-based
approach. Nice!
Another, similar use case would be to identify the fraction of closures
that are only entered once. I remember that there was a ticky-based
patch with which Joachim used to measure this fraction
https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/compiler/demand#instru...
(and similarly validate the analysis results), but unfortunately it
couldn't end up in master. Ah, yes, we have a whole open ticket about
it: #10613 https://gitlab.haskell.org/ghc/ghc/-/issues/10613. In fact,
that instrumentation is also somewhat similar (adding a field to every
closure) as what we want to do.
Anyway, it seems like your work will be very valuable in replacing some
of the annoying ticky-based instrumentation ideas!
Maybe we can have a call some time this or next week to discuss details,
once Sebastian and I are more familiar with the code base?
Thanks for sticking with the project and doing all the hard work that
can build upon!
Sebastian
------ Originalnachricht ------
Von: "Csaba Hruska"
Hi,
IMO the Cmm STG machine implementation is just too complex for student projects. It's not fun to work with at all. Why did you choose this approach? IMO the escape analysis development and validation would be much smoother and fun when you'd use the external STG interpreter. When you have a solid and working design of your analysis and transformations then you could implement it in GHC's native backend if it needs any changes at all.
What do you think? Do you disagree?
Have you seen my presentation about the stg interpreter? https://www.youtube.com/watch?v=Ey5OFPkxF_w
Cheers, Csaba
On Wed, Dec 8, 2021 at 11:20 AM Sebastian Graf
wrote: Hi Devs,
my master's student Sebastian and I (also Sebastian :)) are working on an escape analysis in STG, see https://gitlab.haskell.org/ghc/ghc/-/issues/16891#note_347903.
We have a prototype for the escape analysis that we want to validate/exploit now. The original plan was to write the transformation that allocates non-escaping objects on the STG stack. But that is quite tricky for many reasons, one of them being treatment of the GC.
This mail is rather lengthy, so I suggest you skip to "What we hope could work" and see if you can answer it without the context I provide below. If you can't, I would be very grateful if you were willing to suffer through the exposition.
# Instrumentation
So instead we thought about doing a (easily changed and thus versatile) instrumentation-based approach: Assign a sequence number to every instantiation (a term I will use to mean "allocation of g's closure") of things that we our analysis determines as escaping or non-escaping, such as STG's let bindings (focusing on non-let-no-escape functions for now). (One sequence number *per allocation* of the let binding's closure, not based on its syntactic identity.) Then, we set a bit in a (dynamically growing) global bit vector whenever the let "RHS is entered" and then unset it when we "leave the let-body". Example:
f = \x y -> let { g = [y] \z -> y + z; } in g x
Here, our analysis could see that no instantiation (which I say instead of "allocation of g's closure") of g will ever escape its scope within f. Our validation would give a fresh sequence number to the instantiation of g whenever f is called and store it in g's closure (which we arrange by piggy-backing on -prof and adding an additional field to the profiling header). Then, when g's RHS is entered, we set the bit in the global bit vector, indicating "this instantiation of g might escape". After leaving the RHS of g, we also leave the body of the defining let, which means we unset the bit in the bit vector, meaning "every use so far wasn't in an escaping scenario".
So far so good. Modifying the code upon entering g takes a bit of tinkering but can be done by building on TickyTicky in StgToCmm. But what is not done so easily is inserting the continuation upon entering the let that will unset the bit!
# What doesn't work: Modifying the Sequel
At first, we tried to modify the sequel https://gitlab.haskell.org/ghc/ghc/-/blob/4c6985cc91b9cf89b393f69c9a603e8df0... of the let-body to an `AssignTo`. That requires us to know the registers in which the let-body will return its results, which in turn means we have to know the representation of those results, so we have to write a function `stgExprPrimRep :: GenStgExpr p -> [PrimRep]`. Urgh! We were very surprised that there was no such function. And while we tested our function, we very soon knew why. Consider the following pattern synonym matcher:
GHC.Natural.$mNatJ# :: forall {rep :: GHC.Types.RuntimeRep} {r :: TYPE rep}. GHC.Num.Natural.Natural -> (GHC.Num.BigNat.BigNat -> r) -> ((# #) -> r) -> r = {} \r [scrut_sBE cont_sBF fail_sBG] case scrut_sBE of { GHC.Num.Natural.NS _ -> fail_sBG GHC.Prim.(##); GHC.Num.Natural.NB ds_sBJ -> let { sat_sBK :: GHC.Num.BigNat.BigNat = CCCS GHC.Num.BigNat.BN#! [ds_sBJ]; } in cont_sBF sat_sBK; };
Note how its result is representation-polymorphic! It only works because our machine implementation allows tail-calls. It's obvious in hindsight that we could never write `stgExprPrimRep` in such a way that it will work on the expression `cont_sBF sat_sBK`. So the sequel approach seems infeasible.
# What we hope could work: A special stack frame
The other alternative would be to insert a special continuation frame on the stack when we enter the let-body (inspired by stg_restore_cccs). This continuation frame would simply push all registers (FP regs, GP regs, Vec regs, ...) to the C stack, do its work (unsetting the bit), then pop all registers again and jump to the topmost continuation on the STG stack. Example:
f :: forall rep (r :: TYPE rep). Int# -> (Int# -> r) -> r f = \x g -> let { h = [x] \a -> x + a; } in case h x of b { __DEFAULT -> g b }
We are only interested in unsetting the bit for h here. Consider the stack when entering the body of h.
caller_of_f_cont_info <- Sp
Now push our special continuation frame:
caller_of_f_cont_info seq_h unset_bit_stk_info <- Sp
E.g., the stack frame contains the info pointer and the sequence number. (Btw., I hope I got the stack layout about right and this is even possible) Then, after we entered the continuation of the __DEFAULT alt, we do a jump to g. Plot twist: g returns an unboxed 8-tuple of `Int#`s (as caller_of_f_cont_info knows, but f certainly doesn't!), so before it returns it will push two args on the stack (correct?):
caller_of_f_cont_info seq_h unset_bit_stk_info unboxed tuple component 7 unboxed tuple component 8 <- Sp
And then `g` jumps directly to the entry code for `unset_bit_stk_info` (which does the register saving I mentioned), which absolutely can't figure out from Sp alone where seq_h is. Darn! I think Luite's recent work on the StgToByteCode had to work around similar issues, I found this wiki page https://gitlab.haskell.org/ghc/ghc-wiki-mirror/-/blob/master/Unboxed-Tuples-.... But we aren't in a position where we know the representation of `r` *at all*!
So our idea was to scan the stack, beginning from `Sp`, until I find `unset_bit_stk_info`, giving us the following situation:
caller_of_f_cont_info seq_h unset_bit_stk_info <- Bp unboxed tuple component 7 unboxed tuple component 8 <- Sp
I suggestively named the register in which we store the result Bp in analogy to the traditional base pointer. This information would be enough to unset the bit at index seq_h and then copy the unboxed tuple components 7 and 8 up by two words:
caller_of_f_cont_info unboxed tuple component 7 unboxed tuple component 8 <- Sp
Then jump to caller_of_f_cont_info, which knows what to make of the stack and the register state.
The stack scanning is horrible and too brittle and slow for a production setting, as any of the unboxed tuple components could have the same bit pattern as `unset_bit_stk_info`. We simply hope that is never the case and it's fine for the purposes of a quick-and-dirty instrumentation.
QUESTION 1: What do you think? Could this work? Can you anticipate pit falls of this approach?
# What about Thunks and StgRhsCon?
QUESTION 2: The instrumentation as described won't work for Thunks (which are only entered once) and constructor applications (like sat_sBK in the BigNat matcher above). Not sure how to change that without incurring huge performance hits (by deactivating memoisation). Ideas are welcome here.
Thanks for reading this far. I hope you could follow and are now fizzing with excitement because you have a much better idea of how to do this and will let me know :)
Sebastian _______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Thanks for the feedback!
Let's have a video meeting. My schedule is flexible. What time is best for
you?
Cheers,
Csaba
On Thu, Dec 16, 2021 at 5:29 PM Sebastian Graf
Hey Csaba,
After catching up on your talk and reflecting about it a bit, it seems quite obvious that your tool is the right way to collect the data and validate our analysis! Even if meanwhile we decided that a "transparent stack frame" (which I believe is something similar to what you are doing here https://github.com/grin-compiler/ghc-whole-program-compiler-project/blob/d3e..., with an explicit `argCount` which we do not know) is not the ideal solution we've been looking for (for different reasons).
Essentially, we have two things
1. An escape analysis, implemented as an STG-to-STG pass that attaches a boolean flag to each Id whether it "escapes its scope" (for a suitable definition of that). We'd like to implement it in a way that it would be reusable within GHC with moderate effort (e.g., renaming Binder to Id or accounting for different fields), operating on a module at a time rather than whole program. 2. The instrumentation that tries to measure how many heap objects could be allocated on the stack. E.g., set a closure-specific flag whenever the closure is entered, unset that bit (once) when we "leave the scope" that defines the closure. If my understanding is right, we could just implement this "instrumentation" as a simple extra field to Closure https://github.com/grin-compiler/ghc-whole-program-compiler-project/blob/d3e..., right? Neat!
A bit tangential: I see that your interpreter currently allocates a fresh closure for let-no-escapes https://github.com/grin-compiler/ghc-whole-program-compiler-project/blob/d3e... when it could just re-use the closure of its defining scope. That would skew our numbers somewhat compared to instrumenting GHC-compiled programs, but I think we'd be able to work around that. I also wonder if the semantics of your let-no-escapes are actually as typically specified (operationally), in that a jump to a let-no-escape should also reset the stack pointer. It should hardly matter for the programs that GHC generates, though.
I would also be interested in knowing whether the +RTS -s "bytes allocated in the heap" metric (very nearly) coincides with a similar metric you could produce. It would be fantastic if that was the case! Theoretically, that should be possible, right?
I think your interpreter work is very valuable to collect data we otherwise would only be able to measure with a TickyTicky-based approach. Nice! Another, similar use case would be to identify the fraction of closures that are only entered once. I remember that there was a ticky-based patch with which Joachim used to measure this fraction https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/compiler/demand#instru... (and similarly validate the analysis results), but unfortunately it couldn't end up in master. Ah, yes, we have a whole open ticket about it: #10613 https://gitlab.haskell.org/ghc/ghc/-/issues/10613. In fact, that instrumentation is also somewhat similar (adding a field to every closure) as what we want to do.
Anyway, it seems like your work will be very valuable in replacing some of the annoying ticky-based instrumentation ideas!
Maybe we can have a call some time this or next week to discuss details, once Sebastian and I are more familiar with the code base?
Thanks for sticking with the project and doing all the hard work that can build upon! Sebastian
------ Originalnachricht ------ Von: "Csaba Hruska"
An: "Sebastian Graf" Cc: "ghc-devs" ; "Sebastian Scheper" < sebastian.scheper@student.kit.edu> Gesendet: 15.12.2021 16:16:27 Betreff: Re: Transparently hooking into the STG stack to validate an escape analysis Hi,
IMO the Cmm STG machine implementation is just too complex for student projects. It's not fun to work with at all. Why did you choose this approach? IMO the escape analysis development and validation would be much smoother and fun when you'd use the external STG interpreter. When you have a solid and working design of your analysis and transformations then you could implement it in GHC's native backend if it needs any changes at all.
What do you think? Do you disagree?
Have you seen my presentation about the stg interpreter? https://www.youtube.com/watch?v=Ey5OFPkxF_w
Cheers, Csaba
On Wed, Dec 8, 2021 at 11:20 AM Sebastian Graf
wrote: Hi Devs,
my master's student Sebastian and I (also Sebastian :)) are working on an escape analysis in STG, see https://gitlab.haskell.org/ghc/ghc/-/issues/16891#note_347903.
We have a prototype for the escape analysis that we want to validate/exploit now. The original plan was to write the transformation that allocates non-escaping objects on the STG stack. But that is quite tricky for many reasons, one of them being treatment of the GC.
This mail is rather lengthy, so I suggest you skip to "What we hope could work" and see if you can answer it without the context I provide below. If you can't, I would be very grateful if you were willing to suffer through the exposition.
# Instrumentation
So instead we thought about doing a (easily changed and thus versatile) instrumentation-based approach: Assign a sequence number to every instantiation (a term I will use to mean "allocation of g's closure") of things that we our analysis determines as escaping or non-escaping, such as STG's let bindings (focusing on non-let-no-escape functions for now). (One sequence number *per allocation* of the let binding's closure, not based on its syntactic identity.) Then, we set a bit in a (dynamically growing) global bit vector whenever the let "RHS is entered" and then unset it when we "leave the let-body". Example:
f = \x y -> let { g = [y] \z -> y + z; } in g x
Here, our analysis could see that no instantiation (which I say instead of "allocation of g's closure") of g will ever escape its scope within f. Our validation would give a fresh sequence number to the instantiation of g whenever f is called and store it in g's closure (which we arrange by piggy-backing on -prof and adding an additional field to the profiling header). Then, when g's RHS is entered, we set the bit in the global bit vector, indicating "this instantiation of g might escape". After leaving the RHS of g, we also leave the body of the defining let, which means we unset the bit in the bit vector, meaning "every use so far wasn't in an escaping scenario".
So far so good. Modifying the code upon entering g takes a bit of tinkering but can be done by building on TickyTicky in StgToCmm. But what is not done so easily is inserting the continuation upon entering the let that will unset the bit!
# What doesn't work: Modifying the Sequel
At first, we tried to modify the sequel https://gitlab.haskell.org/ghc/ghc/-/blob/4c6985cc91b9cf89b393f69c9a603e8df0... of the let-body to an `AssignTo`. That requires us to know the registers in which the let-body will return its results, which in turn means we have to know the representation of those results, so we have to write a function `stgExprPrimRep :: GenStgExpr p -> [PrimRep]`. Urgh! We were very surprised that there was no such function. And while we tested our function, we very soon knew why. Consider the following pattern synonym matcher:
GHC.Natural.$mNatJ# :: forall {rep :: GHC.Types.RuntimeRep} {r :: TYPE rep}. GHC.Num.Natural.Natural -> (GHC.Num.BigNat.BigNat -> r) -> ((# #) -> r) -> r = {} \r [scrut_sBE cont_sBF fail_sBG] case scrut_sBE of { GHC.Num.Natural.NS _ -> fail_sBG GHC.Prim.(##); GHC.Num.Natural.NB ds_sBJ -> let { sat_sBK :: GHC.Num.BigNat.BigNat = CCCS GHC.Num.BigNat.BN#! [ds_sBJ]; } in cont_sBF sat_sBK; };
Note how its result is representation-polymorphic! It only works because our machine implementation allows tail-calls. It's obvious in hindsight that we could never write `stgExprPrimRep` in such a way that it will work on the expression `cont_sBF sat_sBK`. So the sequel approach seems infeasible.
# What we hope could work: A special stack frame
The other alternative would be to insert a special continuation frame on the stack when we enter the let-body (inspired by stg_restore_cccs). This continuation frame would simply push all registers (FP regs, GP regs, Vec regs, ...) to the C stack, do its work (unsetting the bit), then pop all registers again and jump to the topmost continuation on the STG stack. Example:
f :: forall rep (r :: TYPE rep). Int# -> (Int# -> r) -> r f = \x g -> let { h = [x] \a -> x + a; } in case h x of b { __DEFAULT -> g b }
We are only interested in unsetting the bit for h here. Consider the stack when entering the body of h.
caller_of_f_cont_info <- Sp
Now push our special continuation frame:
caller_of_f_cont_info seq_h unset_bit_stk_info <- Sp
E.g., the stack frame contains the info pointer and the sequence number. (Btw., I hope I got the stack layout about right and this is even possible) Then, after we entered the continuation of the __DEFAULT alt, we do a jump to g. Plot twist: g returns an unboxed 8-tuple of `Int#`s (as caller_of_f_cont_info knows, but f certainly doesn't!), so before it returns it will push two args on the stack (correct?):
caller_of_f_cont_info seq_h unset_bit_stk_info unboxed tuple component 7 unboxed tuple component 8 <- Sp
And then `g` jumps directly to the entry code for `unset_bit_stk_info` (which does the register saving I mentioned), which absolutely can't figure out from Sp alone where seq_h is. Darn! I think Luite's recent work on the StgToByteCode had to work around similar issues, I found this wiki page https://gitlab.haskell.org/ghc/ghc-wiki-mirror/-/blob/master/Unboxed-Tuples-.... But we aren't in a position where we know the representation of `r` *at all*!
So our idea was to scan the stack, beginning from `Sp`, until I find `unset_bit_stk_info`, giving us the following situation:
caller_of_f_cont_info seq_h unset_bit_stk_info <- Bp unboxed tuple component 7 unboxed tuple component 8 <- Sp
I suggestively named the register in which we store the result Bp in analogy to the traditional base pointer. This information would be enough to unset the bit at index seq_h and then copy the unboxed tuple components 7 and 8 up by two words:
caller_of_f_cont_info unboxed tuple component 7 unboxed tuple component 8 <- Sp
Then jump to caller_of_f_cont_info, which knows what to make of the stack and the register state.
The stack scanning is horrible and too brittle and slow for a production setting, as any of the unboxed tuple components could have the same bit pattern as `unset_bit_stk_info`. We simply hope that is never the case and it's fine for the purposes of a quick-and-dirty instrumentation.
QUESTION 1: What do you think? Could this work? Can you anticipate pit falls of this approach?
# What about Thunks and StgRhsCon?
QUESTION 2: The instrumentation as described won't work for Thunks (which are only entered once) and constructor applications (like sat_sBK in the BigNat matcher above). Not sure how to change that without incurring huge performance hits (by deactivating memoisation). Ideas are welcome here.
Thanks for reading this far. I hope you could follow and are now fizzing with excitement because you have a much better idea of how to do this and will let me know :)
Sebastian _______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

(Moving the conversation off the list.)
------ Originalnachricht ------
Von: "Csaba Hruska"
Thanks for the feedback! Let's have a video meeting. My schedule is flexible. What time is best for you?
Cheers, Csaba
On Thu, Dec 16, 2021 at 5:29 PM Sebastian Graf
wrote: Hey Csaba,
After catching up on your talk and reflecting about it a bit, it seems quite obvious that your tool is the right way to collect the data and validate our analysis! Even if meanwhile we decided that a "transparent stack frame" (which I believe is something similar to what you are doing here https://github.com/grin-compiler/ghc-whole-program-compiler-project/blob/d3e..., with an explicit `argCount` which we do not know) is not the ideal solution we've been looking for (for different reasons).
Essentially, we have two things An escape analysis, implemented as an STG-to-STG pass that attaches a boolean flag to each Id whether it "escapes its scope" (for a suitable definition of that). We'd like to implement it in a way that it would be reusable within GHC with moderate effort (e.g., renaming Binder to Id or accounting for different fields), operating on a module at a time rather than whole program.The instrumentation that tries to measure how many heap objects could be allocated on the stack. E.g., set a closure-specific flag whenever the closure is entered, unset that bit (once) when we "leave the scope" that defines the closure. If my understanding is right, we could just implement this "instrumentation" as a simple extra field to Closure https://github.com/grin-compiler/ghc-whole-program-compiler-project/blob/d3e..., right? Neat! A bit tangential: I see that your interpreter currently allocates a fresh closure for let-no-escapes https://github.com/grin-compiler/ghc-whole-program-compiler-project/blob/d3e... when it could just re-use the closure of its defining scope. That would skew our numbers somewhat compared to instrumenting GHC-compiled programs, but I think we'd be able to work around that. I also wonder if the semantics of your let-no-escapes are actually as typically specified (operationally), in that a jump to a let-no-escape should also reset the stack pointer. It should hardly matter for the programs that GHC generates, though.
I would also be interested in knowing whether the +RTS -s "bytes allocated in the heap" metric (very nearly) coincides with a similar metric you could produce. It would be fantastic if that was the case! Theoretically, that should be possible, right?
I think your interpreter work is very valuable to collect data we otherwise would only be able to measure with a TickyTicky-based approach. Nice! Another, similar use case would be to identify the fraction of closures that are only entered once. I remember that there was a ticky-based patch with which Joachim used to measure this fraction https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/compiler/demand#instru... (and similarly validate the analysis results), but unfortunately it couldn't end up in master. Ah, yes, we have a whole open ticket about it: #10613 https://gitlab.haskell.org/ghc/ghc/-/issues/10613. In fact, that instrumentation is also somewhat similar (adding a field to every closure) as what we want to do.
Anyway, it seems like your work will be very valuable in replacing some of the annoying ticky-based instrumentation ideas!
Maybe we can have a call some time this or next week to discuss details, once Sebastian and I are more familiar with the code base?
Thanks for sticking with the project and doing all the hard work that can build upon! Sebastian
------ Originalnachricht ------ Von: "Csaba Hruska"
An: "Sebastian Graf" Cc: "ghc-devs" ; "Sebastian Scheper" Gesendet: 15.12.2021 16:16:27 Betreff: Re: Transparently hooking into the STG stack to validate an escape analysis Hi,
IMO the Cmm STG machine implementation is just too complex for student projects. It's not fun to work with at all. Why did you choose this approach? IMO the escape analysis development and validation would be much smoother and fun when you'd use the external STG interpreter. When you have a solid and working design of your analysis and transformations then you could implement it in GHC's native backend if it needs any changes at all.
What do you think? Do you disagree?
Have you seen my presentation about the stg interpreter? https://www.youtube.com/watch?v=Ey5OFPkxF_w
Cheers, Csaba
On Wed, Dec 8, 2021 at 11:20 AM Sebastian Graf
wrote: Hi Devs,
my master's student Sebastian and I (also Sebastian :)) are working on an escape analysis in STG, see https://gitlab.haskell.org/ghc/ghc/-/issues/16891#note_347903.
We have a prototype for the escape analysis that we want to validate/exploit now. The original plan was to write the transformation that allocates non-escaping objects on the STG stack. But that is quite tricky for many reasons, one of them being treatment of the GC.
This mail is rather lengthy, so I suggest you skip to "What we hope could work" and see if you can answer it without the context I provide below. If you can't, I would be very grateful if you were willing to suffer through the exposition.
# Instrumentation
So instead we thought about doing a (easily changed and thus versatile) instrumentation-based approach: Assign a sequence number to every instantiation (a term I will use to mean "allocation of g's closure") of things that we our analysis determines as escaping or non-escaping, such as STG's let bindings (focusing on non-let-no-escape functions for now). (One sequence number *per allocation* of the let binding's closure, not based on its syntactic identity.) Then, we set a bit in a (dynamically growing) global bit vector whenever the let "RHS is entered" and then unset it when we "leave the let-body". Example:
f = \x y -> let { g = [y] \z -> y + z; } in g x
Here, our analysis could see that no instantiation (which I say instead of "allocation of g's closure") of g will ever escape its scope within f. Our validation would give a fresh sequence number to the instantiation of g whenever f is called and store it in g's closure (which we arrange by piggy-backing on -prof and adding an additional field to the profiling header). Then, when g's RHS is entered, we set the bit in the global bit vector, indicating "this instantiation of g might escape". After leaving the RHS of g, we also leave the body of the defining let, which means we unset the bit in the bit vector, meaning "every use so far wasn't in an escaping scenario".
So far so good. Modifying the code upon entering g takes a bit of tinkering but can be done by building on TickyTicky in StgToCmm. But what is not done so easily is inserting the continuation upon entering the let that will unset the bit!
# What doesn't work: Modifying the Sequel
At first, we tried to modify the sequel https://gitlab.haskell.org/ghc/ghc/-/blob/4c6985cc91b9cf89b393f69c9a603e8df0... of the let-body to an `AssignTo`. That requires us to know the registers in which the let-body will return its results, which in turn means we have to know the representation of those results, so we have to write a function `stgExprPrimRep :: GenStgExpr p -> [PrimRep]`. Urgh! We were very surprised that there was no such function. And while we tested our function, we very soon knew why. Consider the following pattern synonym matcher:
GHC.Natural.$mNatJ# :: forall {rep :: GHC.Types.RuntimeRep} {r :: TYPE rep}. GHC.Num.Natural.Natural -> (GHC.Num.BigNat.BigNat -> r) -> ((# #) -> r) -> r = {} \r [scrut_sBE cont_sBF fail_sBG] case scrut_sBE of { GHC.Num.Natural.NS _ -> fail_sBG GHC.Prim.(##); GHC.Num.Natural.NB ds_sBJ -> let { sat_sBK :: GHC.Num.BigNat.BigNat = CCCS GHC.Num.BigNat.BN#! [ds_sBJ]; } in cont_sBF sat_sBK; };
Note how its result is representation-polymorphic! It only works because our machine implementation allows tail-calls. It's obvious in hindsight that we could never write `stgExprPrimRep` in such a way that it will work on the expression `cont_sBF sat_sBK`. So the sequel approach seems infeasible.
# What we hope could work: A special stack frame
The other alternative would be to insert a special continuation frame on the stack when we enter the let-body (inspired by stg_restore_cccs). This continuation frame would simply push all registers (FP regs, GP regs, Vec regs, ...) to the C stack, do its work (unsetting the bit), then pop all registers again and jump to the topmost continuation on the STG stack. Example:
f :: forall rep (r :: TYPE rep). Int# -> (Int# -> r) -> r f = \x g -> let { h = [x] \a -> x + a; } in case h x of b { __DEFAULT -> g b }
We are only interested in unsetting the bit for h here. Consider the stack when entering the body of h.
caller_of_f_cont_info <- Sp
Now push our special continuation frame:
caller_of_f_cont_info seq_h unset_bit_stk_info <- Sp
E.g., the stack frame contains the info pointer and the sequence number. (Btw., I hope I got the stack layout about right and this is even possible) Then, after we entered the continuation of the __DEFAULT alt, we do a jump to g. Plot twist: g returns an unboxed 8-tuple of `Int#`s (as caller_of_f_cont_info knows, but f certainly doesn't!), so before it returns it will push two args on the stack (correct?):
caller_of_f_cont_info seq_h unset_bit_stk_info unboxed tuple component 7 unboxed tuple component 8 <- Sp
And then `g` jumps directly to the entry code for `unset_bit_stk_info` (which does the register saving I mentioned), which absolutely can't figure out from Sp alone where seq_h is. Darn! I think Luite's recent work on the StgToByteCode had to work around similar issues, I found this wiki page https://gitlab.haskell.org/ghc/ghc-wiki-mirror/-/blob/master/Unboxed-Tuples-.... But we aren't in a position where we know the representation of `r` *at all*!
So our idea was to scan the stack, beginning from `Sp`, until I find `unset_bit_stk_info`, giving us the following situation:
caller_of_f_cont_info seq_h unset_bit_stk_info <- Bp unboxed tuple component 7 unboxed tuple component 8 <- Sp
I suggestively named the register in which we store the result Bp in analogy to the traditional base pointer. This information would be enough to unset the bit at index seq_h and then copy the unboxed tuple components 7 and 8 up by two words:
caller_of_f_cont_info unboxed tuple component 7 unboxed tuple component 8 <- Sp
Then jump to caller_of_f_cont_info, which knows what to make of the stack and the register state.
The stack scanning is horrible and too brittle and slow for a production setting, as any of the unboxed tuple components could have the same bit pattern as `unset_bit_stk_info`. We simply hope that is never the case and it's fine for the purposes of a quick-and-dirty instrumentation.
QUESTION 1: What do you think? Could this work? Can you anticipate pit falls of this approach?
# What about Thunks and StgRhsCon?
QUESTION 2: The instrumentation as described won't work for Thunks (which are only entered once) and constructor applications (like sat_sBK in the BigNat matcher above). Not sure how to change that without incurring huge performance hits (by deactivating memoisation). Ideas are welcome here.
Thanks for reading this far. I hope you could follow and are now fizzing with excitement because you have a much better idea of how to do this and will let me know :)
Sebastian _______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
participants (2)
-
Csaba Hruska
-
Sebastian Graf