RE: High-level Cmm code and stack allocation

13 Jan 2014

      Thanks.  Reading what you write below, I can see two possible motivations.

1.  Reduce stack sizes.  
2.  Eliminate memory moves

For (1) do we have any data to show that the non-overlap of areas was giving rise to unacceptably big stacks?

For (2) that is indeed clever, but it's pretty serendipitous: it relies on the overlap being just so, so that coincidentally y gets stored in the same place as it was loaded from.  I imagine that you don't plan the stack layout to cause that to happen; it's just a coincidence.  Do we have any data to show that the coincidence happens with any frequency?

Also, as you note, we lose the opportunity for certain sorts of code motion, perhaps increasing register pressure a lot.  So there is a downside too.

You seldom do things without a very good reason, so I feel I must be missing something.

Simon

| -----Original Message-----
| From: Simon Marlow [mailto:marlowsd@gmail.com]
| Sent: 10 January 2014 17:00
| To: Simon Peyton Jones; Herbert Valerio Riedel
| Cc: ghc-devs@haskell.org
| Subject: Re: High-level Cmm code and stack allocation
| 
| So stack areas are still a great abstraction, the only change is that
| they now overlap.  It's not just about stack getting too big, I've
| copied the notes I made about it below (which I will paste into the code
| in due course).  The nice property that we can generate well-defined Cmm
| without knowing explicit stack offsets is intact.
| 
| What is different is that there used to be an intermediate state where
| live variables were saved to abstract stack areas across calls, but Sp
| was still not manifest.  This intermediate state doesn't exist any more,
| the stack layout algorithm does it all in one pass.  To me this was far
| simpler, and I think it ended up being fewer lines of code than the old
| multi-phase stack layout algorithm (it's also much faster).
| 
| Of course you can always change this.  My goal was to get code that was
| at least as good as the old code generator and in a reasonable amount of
| time, and this was the shortest path I could find to that goal.
| 
| Cheers,
| Simon
| 
| e.g. if we had
| 
|      x = Sp[old + 8]
|      y = Sp[old + 16]
| 
|      Sp[young(L) + 8]  = L
|      Sp[young(L) + 16] = y
|      Sp[young(L) + 24] = x
|      call f() returns to L
| 
| if areas semantically do not overlap, then we might optimise this to
| 
|      Sp[young(L) + 8]  = L
|      Sp[young(L) + 16] = Sp[old + 8]
|      Sp[young(L) + 24] = Sp[old + 16]
|      call f() returns to L
| 
| and now young(L) cannot be allocated at the same place as old, and we
| are doomed to use more stack.
| 
|    - old+8  conflicts with young(L)+8
|    - old+16 conflicts with young(L)+16 and young(L)+8
| 
| so young(L)+8 == old+24 and we get
| 
|      Sp[-8]  = L
|      Sp[-16] = Sp[8]
|      Sp[-24] = Sp[0]
|      Sp -= 24
|      call f() returns to L
| 
| However, if areas are defined to be "possibly overlapping" in the
| semantics, then we cannot commute any loads/stores of old with young(L),
| and we will be able to re-use both old+8 and old+16 for young(L).
| 
|      x = Sp[8]
|      y = Sp[0]
| 
|      Sp[8] = L
|      Sp[0] = y
|      Sp[-8] = x
|      Sp = Sp - 8
|      call f() returns to L
| 
| Now, the assignments of y go away,
| 
|      x = Sp[8]
|      Sp[8] = L
|      Sp[-8] = x
|      Sp = Sp - 8
|      call f() returns to L
| 
| 
| Conclusion:
| 
|    - T[old+N] aliases with U[young(L)+M] for all T, U, L, N and M
|    - T[old+N] aliases with U[old+M] only if the areas actually overlap
| 
| this ensures that we will not commute any accesses to old with
| young(L) or young(L) with young(L'), and the stack allocator will get
| the maximum opportunity to overlap these areas, keeping the stack use to
| a minimum and possibly avoiding some assignments.
| 
| 
| 
| On 10/01/2014 16:35, Simon Peyton Jones wrote:
| > Oh, ok.  Alas, a good chunk of my model of Cmm has just gone out of
| the window.  I thought that areas were such a lovely, well-behaved
| abstraction.  I was thrilled when we came up with them, and I'm very
| sorry to see them go.
| >
| > There are no many things that I no longer understand.  I now have no
| idea how we save live variables over a call, or how multiple returned
| values from one call (returned on the stack) stay right where they are
| if they are live across the next call.
| >
| > What was the actual problem?  That functions used too much stack, so
| the stack was getting too big?  But a one slot area corresponds exactly
| to a live variable, so I don't see how the area abstraction could
| possibly increase stack size.  And is stack size a crucial issue anyway?
| >
| > Apart from anything else, areas would have given a lovely solution to
| the problem this thread started with!
| >
| > I guess we can talk about this when you next visit?  But some
| documentation would be welcome.
| >
| > Simon
| >
| > | -----Original Message-----
| > | From: Simon Marlow [mailto:marlowsd@gmail.com]
| > | Sent: 10 January 2014 16:24
| > | To: Simon Peyton Jones; Herbert Valerio Riedel
| > | Cc: ghc-devs@haskell.org
| > | Subject: Re: High-level Cmm code and stack allocation
| > |
| > | There are no one-slot areas any more, I ditched those when I rewrote
| > | the stack allocator.  There is only ever one live area: either the
| > | old area or the young area for a call we are about to make or have
| just made.
| > | (see the data type: I removed the one-slot areas)
| > |
| > | I struggled for a long time with this.  The problem is that with the
| > | semantics of non-overlapping areas, code motion optimisations would
| > | tend to increase the stack requirements of the function by
| > | overlapping the live ranges of the areas.  I concluded that actually
| > | what we wanted was areas that really do overlap, and optimisations
| > | that respect that, so that we get more efficient stack usage.
| > |
| > | Cheers,
| > | 	Simon
| > |
| > | On 10/01/2014 15:22, Simon Peyton Jones wrote:
| > | > That documentation would be good, yes!  I don't know what it means
| > | > to
| > | say "we don't really have a general concept of areas any more".  We
| > | did before, and I didn't know that it had gone away.  Urk!  We can
| > | have lots of live areas, notably the old area (for the current
| > | call/return parameters, the call area for a call we are preparing,
| > | and the one-slot areas for variables we are saving on the stack.
| > | >
| > | > Here's he current story
| > | > https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/StackAre
| > | > as
| > | >
| > | > I agree that we have no concrete syntax for talking about areas,
| > | > but
| > | that is something we could fix.  But I'm worried that they may not
| > | mean what they used to mean.
| > | >
| > | > Simon
| > | >
| > | > | -----Original Message-----
| > | > | From: Simon Marlow [mailto:marlowsd@gmail.com]
| > | > | Sent: 09 January 2014 08:39
| > | > | To: Simon Peyton Jones; Herbert Valerio Riedel
| > | > | Cc: ghc-devs@haskell.org
| > | > | Subject: Re: High-level Cmm code and stack allocation
| > | > |
| > | > | On 08/01/2014 10:07, Simon Peyton Jones wrote:
| > | > | > | > Can't we just allocate a Cmm "area"? The address of an
| > | > | > | > area is a
| > | > | > | perfectly well-defined Cmm value.
| > | > | >
| > | > | > What about this idea?
| > | > |
| > | > | We don't really have a general concept of areas (any more), and
| > | > | areas aren't exposed in the concrete Cmm syntax at all.  The
| > | > | current semantics is that areas may overlap with each other, so
| > | > | there should only be one active area at any point.  I found that
| > | > | this was important to ensure that we could generate good code
| > | > | from the stack layout algorithm, otherwise it had to make
| > | > | pessimistic assumptions
| > | and use too much stack.
| > | > |
| > | > | You're going to ask me where this is documented, and I think I
| > | > | have to admit to slacking off, sorry :-)  We did discuss it at
| > | > | the time, and I made copious notes, but I didn't transfer those
| to the code.
| > | > | I'll add a Note.
| > | > |
| > | > | Cheers,
| > | > | Simon
| > | > |
| > | > |
| > | > | > Simon
| > | > | >
| > | > | > | -----Original Message-----
| > | > | > | From: Simon Marlow [mailto:marlowsd@gmail.com]
| > | > | > | Sent: 08 January 2014 09:26
| > | > | > | To: Simon Peyton Jones; Herbert Valerio Riedel
| > | > | > | Cc: ghc-devs@haskell.org
| > | > | > | Subject: Re: High-level Cmm code and stack allocation
| > | > | > |
| > | > | > | On 07/01/14 22:53, Simon Peyton Jones wrote:
| > | > | > | > | Yes, this is technically wrong but luckily works.  I'd
| > | > | > | > | very much like to have a better solution, preferably one
| > | > | > | > | that doesn't add any extra overhead.
| > | > | > | >
| > | > | > | > | __decodeFloat_Int is a C function, so it will not touch
| > | > | > | > | the Haskell stack.
| > | > | > | >
| > | > | > | > This all seems terribly fragile to me.  At least it ought
| > | > | > | > to be
| > | > | > | surrounded with massive comments pointing out how terribly
| > | > | > | fragile it is, breaking all the rules that we carefully
| > | > | > | document
| > | elsewhere.
| > | > | > | >
| > | > | > | > Can't we just allocate a Cmm "area"? The address of an
| > | > | > | > area is a
| > | > | > | perfectly well-defined Cmm value.
| > | > | > |
| > | > | > | It is fragile, yes.  We can't use static memory because it
| > | > | > | needs to be thread-local.  This particular hack has gone
| > | > | > | through several iterations over the years: first we had
| > | > | > | static memory, which broke when we did the parallel runtime,
| > | > | > | then we had special storage in the Capability, which we gave
| > | > | > | up when GMP was split out into a separate library, because
| > | > | > | it didn't seem right to have magic fields in the Capability
| for one library.
| > | > | > |
| > | > | > | I'm looking into whether we can do temporary allocation on
| > | > | > | the heap for this instead.
| > | > | > |
| > | > | > | Cheers,
| > | > | > | Simon
| > | > | > |
| > | > | > |
| > | > | > | > Simon
| > | > | > | >
| > | > | > | > | -----Original Message-----
| > | > | > | > | From: ghc-devs [mailto:ghc-devs-bounces@haskell.org] On
| > | > | > | > | Behalf Of Simon Marlow
| > | > | > | > | Sent: 07 January 2014 16:05
| > | > | > | > | To: Herbert Valerio Riedel; ghc-devs@haskell.org
| > | > | > | > | Subject: Re: High-level Cmm code and stack allocation
| > | > | > | > |
| > | > | > | > | On 04/01/2014 23:26, Herbert Valerio Riedel wrote:
| > | > | > | > | > Hello,
| > | > | > | > | >
| > | > | > | > | > According to Note [Syntax of .cmm files],
| > | > | > | > | >
| > | > | > | > | > | There are two ways to write .cmm code:
| > | > | > | > | > |
| > | > | > | > | > |  (1) High-level Cmm code delegates the stack
| > | > | > | > | > | handling to GHC,
| > | > | > | and
| > | > | > | > | > |      never explicitly mentions Sp or registers.
| > | > | > | > | > |
| > | > | > | > | > |  (2) Low-level Cmm manages the stack itself, and
| > | > | > | > | > | must know
| > | > | about
| > | > | > | > | > |      calling conventions.
| > | > | > | > | > |
| > | > | > | > | > | Whether you want high-level or low-level Cmm is
| > | > | > | > | > | indicated by the presence of an argument list on a
| > | procedure.
| > | > | > | > | >
| > | > | > | > | > However, while working on integer-gmp I've been
| > | > | > | > | > noticing in integer-gmp/cbits/gmp-wrappers.cmm that
| > | > | > | > | > even though all Cmm
| > | > | > | > | procedures
| > | > | > | > | > have been converted to high-level Cmm, they still
| > | > | > | > | > reference the
| > | > | > | 'Sp'
| > | > | > | > | > register, e.g.
| > | > | > | > | >
| > | > | > | > | >
| > | > | > | > | >      #define GMP_TAKE1_RET1(name,mp_fun)       \
| > | > | > | > | >      name (W_ ws1, P_ d1)                      \
| > | > | > | > | >      {                                         \
| > | > | > | > | >        W_ mp_tmp1;                             \
| > | > | > | > | >        W_ mp_result1;                          \
| > | > | > | > | >                                                \
| > | > | > | > | >      again:                                    \
| > | > | > | > | >        STK_CHK_GEN_N (2 * SIZEOF_MP_INT);      \
| > | > | > | > | >        MAYBE_GC(again);                        \
| > | > | > | > | >                                                \
| > | > | > | > | >        mp_tmp1    = Sp - 1 * SIZEOF_MP_INT;    \
| > | > | > | > | >        mp_result1 = Sp - 2 * SIZEOF_MP_INT;    \
| > | > | > | > | >        ...                                     \
| > | > | > | > | >
| > | > | > | > | >
| > | > | > | > | > So is this valid high-level Cmm code? What's the
| > | > | > | > | > proper way to
| > | > | > | > | allocate
| > | > | > | > | > Stack (and/or Heap) memory from high-level Cmm code?
| > | > | > | > |
| > | > | > | > | Yes, this is technically wrong but luckily works.  I'd
| > | > | > | > | very much like to have a better solution, preferably one
| > | > | > | > | that doesn't add any extra overhead.
| > | > | > | > |
| > | > | > | > | The problem here is that we need to allocate a couple of
| > | > | > | > | temporary words and take their address; that's an
| > | > | > | > | unusual thing to do in Cmm, so it only occurs in a few
| > | > | > | > | places (mainly
| > | > | interacting with gmp).
| > | > | > | > | Usually if you want some temporary storage you can use
| > | > | > | > | local variables or some heap-allocated memory.
| > | > | > | > |
| > | > | > | > | Cheers,
| > | > | > | > | Simon
| > | > | > | > | _______________________________________________
| > | > | > | > | ghc-devs mailing list
| > | > | > | > | ghc-devs@haskell.org
| > | > | > | > | http://www.haskell.org/mailman/listinfo/ghc-devs
| > | > | > | >
| > | > | >
| > | >
| >