
Bulat Ziganshin wrote:
* C++ calling stack should be used as much as possible
* parameters are passed in C++ way: "factorial(int n, int r)"
SM> This is unworkable, I'm afraid. Go and read Simon's paper on the STG:
SM> http://citeseer.ist.psu.edu/peytonjones92implementing.html
SM> there are two main reasons we don't use the C stack: garbage collection
SM> and tail calls. What do you plan to do about GC?
The minimalistic solution I see: just use two stacks. Unboxed values are passed using the C conventions, so the C compiler will be able to optimize use of this stack; boxed values are passed using the current Sp[] machinery, and therefore GC will not be changed.
How do you propose to do thread switching? Incidentally, GHC had two stacks a long time ago (before 4.00); I rewrote the runtime and code generator to replace them with one stack. It was about 6 months of work, if I remember correctly. I think you misused the words "just" and "minimalistic" :-)

I know gcc does (some) tail-call optimisation these days. You don't get any *guarantee* that a call will be implemented as a tail call, though, and the conditions are quite complex. The gcc folks treat it as an optimisation, rather than a fundamental part of the operational semantics, which is what you need for Haskell.
SM> This is called semi-tagging, it was tried a long time ago. Certainly
SM> worth trying again, but I very much doubt the gains would be anything
SM> like you claim, and it increases code size.
We will replace 2 unpredictable jumps with one that will not even be executed if the value is already in WHNF. In any case, I'm very displeased by the slowness of lazy evaluation in GHC: for example, my binary serialization package writes 60 MB/s working with unboxed arrays but only 20 MB/s working with lists.
You need to *try* these things and measure the difference. Quite often I'm surprised by the actual results I get from an optimisation that looks worthwhile on paper. Today's processors with 3-level caches are complicated beasts. Semi-tagging increases code size, so unless you can make some significant gains to make up for it, you've lost already.

Let me give you another example: GHC puts info tables next to the entry code for a closure. On paper this looks like it might be a bad idea: it pollutes the instruction cache with data that is rarely accessed during execution. It's hard to arrange (this is one thing the mangler does), so we'd like to avoid it if possible. However, doing this eliminates an indirection when entering a closure, reduces the size of info tables by one word, and reduces code size. These effects together outweigh the code-cache-pollution effect - last time I measured it, the difference was ~10%. (Turn off TABLES_NEXT_TO_CODE and try it yourself.)
SM> There are other schemes, including this one that we particularly like:
SM> use a spare bit in a pointer to indicate "points to an evaluated value".
SM> Since there are two spare bits, you can go further and use the 4
SM> values to mean:
SM> 0: possibly unevaluated
SM> 1: evaluated, tag 0
SM> 2: evaluated, tag 1
SM> 3: evaluated, tag 2
SM> I believe Robert Ennals tried this as part of his spec eval
SM> implementation, but IIRC he didn't get separate measurements for it (or
SM> it didn't improve things much). Note that this increases code size too.
To be exact, we would then have only 2^30-1 valid pointers - all other values are ready to use! :)
True, but I'm assuming you often want to store a pointer in addition to the tag (e.g. for a cons cell). For enumerations, you're quite right that the whole word can be used for the tag, as long as it is distinguished from a pointer.

Cheers,
Simon