Re[2]: inside the GHC code generator

28 Feb 2006

      Hello Simon,

Friday, February 24, 2006, 12:52:36 PM, you wrote:
SM> Wow!  You make a lot of good suggestions.  I think some of them are
SM> unfortunately unworkable, some others have been tried before, but there
SM> are definitely some to try.  I'll suggest a few more in this email.

can you comment about unworkable/hard to implement ones? may be, ghc
users can propose some solutions that you are skipped

SM> First of all I should say that we don't *want* to use gcc as a code 
SM> generator.  Everything we've been doing in the back end over the last 
SM> few years has been aiming towards dropping the dependency on gcc and 
SM> removing the Evil Mangler.  Now is not the time to be spending a lot of
SM> effort in beefing up the C back end :-)  Our preference is to work on 
SM> both the native and C-- back ends; the C-- back end has the potential to
SM> produce the best code and be the most portable.  We also plan to keep 
SM> the "unregisterised" C back end for the purposes of porting to a new 
SM> platform, it's only the registerised path we want to lose.

i know that is the hardest part of discussion. but sometimes we need
to change our decisions. i think that it's better to compare
efforts/results of going toward gcc/c-- way starting from current
state of ghc, avoiding counting the previous efforts :(  on the c--
side:

1) time/efforts required to implement many low-level optimizations
2) more or less efficient code

on the gcc side:

1) time required to implement generation of native C code instead of
current "portable asm"
2) one of the best optimization possible

gcc way already exists and works, while C-- way is still unfinished
even in its basic, unoptimized state. next, unregisterized C way
provides the maximum possible level of portability

SM> Having said all that, if we can get better code out of the C back end
SM> without too much effort, then that's certainly worthwhile.

moreover, many changes, proposed on this thread, perfectly
implementable even on C-- way
...
...
translated to something like this
factorial:
        _s1BD = *(Sp + 0);
        if (_s1BD != 1) goto c1C4;     // n!=1 ?
        R1 = *(Sp + 4);
        Sp = Sp + 8;
        return (*(Sp + 0))();  // return from factorial
    c1C4:
        _s1BI = _s1BD * (*(Sp + 4));   // n*r
        _s1BF = _s1BD - 1;             // n-1
        *(Sp + 4) = _s1BI;
        *(Sp + 0) = _s1BF;
        goto factorial;
SM> To be fair, this is only the translation on an architecture that has no
SM> argument registers (x86, and currently x86_64 but hopefully not for long).

this just emphasizes main problem of "portable asm" code generated by
ghc - you can't optimize it as good as C compiler optimizes idiomatic
C code. x86 has 7 registers, after all, and gcc generates great code
for this procedure - but only if it is written in "idiomatic" manner.
you will need to develop your own optimization techniques, and it will
be hard and long way comparing to just generation of more high-level C
code

just to compare - i've rewritten my binary serialization library two
times. the first version just used string concatenation (!), the
second was the carefully optimized using my C experience.
nevertheless, it was slow enough. and the final 3rd was optimized
according to ghc features. and the result was astonishing

"portable asm" looking great in theory, and C-- as portable low-level
language is great theoretical idea. but reality has its own laws. want
to know why my second version of serialization library, highly
optimized according to C experience, was so slow? it just returns
tuples and constructing/analyzing these lazy tuples used 80% of the
time

ghc generates not so fast code and that is known for many years. you
have developed the proper theoretical way to solve this problem, but
there is also much easier backdoor :) at least, it seems like this

SM> The simple optimisation of changing the final tail call to factorial 
SM> into a goto to a label at the beginning of the function (as suggested by
SM> John Meacham I think) should improve the code generated by gcc for 
SM> functions like this, and is easy to do.

the main detail is to work with local variables instead of Sp[xx]. gcc
can't optimize access to Sp[xx]
...
...
1) because it just not asked. you can enable gcc optimization by
adding "-optc-O6" to the cmdline, but this leads to compilation errors
on part of modules. it seems that gcc optimization is not compatible
with "evil mangler" that is the ghc's own optimization trick.
SM> -O3 works, I haven't tried -O6.

i've attached "optc-O2-bug.hs" that don't work both with -optc-O2 and
-optc-O3
...
...
* C++ calling stack should be used as much as possible
* parameters are passed in C++ way: "factorial(int n, int r)"
SM> This is unworkable, I'm afraid.  Go and read Simon's paper on the STG:

SM>    http://citeseer.ist.psu.edu/peytonjones92implementing.html

SM> there are two main reasons we don't use the C stack: garbage collection
SM> and tail calls.  What do you plan to do about GC?

minimalistic solution i see - just use two stacks. unboxed values are
passed using the C conventions and C compiler will be able to
optimize using of this stack. boxed values are passed using the
current Sp[] machinery and therefore GC will not be changed
...
...
* groups of mutually recursive definitions should be also "naturally"
translated as much as possible
i mean here that such group should be analyzed and if there is just
one entry point function, then the whole group should be compiled in
one large function. the possible alternative is to compile these
functions in current manner but mark them "inline" so the gcc will
make these optimizations for us. i don't tested how that will work on
practice, though
...
...
* functions marked INLINE in Haskell code should be marked the same in
C++ code
SM> an inline function will have been inlined, so not sure what you mean here!

only if they are not recursive! so, to make them really inline, we
should either change compilation of these functions or at least try
this trick
...
...
* maximally speed up using of already evaluated values. instead of
unconditional jumping to the closure-computation function just test
this field and continue computation if it contains NULL (indicating
that value is already in WHNF). This change will make time penalty of
boxing data structures significantly, at least 10 times, smaller
SM> This is called semi-tagging, it was tried a long time ago.  Certainly 
SM> worth trying again, but I very much doubt the gains would be anything 
SM> like you claim, and it increases code size.

we will replace 2 unpredictable jumps with one that will not be even
executed if value is in WHNF. at least, i'm very displeased by slowness
of lazy values evaluation in GHC. for example, my binary serialization
package writes 60 mb/s working with unboxed arrays and only 20 mb/s
working with the lists

SM> There are other schemes, including this one that we particularly like:
SM> use a spare bit in a pointer to indicate "points to an evaluated value".
SM>   Since there are two spare bits, you can go further and use the 4 
SM> values to mean:

SM>    0: possibly unevaluated
SM>    1: evaluated, tag 0
SM>    2: evaluated, tag 1
SM>    3: evaluated, tag 2

SM> I believe Robert Ennals tried this as part of his spec eval 
SM> implementation, but IIRC he didn't get separate measurements for it (or
SM> it didn't improve things much).  Note that this increases code size too.

to be exact, we had only 2^30-1 valid pointers, all other values are
ready to use! :)
...
...
* in addition to ability to define strict fields, allow to define
strict data structures (say, lists) and strict types/structures of
functions parameters and results. that is a long-standing goal, but
the "![!Int] -> !Int" function can be used inside higher-order code a
lot faster than "[Int] -> Int" one. the ultimate goal is to allow
Haskell to be as fast as ML dialects in every situation and this means
that laziness should be optional in every place. this will also
allow to make efficient emulation of C++ virtual methods
SM> Strict types are interesting, and I know several people (including me)
SM> who have put some thought into how they would work.  It turns out to be
SM> surprisingly difficult, though.

SM> You might look into doing something simple: suppose you could declare a
SM> type to be unlifted:

SM>    data !T = ..

SM> The type T doesn't have a bottom element.  Its kind is !, not *, and it
SM> can't be used to instantiate a polymorphic type variable, just like 
SM> GHC's primitive types.

SM> This is quite restrictive, but it's simple and pretty easy to implement
SM> in GHC as it stands.  It gives you a nice guarantee: expressions of type
SM> T are never thunks.  case expressions deconstructing T never need to 
SM> evaluate it first.

SM> The next generalisation would be to allow ! as an arbitrary type 
SM> operator, but still with the kind restriction.  Dropping the kind 
SM> restriction is where things start to get interesting...

i know another way - allow compile-time polymorphism, just as in C++
with templates:

instance Num Int# where
  (+) = (+#)
  ...

mult :: (Num a) => a->a
mult a = a+a

instance Ord Int# where
  (>) = (>#)
  ...

main = do print (mult 2 > 3)
          print (mult 2# > 3#)

we just can't mix strict types with run-time polymorphism, all other
things are compilable

-- 
Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com

Re[2]: inside the GHC code generator

Bulat Ziganshin