
Hello Simon, Friday, February 24, 2006, 12:52:36 PM, you wrote: SM> Wow! You make a lot of good suggestions. I think some of them are SM> unfortunately unworkable, some others have been tried before, but there SM> are definitely some to try. I'll suggest a few more in this email. can you comment about unworkable/hard to implement ones? may be, ghc users can propose some solutions that you are skipped SM> First of all I should say that we don't *want* to use gcc as a code SM> generator. Everything we've been doing in the back end over the last SM> few years has been aiming towards dropping the dependency on gcc and SM> removing the Evil Mangler. Now is not the time to be spending a lot of SM> effort in beefing up the C back end :-) Our preference is to work on SM> both the native and C-- back ends; the C-- back end has the potential to SM> produce the best code and be the most portable. We also plan to keep SM> the "unregisterised" C back end for the purposes of porting to a new SM> platform, it's only the registerised path we want to lose. i know that is the hardest part of discussion. but sometimes we need to change our decisions. i think that it's better to compare efforts/results of going toward gcc/c-- way starting from current state of ghc, avoiding counting the previous efforts :( on the c-- side: 1) time/efforts required to implement many low-level optimizations 2) more or less efficient code on the gcc side: 1) time required to implement generation of native C code instead of current "portable asm" 2) one of the best optimization possible gcc way already exists and works, while C-- way is still unfinished even in its basic, unoptimized state. next, unregisterized C way provides the maximum possible level of portability SM> Having said all that, if we can get better code out of the C back end SM> without too much effort, then that's certainly worthwhile. moreover, many changes, proposed on this thread, perfectly implementable even on C-- way
translated to something like this
factorial: _s1BD = *(Sp + 0); if (_s1BD != 1) goto c1C4; // n!=1 ? R1 = *(Sp + 4); Sp = Sp + 8; return (*(Sp + 0))(); // return from factorial c1C4: _s1BI = _s1BD * (*(Sp + 4)); // n*r _s1BF = _s1BD - 1; // n-1 *(Sp + 4) = _s1BI; *(Sp + 0) = _s1BF; goto factorial;
SM> To be fair, this is only the translation on an architecture that has no SM> argument registers (x86, and currently x86_64 but hopefully not for long). this just emphasizes main problem of "portable asm" code generated by ghc - you can't optimize it as good as C compiler optimizes idiomatic C code. x86 has 7 registers, after all, and gcc generates great code for this procedure - but only if it is written in "idiomatic" manner. you will need to develop your own optimization techniques, and it will be hard and long way comparing to just generation of more high-level C code just to compare - i've rewritten my binary serialization library two times. the first version just used string concatenation (!), the second was the carefully optimized using my C experience. nevertheless, it was slow enough. and the final 3rd was optimized according to ghc features. and the result was astonishing "portable asm" looking great in theory, and C-- as portable low-level language is great theoretical idea. but reality has its own laws. want to know why my second version of serialization library, highly optimized according to C experience, was so slow? it just returns tuples and constructing/analyzing these lazy tuples used 80% of the time ghc generates not so fast code and that is known for many years. you have developed the proper theoretical way to solve this problem, but there is also much easier backdoor :) at least, it seems like this SM> The simple optimisation of changing the final tail call to factorial SM> into a goto to a label at the beginning of the function (as suggested by SM> John Meacham I think) should improve the code generated by gcc for SM> functions like this, and is easy to do. the main detail is to work with local variables instead of Sp[xx]. gcc can't optimize access to Sp[xx]
1) because it just not asked. you can enable gcc optimization by adding "-optc-O6" to the cmdline, but this leads to compilation errors on part of modules. it seems that gcc optimization is not compatible with "evil mangler" that is the ghc's own optimization trick.
SM> -O3 works, I haven't tried -O6. i've attached "optc-O2-bug.hs" that don't work both with -optc-O2 and -optc-O3
* C++ calling stack should be used as much as possible * parameters are passed in C++ way: "factorial(int n, int r)"
SM> This is unworkable, I'm afraid. Go and read Simon's paper on the STG: SM> http://citeseer.ist.psu.edu/peytonjones92implementing.html SM> there are two main reasons we don't use the C stack: garbage collection SM> and tail calls. What do you plan to do about GC? minimalistic solution i see - just use two stacks. unboxed values are passed using the C conventions and C compiler will be able to optimize using of this stack. boxed values are passed using the current Sp[] machinery and therefore GC will not be changed
* groups of mutually recursive definitions should be also "naturally" translated as much as possible
i mean here that such group should be analyzed and if there is just one entry point function, then the whole group should be compiled in one large function. the possible alternative is to compile these functions in current manner but mark them "inline" so the gcc will make these optimizations for us. i don't tested how that will work on practice, though
* functions marked INLINE in Haskell code should be marked the same in C++ code
SM> an inline function will have been inlined, so not sure what you mean here! only if they are not recursive! so, to make them really inline, we should either change compilation of these functions or at least try this trick
* maximally speed up using of already evaluated values. instead of unconditional jumping to the closure-computation function just test this field and continue computation if it contains NULL (indicating that value is already in WHNF). This change will make time penalty of boxing data structures significantly, at least 10 times, smaller
SM> This is called semi-tagging, it was tried a long time ago. Certainly SM> worth trying again, but I very much doubt the gains would be anything SM> like you claim, and it increases code size. we will replace 2 unpredictable jumps with one that will not be even executed if value is in WHNF. at least, i'm very displeased by slowness of lazy values evaluation in GHC. for example, my binary serialization package writes 60 mb/s working with unboxed arrays and only 20 mb/s working with the lists SM> There are other schemes, including this one that we particularly like: SM> use a spare bit in a pointer to indicate "points to an evaluated value". SM> Since there are two spare bits, you can go further and use the 4 SM> values to mean: SM> 0: possibly unevaluated SM> 1: evaluated, tag 0 SM> 2: evaluated, tag 1 SM> 3: evaluated, tag 2 SM> I believe Robert Ennals tried this as part of his spec eval SM> implementation, but IIRC he didn't get separate measurements for it (or SM> it didn't improve things much). Note that this increases code size too. to be exact, we had only 2^30-1 valid pointers, all other values are ready to use! :)
* in addition to ability to define strict fields, allow to define strict data structures (say, lists) and strict types/structures of functions parameters and results. that is a long-standing goal, but the "![!Int] -> !Int" function can be used inside higher-order code a lot faster than "[Int] -> Int" one. the ultimate goal is to allow Haskell to be as fast as ML dialects in every situation and this means that laziness should be optional in every place. this will also allow to make efficient emulation of C++ virtual methods
SM> Strict types are interesting, and I know several people (including me) SM> who have put some thought into how they would work. It turns out to be SM> surprisingly difficult, though. SM> You might look into doing something simple: suppose you could declare a SM> type to be unlifted: SM> data !T = .. SM> The type T doesn't have a bottom element. Its kind is !, not *, and it SM> can't be used to instantiate a polymorphic type variable, just like SM> GHC's primitive types. SM> This is quite restrictive, but it's simple and pretty easy to implement SM> in GHC as it stands. It gives you a nice guarantee: expressions of type SM> T are never thunks. case expressions deconstructing T never need to SM> evaluate it first. SM> The next generalisation would be to allow ! as an arbitrary type SM> operator, but still with the kind restriction. Dropping the kind SM> restriction is where things start to get interesting... i know another way - allow compile-time polymorphism, just as in C++ with templates: instance Num Int# where (+) = (+#) ... mult :: (Num a) => a->a mult a = a+a instance Ord Int# where (>) = (>#) ... main = do print (mult 2 > 3) print (mult 2# > 3#) we just can't mix strict types with run-time polymorphism, all other things are compilable -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com