
Hello,

Over the last few days I have been studying the GHC code generator: reading -ddump output of all sorts, the GHC sources, various papers, and John's appeals :) Here is what I took away from this investigation.

GHC has great high-level optimization. Moreover, GHC now allows you to program STG almost directly. When I look at STG code I don't see anything that could be done better, at least for the simple loops I tried to compile: just unboxed variables, primitive operations, and simple loops represented as tail recursion. Fine.

STG is then compiled to a simplified C (called "abstract C" earlier and quasi C-- now), which can in turn be translated:

* to real C, then compiled by gcc
* to assembler by the built-in simple C-- compiler
* to assembler by some external C-- compiler (at least that is theoretically possible)

What is this "simplified C"? It is really a generalized abstract assembler for modern CPUs. Its most important feature, for our purposes, is that all parameters of STG functions are passed via the stack. So, for example, the following STG code:

    factorial n r = if n == 1 then r else factorial (n - 1) (n * r)

is translated to something like this:

    factorial:
        _s1BD = *(Sp + 0);
        if (_s1BD != 1) goto c1C4;    // n != 1 ?
        R1 = *(Sp + 4);
        Sp = Sp + 8;
        return (*(Sp + 0))();         // return from factorial
    c1C4:
        _s1BI = _s1BD * (*(Sp + 4));  // n * r
        _s1BF = _s1BD - 1;            // n - 1
        *(Sp + 4) = _s1BI;
        *(Sp + 0) = _s1BF;
        goto factorial;

(For clarity I rewrote the STG code as its Haskell equivalent and the C-- as C, plus added a few comments.)

Is this code good? It depends :) Seen as assembler code, it is really bad: for just 2 instructions that perform real work, there are 4 memory accesses, two jumps, and a comparison. Unfortunately, this translation is what GHC really does when it is run in "-fasm" mode. Now you can see why even strict Haskell code is compiled to such slow executables. There is also the gcc path ("-fvia-C", auto-enabled by "-O"), but it is only several percent better (as the GHC docs say).
Why can't this shining C compiler optimize such trivial code?

1) Because it simply isn't asked to. You can enable gcc optimization by adding "-optc-O6" to the command line, but this leads to compilation errors in some modules. It seems that gcc optimization is not compatible with the "evil mangler", GHC's own post-processing trick.

2) Moreover, enabling gcc optimization seems to make a noticeable improvement only on tight loops like this factorial calculation, and even in this case the asm code produced is far worse than for a normal C program that uses an explicit loop (see the enclosed fac.c and fac-c.s, which contain factorial functions manually written in C using tail recursion and an explicit loop):

    factorial:
        movl  (%ebp), %edx
        cmpl  $1, %edx
        je    L6
        movl  4(%ebp), %eax
        imull %edx, %eax
        decl  %edx
        movl  %edx, (%ebp)
        movl  %eax, 4(%ebp)
        jmp   factorial

As you can see, gcc has not unrolled the loop and has retained all the stack<->register moves. Why?

The short answer is that gcc just doesn't know that the data pointed to by Sp is non-volatile: it cannot be changed in the middle of the function by some asynchronous computation or exception, and it does not need to be written back immediately. Maybe there is some way to tell gcc this? That would immediately and SUBSTANTIALLY improve all GHC code generation.

The long answer: have you ever heard the promise that gcc is the best CPU-independent assembler? No? And do you know why? Because gcc is not a CPU-independent assembler. gcc was heavily optimized to produce efficient asm from the kind of code C programmers usually write, and the code generated by GHC has nothing in common with that. So we are left with all these register-memory moves, non-unrolled loops, and all the other results of naive compilation; gcc just doesn't know how to handle such code efficiently!

So now we have a rather strange situation: GHC generates not-so-bad "abstract asm" code, but there is simply no tool that compiles this code to efficient "concrete asm"!
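To illustrate the aliasing problem, here is a small C++ sketch (the function names and the two-slot stack layout are mine, not GHC's). The first function mirrors the literal translation, where every use of n and r goes through Sp, so the compiler must reload and re-store the slots on every iteration; the second caches the slots in locals, which the compiler is free to keep in registers precisely because locals cannot be aliased from outside:

```cpp
#include <cassert>

// Literal translation: every iteration does two loads and two stores,
// because gcc cannot prove that nothing else touches Sp[0] and Sp[1].
long fac_through_memory(long* Sp) {
    while (Sp[0] != 1) {
        Sp[1] = Sp[0] * Sp[1];   // r = n * r
        Sp[0] = Sp[0] - 1;       // n = n - 1
    }
    return Sp[1];
}

// Same loop after caching the slots in locals: one load at entry,
// one store at exit, and the whole loop can run in registers.
long fac_in_registers(long* Sp) {
    long n = Sp[0], r = Sp[1];   // load once
    while (n != 1) { r *= n; --n; }
    Sp[0] = n; Sp[1] = r;        // store once
    return r;
}
```

Both compute the same result; the difference is only in how much the compiler is allowed to assume about the memory behind Sp.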
Yet that is perfectly possible: I can imagine a translation that uses registers to store these two internal variables, unrolls the loop, and schedules instructions so as to fully load a modern superscalar CPU. At this point I recalled the existence of C--, and now, I think, I understand the whole history:

- Long ago, in 1992, Simon PJ developed this abstract asm and the rules for translating STG to it (you can find the details in spineless-tagless-gmachine.ps.gz). In practice this "portable asm" was represented by low-level C code.
- Then he waited for efficient compilation of GHC-generated C code by gcc or some other C compiler.
- In 1999 he realized that this mountain would not come to him and started the C-- project. Its goal was exactly to implement a portable assembler and, in particular, to finally compile GHC-generated code efficiently.
- But this project failed. At least, enough time has passed, and I have not heard that C-- can now be used as a better alternative to gcc.

So we have what I stated above: GHC produces such good portable assembler code that there is no way to compile it efficiently :)))

Now, what can be done in this direction? We can improve either the "STG -> portable asm" translation or the "portable asm -> asm" translation. The second alternative can be implemented as:

* Improving GHC's "-fasm" code generator. This is the easiest way, but we would soon run into the limited number of registers and the need for global code analysis. Moreover, this path doesn't produce the fastest executables at the moment, so we would need some effort just to overtake gcc. And going this way we would be reimplementing something as standard as an optimizing compiler just for GHC, which is a useless waste of time. It may be better to improve a more widely used optimizing compiler, which brings us to the second alternative:

* Adding the appropriate optimizations to the existing C-- compiler, namely Quick C--.
This should not be much more complicated than the previous way, as long as we only try to optimize typical GHC-generated code. It would also help other language implementations based on Quick C--, such as Lua. But this way raises two questions.

First, C-- - and especially the "portable asm" generated by GHC - is a rather low-level language. As you probably know, the lower the level of a language, the harder (or even impossible) it is to optimize it as well as a higher-level one. This puts a serious ceiling on the ultimate level of optimization possible along this path, though at the moment it is a more theoretical question.

The more practical question is: how long would we have to wait for QC-- to implement all the optimizations already present in gcc - register allocation, loop unrolling, instruction scheduling, global inlining, SSE support, run-time selection between Centrino-optimized and Pentium4-optimized code, and more? Using C-- as a portable assembler looks good on paper, but can we really compete with gcc?? Moreover, even gcc itself can't compete with the Intel C compiler! So, going this way, we will never reach the speed of C programs. And this conclusion brings us to the third alternative:

* Use the FULL C/C++ language as the "portable asm". That means mapping STG to natural C++ idioms instead of the current translation. Just compare the code gcc generates for the above-mentioned C code, for the fWXAXDfMainXDfac function in the included fac.c, and for the fac function in the same file: the first is the current GHC translation, the second is jhc's translation, and the third is my manual translation of the same STG function to idiomatic C. As we can see, there is a big difference: the closer our intermediate code is to idiomatic C, the better the code gcc generates.
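For comparison, here is roughly what the two "idiomatic C" shapes discussed above might look like (this is only a sketch; the names are mine, not those used in the attached fac.c):

```cpp
#include <cassert>

// Tail-recursive form: mirrors the STG function directly, parameters
// passed the C++ way instead of through an explicit Sp stack.
long fac_tailrec(long n, long r) {
    return n <= 1 ? r : fac_tailrec(n - 1, n * r);
}

// Explicit-loop form: what a C programmer would write by hand.
long fac_loop(long n) {
    long r = 1;
    while (n > 1) r *= n--;
    return r;
}
```

With optimization enabled, gcc turns the tail call into a jump, so both forms compile down to the same tight register-only loop - exactly the effect the proposed translation is after.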
So I propose the following, rather straightforward, STG->C++ translation for "unboxed" functions:

* the C++ calling stack should be used as much as possible
* parameters are passed the C++ way: "factorial(int n, int r)"
* multiple results can be returned via the C++ pair template, if gcc implements it efficiently
* recursive definitions are translated into for/while loops where possible
* groups of mutually recursive definitions should also be translated as "naturally" as possible
* functions marked INLINE in Haskell code should be marked the same in the C++ code

The next source of inefficiency is polymorphic and higher-order functions. It would be great if such functions were inlined/specialized more aggressively than other functions, so that we reach "concrete" code as often as possible. This way Haskell's high-level features would cost us nothing in most cases, like templates in C++, which are free in terms of execution time. It would also let us drop most of the INLINE pragmas in the libraries: small functions will be inlined anyway, large functions will be specialized to concrete code, and if the user wants to maximally optimize some concrete function in his program, it is better to request optimization of that concrete function (20% of our code performs 80% of the work, but only the application programmer knows which invocation of "sort" needs to be inlined and which is used just to order 2 values).

The next source of inefficiency is lazy evaluation. I propose:

* If a function has strict parameters, its worker part should accept these parameters in unboxed form (this is already implemented), and the wrappers for this worker should be inlined at call sites - the same requirement as for polymorphic functions.
* Maximally speed up the use of already-evaluated values: instead of unconditionally jumping to the closure-computation function, just test this field and continue the computation if it contains NULL (indicating that the value is already in WHNF).
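As a concrete illustration of that test, here is a minimal sketch of the proposed fast path (the closure layout and names are invented, not GHC's actual representation):

```cpp
#include <cassert>

// Hypothetical closure: a null entry pointer marks a value already in
// WHNF, so forcing it costs a single test instead of an unconditional
// indirect jump into evaluation code.
struct Closure {
    void (*entry)(Closure*);  // nullptr => payload already evaluated
    long payload;
};

long force(Closure* c) {
    if (c->entry) {           // cheap test replaces the jump
        c->entry(c);          // evaluate to WHNF...
        c->entry = nullptr;   // ...and remember that we did
    }
    return c->payload;
}
```

Only the first force of a thunk pays for the indirect call; every later use of the value costs just one test of the entry field.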
This change should make the time penalty of boxed data structures significantly - at least 10 times - smaller.

* Evaluate boxed values that are not in WHNF using the C (hardware) stack where possible; this would free the ebp register. Where that is impossible, a "two stacks" scheme can be used: the hardware stack and "natural" C recursion for unboxed values, and an ebp-based, explicitly managed stack for boxed values.
* At the moment, using I/O exceptions in GHC leads to the generation of boxed data structures. It would be great to use the C/C++ exception-handling facilities (setjmp or try) to make I/O exceptions "free" in terms of speed.
* In addition to the ability to define strict fields, allow defining strict data structures (say, lists) and strict types/structures for function parameters and results. That is a long-standing goal, but a "![!Int] -> !Int" function can be used inside higher-order code a lot faster than a "[Int] -> Int" one. The ultimate goal is to allow Haskell to be as fast as the ML dialects in every situation, and this means that laziness should be optional everywhere. This would also allow efficient emulation of C++ virtual methods.

Seymour Cray once said that in order to develop a fast vector processor, you should start by making a fast scalar one. The same applies to Haskell: although it is a lazy language, to make a fast implementation we first need to optimize strict, "unboxed" code. This will allow us:

* to compete with C - every algorithm that can be expressed in C can be expressed in Haskell with the same complexity and run at the same speed.
  This includes all the shootout-type competitions and any other strict algorithms.
* to rewrite the speed-critical parts of any Haskell program in the strict manner and get the maximum possible speed without using the FFI and marshalling data structures between C and Haskell
* to make a solid backend for GHC's "strictifying" optimizations

I personally can work on the first part of my proposals - an efficient translation from STG to "natural C/C++". This should give significant improvements for small strict loops. I also think the following things can be implemented rather easily and should give a significant speedup:

* fix the problems with using "-optc-O6"
* mark the Sp variable in some way to avoid excessive moves between stack and registers
* make evaluation of boxed values that are already in WHNF as fast as possible

Full implementation of these proposals should lead to the following:

* For simple imperative algorithms, C-quality code will be generated. This means that such things as matrix libraries or game physics engines written in pure Haskell will run at just the same speed as their C++ equivalents. Isn't this what we have been waiting for all these years? :)
* "Real-world programs" like darcs or pugs that mostly use strict computations will get a significant rise in speed and come close to the speed of similar C++ projects that use templates and classes to express their complex functionality.
* "High-school algorithms" that involve a large number of lazy computations should also get a significant speedup; even in such algorithms most computations, imho, use data in WHNF and will therefore go faster.

--
Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com