inside the GHC code generator

Hello,

In the last few days I studied the GHC code generator, reading -ddump outputs of all sorts, GHC sources, various papers and John's appeals :)

What I carried away from this investigation: GHC has great high-level optimization. Moreover, GHC now allows you to program STG almost directly. When I look at STG code I don't see anything that could be done better, at least for the simple loops I tried to compile: just unboxed variables, primitive operations and simple loops represented as tail recursion. Fine.

STG is then compiled to a simplified C (called "abstract C" earlier and "quasi C--" now), which can next be translated:

* to real C and then compiled by gcc
* to assembler by the built-in simple C-- compiler
* to assembler by some external C-- compiler (at least it is theoretically possible)

What is this "simplified C"? It is really a generalized abstract assembler for modern cpus. Its most important feature for us is that all parameters of STG functions are passed via the stack. So, for example, the following STG code:

    factorial n r = if n == 1 then r else factorial (n - 1) (n * r)

is translated to something like this:

    factorial:
        _s1BD = *(Sp + 0);
        if (_s1BD != 1) goto c1C4;     // n != 1 ?
        R1 = *(Sp + 4);
        Sp = Sp + 8;
        return (*(Sp + 0))();          // return from factorial
    c1C4:
        _s1BI = _s1BD * (*(Sp + 4));   // n*r
        _s1BF = _s1BD - 1;             // n-1
        *(Sp + 4) = _s1BI;
        *(Sp + 0) = _s1BF;
        goto factorial;

(For clarity I rewrote the STG code to its Haskell equivalent and the C-- to C, plus made a few comments.)

Is this code good? It depends :) If we view it as assembler code, it is really bad: for just 2 instructions that perform real work, there are 4 memory accesses, two jumps and a comparison. Unfortunately, this translation is what GHC really does when it is called in "-fasm" mode. Now you should see why even strict Haskell code compiles to such slow executables. Well, there is also the gcc path ("-fvia-C", auto-enabled by "-O"), but it is only several percent better (as the GHC docs say).
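To see concretely what this stack-passing convention does, here is a small runnable C simulation of the loop above (the `Sp` naming follows the dump; the code itself is my illustration, not GHC output):

```c
/* A tiny simulation of the Sp-based calling convention shown above:
   both arguments live in a memory "stack" rather than in registers,
   so every iteration does memory loads and stores. */
static long stack[2];
static long *Sp = stack;

static long factorial_sp(void)
{
    for (;;) {
        long n = Sp[0];        /* _s1BD = *(Sp + 0) */
        if (n == 1)
            return Sp[1];      /* R1 = *(Sp + 4); return */
        Sp[1] = n * Sp[1];     /* n*r */
        Sp[0] = n - 1;         /* n-1 */
        /* goto factorial */
    }
}

static long run_factorial_sp(long n)
{
    Sp[0] = n;
    Sp[1] = 1;                 /* accumulator starts at 1 */
    return factorial_sp();
}
```

Even here, a C compiler cannot easily promote `Sp[0]` and `Sp[1]` to registers without knowing nothing else aliases the stack.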
Why can't this shining C compiler optimize such trivial code?

1) Because it simply isn't asked to. You can enable gcc optimization by adding "-optc-O6" to the command line, but this leads to compilation errors in some modules. It seems that gcc optimization is not compatible with the "evil mangler", GHC's own optimization trick.

2) Moreover, enabling gcc optimization seems to give noticeable improvements only on tight loops like this factorial calculation. And even in this case the asm code produced is far worse than for a normal C program that uses an explicit loop (see the enclosed fac.c and fac-c.s, which contain factorial functions manually written in C using tail recursion and an explicit loop):

    factorial:
        movl   (%ebp), %edx
        cmpl   $1, %edx
        je     L6
        movl   4(%ebp), %eax
        imull  %edx, %eax
        decl   %edx
        movl   %edx, (%ebp)
        movl   %eax, 4(%ebp)
        jmp    factorial

As you can see, gcc has not unrolled the loop and has retained all the stack<->register moves. Why?

The short answer is that gcc just doesn't know that the data pointed to by Sp is non-volatile: it can't be changed in the middle of the program by some asynchronous computation or exception, and isn't required to be updated immediately. Maybe there is some way to tell gcc this? That would immediately and SUBSTANTIALLY improve all ghc code generation.

The long answer is: have you ever heard the promise that gcc is the best cpu-independent assembler? No? Do you know why? Because gcc is not a cpu-independent assembler. gcc was heavily optimized to produce efficient asm from the kind of code usually written by C programmers, and the code generated by ghc has nothing in common with that. So we are stuck with all these register-memory moves, non-unrolled loops and all the other results of naive compilation. gcc just doesn't know how to handle such code efficiently!

So now we have a rather strange situation: ghc generates not-so-bad "abstract asm" code, but there is simply no tool that compiles this code to efficient "concrete asm"!
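For contrast, an explicit-loop factorial of the kind the attached fac.c contains (my reconstruction, since the attachment is not part of this text) is exactly the shape gcc was tuned to optimize:

```c
/* Accumulator-style factorial written as an idiomatic C loop.
   gcc keeps n and r in registers here and can unroll the loop,
   unlike in the Sp-based translation shown above. */
long factorial_loop(long n, long r)
{
    while (n != 1) {
        r *= n;
        n -= 1;
    }
    return r;
}
```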
Yet that is perfectly possible: I can imagine a translation that uses registers to store these two internal variables, unrolls the loop and schedules instructions to maximally load a modern superscalar cpu.

At this moment I recalled the existence of C--, and now, I think, I understand the whole history:

- Long ago, in 1992, Simon PJ developed this abstract asm and the rules for translating STG to it (you can find the details in spineless-tagless-gmachine.ps.gz). This "portable asm" was represented in practice by low-level C code.
- Then he stood waiting for efficient compilation of ghc-generated C code by gcc or some other C compiler.
- In 1999 he realized that this mountain would not walk to him and started the C-- project. Its goal was exactly to implement a portable assembler, and in particular to at last compile ghc-generated code efficiently.
- But this project has failed, at least in the sense that enough time has gone by and I have not heard that C-- can now be used as a better alternative to gcc.

So now we have what I stated above: ghc produces good portable assembler code. It is so good that there is no way to compile it efficiently :)))

Now, what can be done in this direction? We can improve either the "STG -> portable asm" or the "portable asm -> asm" translation. The second alternative can be implemented either by:

* Improving GHC's "-fasm" code generator. This is the easiest way, but we would soon run into the limited number of registers and the need for global code analysis. Moreover, this route doesn't build the fastest executables at the moment, so we would need some effort just to overtake gcc. And going this way we would reimplement such a standard thing as an optimizing compiler just for GHC, which is a useless waste of time. It would be better to improve a more widely used optimizing compiler, and this decision turns us to the second alternative:

* Adding the appropriate optimizations to the existing C-- compiler, namely Quick C--.
This should not be much more complicated than the previous way, as far as we try to optimize only typical ghc-generated code. That way we might also help other Quick C-- based language implementations, such as Lua. But on the other side this way raises two questions.

First, C--, and especially the "portable asm" generated by GHC, is a rather low-level language. As you probably know, the lower the level of a language, the harder (or even impossible) it is to optimize it as well as a higher-level one. This puts a serious ceiling on the ultimate optimization level possible this way, but at the moment that is a more theoretical question.

The more practical question is: how long will we wait for QC-- to implement all the optimizations that are already present in gcc, such as register allocation, loop unrolling, instruction scheduling, global inlining, SSE support, run-time selection between Centrino-optimized and Pentium4-optimized code, and more? Using C-- as a portable assembler looks good on paper, but can we really compete with gcc?? Moreover, even gcc itself can't compete with the Intel C compiler! So, going this way we will never reach the speed of C programs. And this conclusion turns us to the 3rd alternative:

* Use the FULL C/C++ language as the "portable asm". That means mapping STG to natural C++ idioms instead of the current translation. Just try to compare the code generated by gcc for the above-mentioned C code, the fWXAXDfMainXDfac function in the included fac.c, and the fac function in the same file. The first is the current ghc translation, the second is jhc's translation and the third is my manual translation of the same STG function to idiomatic C. As we can see there is a big difference: the closer our intermediate code is to idiomatic C, the better the resulting code generated by gcc.
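The jhc-style translation mentioned above compiles the recursion literally. A sketch of what such naively recursive C looks like (my reconstruction, since fac.c is not included here):

```c
/* Naive translation of the STG recursion into a recursive C
   function, roughly in the style attributed to jhc above.
   gcc must first recognize the tail call as a loop before it
   can optimize it, whereas an explicit loop is already in the
   form gcc optimizes best. */
long factorial_rec(long n, long r)
{
    if (n == 1)
        return r;
    return factorial_rec(n - 1, n * r);
}
```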
So I propose the following, rather straightforward STG->C++ translation for "unboxed" functions:

* the C++ calling stack should be used as much as possible
* parameters are passed the C++ way: "factorial(int n, int r)"
* multiple results can be returned via the C++ pair template, if this is efficiently implemented in gcc
* recursive definitions are translated into for/while loops where possible
* groups of mutually recursive definitions should also be translated as "naturally" as possible
* functions marked INLINE in Haskell code should be marked the same in C++ code

The next source of inefficiency is polymorphic and higher-order functions. It would be great if such functions were inlined/specialized more aggressively than other functions, to reach "concrete" code as often as possible. This way Haskell's high-level features would cost us nothing in most cases, like templates in C++, which are free in terms of execution time. It would also let us drop most of the INLINE pragmas in the libraries: small functions will be inlined anyway, large functions will be specialized to concrete code, and if the user wants to maximally optimize some concrete function in his program, it is better to request optimization of that concrete function (20% of our code performs 80% of the work, but only the application programmer knows which invocation of "sort" needs to be inlined and which is used just to order 2 values).

The next source of inefficiency is lazy evaluation. I propose:

* if some function has strict parameters, its worker part should accept these parameters in unboxed form (this is already implemented) and the wrappers for this worker should be inlined at call sites. This is the same requirement as for polymorphic functions.
* maximally speed up the use of already-evaluated values. Instead of unconditionally jumping to the closure-computation function, just test this field and continue the computation if it contains NULL (indicating that the value is already in WHNF).
This change would make the time penalty of boxed data structures significantly, at least 10 times, smaller.

* evaluate boxed values that are not in WHNF using the C (hardware) stack, if possible. This would allow us to free the ebp register. If this is impossible, a "two stacks" scheme can be used: the hardware stack and "natural" C recursion for unboxed values, and an ebp-based explicitly managed stack for boxed values.
* at the moment, using I/O exceptions in ghc leads to the generation of boxed data structures. It would be great to use C/C++ exception handling (setjmp or try) to make I/O exceptions "free" in terms of speed.
* in addition to the ability to define strict fields, allow defining strict data structures (say, lists) and strict types/structures for function parameters and results. That is a long-standing goal, but a "![!Int] -> !Int" function can be used inside higher-order code a lot faster than a "[Int] -> Int" one. The ultimate goal is to allow Haskell to be as fast as the ML dialects in every situation, and this means that laziness should be optional everywhere. This would also allow an efficient emulation of C++ virtual methods.

Seymour Cray once said that in order to develop a fast vector processor, you should start by making a fast scalar one. The same applies to Haskell: although it is a lazy language, to make a fast implementation we first need to optimize strict, "unboxed" code. This will allow us:

* to compete with C: every algorithm that can be expressed in C can be expressed in Haskell with the same complexity and run at the same speed.
This includes all the shootout-type competitions and any other strict algorithms.

* to rewrite the speed-critical parts of any Haskell program in the strict manner and get the maximal possible speed without using the FFI and marshalling data structures between C and Haskell
* to make a solid backend for ghc's "strictifying" optimizations

I personally can work on the first part of my proposals: making an efficient translation from STG to "natural C/C++". This should give significant improvements for small strict loops. I also think that the following things can be implemented rather easily and should give a significant speedup:

* fix the problems with using "-optc-O6"
* mark the Sp variable in some way to avoid excessive moves between stack and registers
* make the evaluation of boxed values that are already in WHNF as fast as possible

Full implementation of these proposals would lead to the following:

* for simple imperative algorithms, C-quality code will be generated. This means that such things as matrix libraries or game physics engines written in pure Haskell would have just the same speed as their C++ equivalents. Isn't this what we have waited for all these years? :)
* "real-world programs" like darcs or pugs that mostly use strict computations would get a significant raise in speed, and become close to the speed of similar C++ projects that use templates and classes to express their complex functionality
* "high-school algorithms" that involve a large number of lazy computations should also get a significant speedup. Even in such algorithms most computations, imho, use data in WHNF and would therefore go faster

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

the long answer is: have you ever heard the promise that gcc is the best cpu-independent assembler? no? do you know why? because gcc is not a cpu-independent assembler. gcc was heavily optimized to produce efficient asm from the kind of code usually written by C programmers, and the code generated by ghc has nothing in common with that. so we are stuck with all these register-memory moves, non-unrolled loops and all the other results of naive compilation. gcc just doesn't know how to handle such code efficiently!
would there be any advantage in targeting gcc's backend directly? I notice that Mercury seems to support this: http://www.cs.mu.oz.au/research/mercury/download/gcc-backend.html http://gcc.gnu.org/frontends.html that is, does using C as assembler disable possible optimizations, or does going through the C frontend enable more optimizations than going to the backend directly? just curious, claus

Claus Reinke writes:
the long answer is: have you ever heard the promise that gcc is the best cpu-independent assembler? no? do you know why? because gcc is not a cpu-independent assembler. gcc was heavily optimized to produce efficient asm from the kind of code usually written by C programmers, and the code generated by ghc has nothing in common with that. so we are stuck with all these register-memory moves, non-unrolled loops and all the other results of naive compilation. gcc just doesn't know how to handle such code efficiently!
would there be any advantage in targeting gcc's backend directly?
I notice that Mercury seems to support this: http://www.cs.mu.oz.au/research/mercury/download/gcc-backend.html http://gcc.gnu.org/frontends.html
that is, does using C as assembler disable possible optimizations, or does going through the C frontend enable more optimizations than going to the backend directly?
On a related point, Mercury has two C backends: a low-level one at the level of GHC's, and a high-level one. Bulat might want to read this for a description of the high-level C implementation: http://www.cs.mu.oz.au/research/mercury/information/papers.html#hlc_cc Also, ghc used to be faster than gcc for a naive, recursive factorial function (once the CPR analysis and optimisation was added). From what Bulat wrote it seems that gcc got better ... k

Hello Kevin, Thursday, February 23, 2006, 9:06:25 PM, you wrote:

KG> On a related point, Mercury has two C backends a low level one at the
KG> level of GHC's and a high level one. Bulat might want to read this for
KG> a description of the high level C implementation:
KG> http://www.cs.mu.oz.au/research/mercury/information/papers.html#hlc_cc

citing from this paper's abstract: "Many logic programming implementations compile to C, but they compile to very low-level C, and thus discard many of the advantages of compiling to a high-level language". that is the same as I think.

KG> Also, ghc used to be faster than gcc for a naive, recursive factorial
KG> function (once the cpr analysis and optimisation was added). From
KG> what Bulat wrote it seems that gcc got better ...

i don't say that we must compile recursive Haskell/STG functions naively to recursive C ones (as jhc does). no - i say that we should translate recursive Haskell definitions into explicit C loops, which is NATURAL C PROGRAMMING STYLE and is therefore optimized much better, as you can see in the files i attached

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin writes:
Hello Kevin,
KG> Also, ghc used to be faster than gcc for a naive, recursive factorial KG> function (once the cpr analysis and optimisation was added). From KG> what Bulat wrote it seems that gcc got better ...
i don't say that we must compile recursive Haskell/STG functions naively to recursive C ones (as jhc does). no - i say that we should translate recursive Haskell definitions into explicit C loops, which is NATURAL C PROGRAMMING STYLE and is therefore optimized much better, as you can see in the files i attached
Ahhh, OK. I misunderstood. I don't claim ghc beat a C loop. ghc was just more optimised at making function calls :-). k

Hello Bulat,

From my (limited) knowledge of the GHC backend, the difficult part of your plan is that STG is not suited to compilation to native C at all. You might need to do a quite advanced translation from STG to another intermediate language (such as GRIN), and then some more advanced analysis/optimization before C can be generated.

iirc, the tricky part is the handling of laziness. At any point you may end up with a thunk (closure), which ghc handles easily by "evaluating" it: it's always a function pointer. (That's the tagless part.) When the value is in WHNF, it just calls a very simple function that fills registers with the evaluated data. Otherwise, the function call performs the reduction to WHNF. If you want to use the same trick, you'll end up with the same problems (bad C). So you'll want to eliminate thunks statically, which finally requires a 'high level' analysis as I suggested above.

Also, allocation + garbage collection is extremely efficient in current GHC... C/C++ allocation doesn't even come close. It's entirely possible that even with a very fast backend, the better ghc allocator will be enough to swing the balance. (see jhc) It might be possible to re-use most of it, however.

I certainly don't want to discourage you (we all want faster code :), but there is no easy path to climb the mountain ;)

Cheers, JP.
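The "it's always a function pointer" scheme JP describes can be sketched in C roughly like this (the names and the closure layout are my illustration of the idea, not GHC's actual representation):

```c
/* Tagless-style boxed value: the entry function pointer is always
   valid, so evaluation is an unconditional indirect call.  An
   already-evaluated value just carries a trivial entry function. */
typedef struct Closure Closure;
struct Closure {
    void (*entry)(Closure *);  /* always callable: "just enter it" */
    long value;                /* valid once in WHNF */
    long arg;                  /* payload captured by the thunk */
};

/* Entry for a value already in WHNF: nothing to do. */
static void evaluated_entry(Closure *c)
{
    (void)c;
}

/* Example thunk body: lazily compute arg * 2, then become evaluated. */
static void double_entry(Closure *c)
{
    c->value = c->arg * 2;
    c->entry = evaluated_entry;
}

static long force(Closure *c)
{
    c->entry(c);    /* unconditional call, even if already in WHNF */
    return c->value;
}

/* Demo: forcing twice is safe; the second force is the trivial call. */
static long demo_force(void)
{
    Closure c = { double_entry, 0, 21 };
    force(&c);
    return force(&c);
}
```

The cost JP alludes to is visible here: even WHNF values pay an indirect call, which is what Bulat's NULL-test proposal below tries to avoid.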

Hello Jean-Philippe, Friday, February 24, 2006, 2:04:57 AM, you wrote:

JPB> From my (limited) knowledge of GHC backend, the difficult part of your
JPB> plan is that STG is not suited to compilation to native C at all. You
JPB> might need to do quite advanced translation from STG to another
JPB> intermediate language, (as GRIN for example), and then some more
JPB> advanced analysis/optimization before C can be generated.

your knowledge is very limited, because STG has at all times been compiled to C and then compiled by gcc :) moreover, it's very easy to implement efficient compilation of the fac() definition from STG to "natural C". it's a one-hour task, maximum. the real problem is changing the whole environment, including the AST representing C code inside GHC, the RTS and so on.

JPB> iirc, the tricky part is the handling of lazyness. At any point you
JPB> may end up with a thunk (closure), which ghc handles easily by
JPB> "evaluating" it: it's always a function pointer. (That's the tagless
JPB> part) When in WHNF, it just calls a very simple function that fills
JPB> registers with the evaluated data. Otherwise, the function call
JPB> performs the reduction to WHNF.

STG attaches explicit typing information to every variable. i propose: 1) compile code that works with unboxed values to equivalent "natural C" code; 2) for boxed values, MAXIMALLY simplify the test that they are already in WHNF. the current solution leads to two jumps, which is very expensive on modern cpus. i propose to make instead one check and one conditional jump, which will be much cheaper. then, if the value is in WHNF, we just load all the needed data ourselves. if it is not in WHNF, we perform just the same operations as current GHC.
for example, the wrapper that calls the "int fac(int n, int r)" worker can be something like this:

    if ((p = n->closure) != NULL) goto eval_n;
    n_evaluated:
    if ((p = r->closure) != NULL) goto eval_r;
    r_evaluated:
    return fac(n->value, r->value);

    eval_n:
    (*p)();   // sets n->closure = NULL and n->value to n's value
    goto n_evaluated;
    eval_r:
    (*p)();
    goto r_evaluated;

JPB> If you want to use the same trick, you'll end up with the same
JPB> problems (bad C). So, you'll want to eliminate thunks statically,
JPB> finally requiring a 'high level' analysis as I suggested above.

really? :)

JPB> Also, allocation + garbage collection is extremely efficient in
JPB> current GHC... C/C++ alloc doesn't even come close. It's entirely
JPB> possible that with even a very fast backend, the better ghc allocator
JPB> will be enough to swing the balance. (see jhc)
JPB> It might be possible to re-use most of it however.

i don't propose to replace everything in the ghc backend, just the way "unboxed" code is compiled and something around it. i just don't know enough about the other parts of the ghc backend/rts :)))

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
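A runnable version of this wrapper idea, for the curious (my sketch; the field names `closure` and `value` and the thunk body are illustrative, not GHC's actual layout):

```c
#include <stddef.h>

/* A boxed Int under the proposed semi-tagging scheme:
   closure == NULL means the value field is already in WHNF. */
typedef struct BoxedInt {
    void (*closure)(struct BoxedInt *);  /* NULL once evaluated */
    int value;
} BoxedInt;

/* The unboxed worker. */
static int fac(int n, int r)
{
    while (n != 1) { r *= n; n -= 1; }
    return r;
}

/* The wrapper: one test and one conditional branch per argument,
   instead of an unconditional indirect jump into closure code. */
static int fac_wrapper(BoxedInt *n, BoxedInt *r)
{
    if (n->closure) n->closure(n);   /* evaluate; sets closure = NULL */
    if (r->closure) r->closure(r);
    return fac(n->value, r->value);
}

/* An example thunk body standing in for a real suspended computation. */
static void const_five(BoxedInt *b)
{
    b->value = 5;
    b->closure = NULL;
}

/* Demo: one unevaluated argument, one already in WHNF. */
static int demo_wrapper(void)
{
    BoxedInt n = { const_five, 0 };
    BoxedInt r = { NULL, 1 };
    return fac_wrapper(&n, &r);
}
```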

Hello Claus, Thursday, February 23, 2006, 8:56:57 PM, you wrote:
the long answer is: have you ever heard the promise that gcc is the best cpu-independent assembler? no? do you know why? because gcc is not a cpu-independent assembler. gcc was heavily optimized to produce efficient asm from the kind of code usually written by C programmers, and the code generated by ghc has nothing in common with that. so we are stuck with all these register-memory moves, non-unrolled loops and all the other results of naive compilation. gcc just doesn't know how to handle such code efficiently!
CR> would there be any advantage to targetting gcc's backend directly?
CR> I notice that Mercury seems to support this:
CR> http://www.cs.mu.oz.au/research/mercury/download/gcc-backend.html
CR> http://gcc.gnu.org/frontends.html
CR> that is, does using C as assembler disable possible optimizations,
CR> or is going through the C frontend enabling more optimizations than
CR> going to the backend directly?

i will read this. but one disadvantage is obvious: we can't use other C compilers, including the more efficient intel c/c++ (for my code it was consistently 10% faster)

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin@gmail.com wrote:
* multiple results can be returned via C++ pair template, if this is efficiently implemented in gcc
There's a -freg-struct-return option in 2.x, 3.x and 4.x. I think it's off by default on many architectures.
* recursive definitions translated into the for/while loops if possible
I think recent versions of GCC do a good job of this. See http://home.in.tum.de/~baueran/thesis/

All of this efficiency stuff aside, there's a big problem you're neglecting: GARBAGE COLLECTION. For a language that allocates as much as Haskell I think a conservative collector is absolutely out of the question, because they can't compact the heap and so allocation becomes very expensive. A copying collector has to be able to walk the stack and find all the roots, and that interacts very badly with optimization. All the languages I've seen that compile via C either aren't garbage collected or use a conservative or reference-count collector.

The closest thing I've seen to a solution is the technique used in Urban Boquist's thesis, which is to put a static table in the executable with information about register and stack usage indexed by procedure return point, and use that to unwind the stack and find roots. Some C compilers (including gcc) support debugging of optimized code, so it might be possible to parse the debugging tables and extract this information. As far as I know, no one has ever looked into the feasibility of doing this, which is kind of surprising since root-finding is probably the single biggest obstacle to compiling functional languages via C. -- Ben

Hello Ben, Friday, February 24, 2006, 2:04:26 AM, you wrote:
* multiple results can be returned via C++ pair template, if this is efficiently implemented in gcc
BRG> There's a -freg-struct-return option in 2.x, 3.x and 4.x. I think it's off
BRG> by default on many architectures.

thank you! one problem less. moreover, we can then use just plain C for our translation instead of C++
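Returning multiple results through a small struct, which -freg-struct-return lets gcc hand back in registers rather than through a hidden memory pointer, might look like this (a sketch; the pair-returning function is my stand-in for an STG function with two results):

```c
/* A two-field struct as the C-level stand-in for an STG pair.
   With -freg-struct-return gcc can return it in a register pair
   instead of writing it through a hidden result pointer. */
typedef struct { long quot; long rem; } QuotRem;

static QuotRem quot_rem(long n, long d)
{
    QuotRem result;
    result.quot = n / d;
    result.rem  = n % d;
    return result;
}
```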
* recursive definitions translated into the for/while loops if possible
BRG> I think recent versions of GCC do a good job of this. See
BRG> http://home.in.tum.de/~baueran/thesis/

there is no problem with the proper handling of tail calls. the problem is that these loops are not unrolled, generating significantly worse code than an explicit loop. you can see this in the files i attached to the original letter.

BRG> All of this efficiency stuff aside, there's a big problem you're neglecting:
BRG> GARBAGE COLLECTION. For a language that allocates as much as Haskell I think
BRG> a conservative collector is absolutely out of the question, because they
BRG> can't compact the heap and so allocation becomes very expensive. A copying
BRG> collector has to be able to walk the stack and find all the roots, and that
BRG> interacts very badly with optimization. All the languages I've seen that
BRG> compile via C either aren't garbage collected or use a conservative or
BRG> reference-count collector.

as i said, we can just use 2 stacks: one, pointed to by the EBP register, contains all boxed values; the second is the hardware stack, pointed to of course by ESP, which contains unboxed values and is managed by gcc as for any other C program. so, boxed parameters go through the EBP-pointed stack and unboxed values are passed via the usual C conventions: int fac(int n, int r). currently, EBP is used for all data and ESP is just not used. moreover, until 1999 the same two-stack scheme was used in GHC. compared to the current one-stack scheme, we will need more space for the stacks and may lose something because memory usage patterns will be slightly less localized.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hi Bulat, Wow! You make a lot of good suggestions. I think some of them are unfortunately unworkable, some others have been tried before, but there are definitely some to try. I'll suggest a few more in this email. First of all I should say that we don't *want* to use gcc as a code generator. Everything we've been doing in the back end over the last few years has been aiming towards dropping the dependency on gcc and removing the Evil Mangler. Now is not the time to be spending a lot of effort in beefing up the C back end :-) Our preference is to work on both the native and C-- back ends; the C-- back end has the potential to produce the best code and be the most portable. We also plan to keep the "unregisterised" C back end for the purposes of porting to a new platform, it's only the registerised path we want to lose. Having said all that, if we can get better code out of the C back end without too much effort, then that's certainly worthwhile.
translated to something like this
    factorial:
        _s1BD = *(Sp + 0);
        if (_s1BD != 1) goto c1C4;     // n != 1 ?
        R1 = *(Sp + 4);
        Sp = Sp + 8;
        return (*(Sp + 0))();          // return from factorial
    c1C4:
        _s1BI = _s1BD * (*(Sp + 4));   // n*r
        _s1BF = _s1BD - 1;             // n-1
        *(Sp + 4) = _s1BI;
        *(Sp + 0) = _s1BF;
        goto factorial;
To be fair, this is only the translation on an architecture that has no argument registers (x86, and currently x86_64 but hopefully not for long). The simple optimisation of changing the final tail call to factorial into a goto to a label at the beginning of the function (as suggested by John Meacham I think) should improve the code generated by gcc for functions like this, and is easy to do.
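The suggested transformation, sketched on the factorial code from earlier in the thread (my illustration of the idea, not GHC output; it also loads the loop variables into locals rather than working through Sp[], since that is what lets gcc register-allocate them):

```c
/* Replacing the tail call "goto factorial" (a jump through a code
   pointer) with a goto to a local label lets gcc see the loop
   directly and keep n and r in registers across iterations. */
static long stack2[2];

static long factorial_goto(void)
{
    long *Sp = stack2;
    long n = Sp[0];
    long r = Sp[1];
factorial:                    /* label at the start of the function body */
    if (n == 1) {
        Sp[1] = r;            /* write the result back once, at the end */
        return r;
    }
    r = n * r;
    n = n - 1;
    goto factorial;
}

static long run_factorial_goto(long n)
{
    stack2[0] = n;
    stack2[1] = 1;
    return factorial_goto();
}
```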
1) because it just not asked. you can enable gcc optimization by adding "-optc-O6" to the cmdline, but this leads to compilation errors on part of modules. it seems that gcc optimization is not compatible with "evil mangler" that is the ghc's own optimization trick.
-O3 works, I haven't tried -O6.
* C++ calling stack should be used as much as possible * parameters are passed in C++ way: "factorial(int n, int r)"
This is unworkable, I'm afraid. Go and read Simon's paper on the STG: http://citeseer.ist.psu.edu/peytonjones92implementing.html there are two main reasons we don't use the C stack: garbage collection and tail calls. What do you plan to do about GC?
* recursive definitions translated into the for/while loops if possible
Certainly a good idea.
* groups of mutually recursive definitions should be also "naturally" translated as much as possible * functions marked INLINE in Haskell code should be marked the same in C++ code
an inline function will have been inlined, so not sure what you mean here!
* maximally speed up using of already evaluated values. instead of unconditional jumping to the closure-computation function just test this field and continue computation if it contains NULL (indicating that value is already in WHNF). This change will make time penalty of boxing data structures significantly, at least 10 times, smaller
This is called semi-tagging, it was tried a long time ago. Certainly worth trying again, but I very much doubt the gains would be anything like you claim, and it increases code size. There are other schemes, including this one that we particularly like: use a spare bit in a pointer to indicate "points to an evaluated value". Since there are two spare bits, you can go further and use the 4 values to mean:

  0: possibly unevaluated
  1: evaluated, tag 0
  2: evaluated, tag 1
  3: evaluated, tag 2

I believe Robert Ennals tried this as part of his spec eval implementation, but IIRC he didn't get separate measurements for it (or it didn't improve things much). Note that this increases code size too.
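The spare-bits idea relies on heap pointers being word-aligned, so the low two bits are always zero and can carry the tag. A minimal C sketch of the encoding (my illustration of the scheme Simon describes, not GHC's implementation):

```c
#include <stdint.h>

/* Encode a small tag in the two spare low bits of an aligned
   pointer: 0 = possibly unevaluated, 1..3 = evaluated with
   constructor tag 0..2, as in the scheme described above. */
#define TAG_MASK ((uintptr_t)3)

static void *tag_ptr(void *p, unsigned tag)   /* tag in 0..3 */
{
    return (void *)((uintptr_t)p | (uintptr_t)tag);
}

static unsigned get_tag(void *p)
{
    return (unsigned)((uintptr_t)p & TAG_MASK);
}

static void *untag_ptr(void *p)
{
    return (void *)((uintptr_t)p & ~TAG_MASK);
}

/* Demo: round-trip a tag through an aligned object's address. */
static int demo_tag(void)
{
    static long cell;                 /* at least 4-byte aligned */
    void *t = tag_ptr(&cell, 2);      /* "evaluated, tag 1" */
    return get_tag(t) == 2 && untag_ptr(t) == (void *)&cell;
}
```

The "possibly unevaluated" case (tag 0) is the only one that still needs to enter the closure, so evaluated values avoid the indirect call entirely.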
* in addition to ability to define strict fields, allow to define strict data structures (say, lists) and strict types/structures of functions parameters and results. that is a long-standing goal, but the "![!Int] -> !Int" function can be used inside higher-order code a lot faster than "[Int] -> Int" one. the ultimate goal is to allow Haskell to be as fast as ML dialects in every situation and this means that laziness should be optional in every place. this will also allow to make efficient emulation of C++ virtual methods
Strict types are interesting, and I know several people (including me) who have put some thought into how they would work. It turns out to be surprisingly difficult, though. You might look into doing something simple: suppose you could declare a type to be unlifted: data !T = .. The type T doesn't have a bottom element. Its kind is !, not *, and it can't be used to instantiate a polymorphic type variable, just like GHC's primitive types. This is quite restrictive, but it's simple and pretty easy to implement in GHC as it stands. It gives you a nice guarantee: expressions of type T are never thunks. case expressions deconstructing T never need to evaluate it first. The next generalisation would be to allow ! as an arbitrary type operator, but still with the kind restriction. Dropping the kind restriction is where things start to get interesting... Cheers, Simon

Hello Simon, Friday, February 24, 2006, 12:52:36 PM, you wrote:

SM> Wow! You make a lot of good suggestions. I think some of them are
SM> unfortunately unworkable, some others have been tried before, but there
SM> are definitely some to try. I'll suggest a few more in this email.

can you comment on the unworkable/hard-to-implement ones? maybe ghc users can propose some solutions that you have skipped.

SM> First of all I should say that we don't *want* to use gcc as a code
SM> generator. Everything we've been doing in the back end over the last
SM> few years has been aiming towards dropping the dependency on gcc and
SM> removing the Evil Mangler. Now is not the time to be spending a lot of
SM> effort in beefing up the C back end :-) Our preference is to work on
SM> both the native and C-- back ends; the C-- back end has the potential to
SM> produce the best code and be the most portable. We also plan to keep
SM> the "unregisterised" C back end for the purposes of porting to a new
SM> platform, it's only the registerised path we want to lose.

i know this is the hardest part of the discussion. but sometimes we need to change our decisions. i think it's better to compare the efforts/results of the gcc way versus the C-- way starting from the current state of ghc, without counting the previous efforts :(

on the c-- side: 1) the time/effort required to implement many low-level optimizations; 2) more or less efficient code.

on the gcc side: 1) the time required to implement generation of natural C code instead of the current "portable asm"; 2) one of the best optimizers in existence.

the gcc way already exists and works, while the C-- way is still unfinished even in its basic, unoptimized state. moreover, the unregisterised C way provides the maximum possible level of portability.

SM> Having said all that, if we can get better code out of the C back end
SM> without too much effort, then that's certainly worthwhile.

moreover, many of the changes proposed in this thread are perfectly implementable even on the C-- way.
translated to something like this
factorial:
    _s1BD = *(Sp + 0);
    if (_s1BD != 1) goto c1C4;    // n != 1 ?
    R1 = *(Sp + 4);
    Sp = Sp + 8;
    return (*(Sp + 0))();         // return from factorial
c1C4:
    _s1BI = _s1BD * (*(Sp + 4));  // n*r
    _s1BF = _s1BD - 1;            // n-1
    *(Sp + 4) = _s1BI;
    *(Sp + 0) = _s1BF;
    goto factorial;
SM> To be fair, this is only the translation on an architecture that has no SM> argument registers (x86, and currently x86_64 but hopefully not for long).

this just emphasizes the main problem with the "portable asm" code generated by ghc - you can't optimize it as well as a C compiler optimizes idiomatic C code. x86 has 7 registers, after all, and gcc generates great code for this procedure - but only if it is written in an "idiomatic" manner. you will need to develop your own optimization techniques, and that will be a hard and long road compared to just generating more high-level C code.

just to compare - i've rewritten my binary serialization library two times. the first version just used string concatenation (!), the second was carefully optimized using my C experience. nevertheless, it was still slow. the final, 3rd version was optimized according to ghc's peculiarities, and the result was astonishing. "portable asm" looks great in theory, and C-- as a portable low-level language is a great theoretical idea, but reality has its own laws. want to know why my second version of the serialization library, highly optimized according to C experience, was so slow? it just returned tuples, and constructing/analyzing these lazy tuples used 80% of the time.

ghc generates not-so-fast code, and that has been known for many years. you have developed the proper theoretical way to solve this problem, but there is also a much easier backdoor :) at least, it seems so.

SM> The simple optimisation of changing the final tail call to factorial SM> into a goto to a label at the beginning of the function (as suggested by SM> John Meacham I think) should improve the code generated by gcc for SM> functions like this, and is easy to do.

the main detail is to work with local variables instead of Sp[xx]. gcc can't optimize accesses to Sp[xx]
1) because it is simply not asked to. you can enable gcc optimization by adding "-optc-O6" to the command line, but this leads to compilation errors in some modules. it seems that gcc optimization is not compatible with the "evil mangler", ghc's own optimization trick.
SM> -O3 works, I haven't tried -O6.

i've attached "optc-O2-bug.hs", which fails to compile with both -optc-O2 and -optc-O3
* the C++ calling stack should be used as much as possible
* parameters are passed in the C++ way: "factorial(int n, int r)"
SM> This is unworkable, I'm afraid. Go and read Simon's paper on the STG: SM> http://citeseer.ist.psu.edu/peytonjones92implementing.html SM> there are two main reasons we don't use the C stack: garbage collection SM> and tail calls. What do you plan to do about GC?

the minimalistic solution i see is just to use two stacks. unboxed values are passed using the C conventions, and the C compiler will be able to optimize the use of this stack. boxed values are passed using the current Sp[] machinery, and therefore the GC will not need to be changed
* groups of mutually recursive definitions should also be translated as "naturally" as possible
i mean here that such a group should be analyzed, and if there is just one entry-point function, the whole group should be compiled into one large function. a possible alternative is to compile these functions in the current manner but mark them "inline", so that gcc will make these optimizations for us. i haven't tested how this works in practice, though
* functions marked INLINE in Haskell code should be marked the same in C++ code
SM> an inline function will have been inlined, so not sure what you mean here!

only if it is not recursive! so, to make recursive functions really inline, we should either change the compilation of these functions or at least try this trick
* maximally speed up the use of already-evaluated values. instead of unconditionally jumping to the closure-evaluation function, just test this field and continue the computation if it contains NULL (indicating that the value is already in WHNF). this change would make the time penalty of boxed data structures significantly smaller, at least 10 times
SM> This is called semi-tagging, it was tried a long time ago. Certainly SM> worth trying again, but I very much doubt the gains would be anything SM> like you claim, and it increases code size.

we would replace 2 unpredictable jumps with one that will not even be executed if the value is in WHNF. at least, i'm very displeased by the slowness of lazy value evaluation in GHC. for example, my binary serialization package writes 60 mb/s when working with unboxed arrays but only 20 mb/s when working with lists.

SM> There are other schemes, including this one that we particularly like: SM> use a spare bit in a pointer to indicate "points to an evaluated value". SM> Since there are two spare bits, you can go further and use the 4 SM> values to mean:

SM> 0: possibly unevaluated
SM> 1: evaluated, tag 0
SM> 2: evaluated, tag 1
SM> 3: evaluated, tag 2

SM> I believe Robert Ennals tried this as part of his spec eval SM> implementation, but IIRC he didn't get separate measurements for it (or SM> it didn't improve things much). Note that this increases code size too.

to be exact, we have only 2^30-1 valid pointers; all other values are ready to use! :)
* in addition to the ability to define strict fields, allow defining strict data structures (say, lists) and strict types/structures for function parameters and results. that is a long-standing goal, but a "![!Int] -> !Int" function can be used inside higher-order code a lot faster than a "[Int] -> Int" one. the ultimate goal is to allow Haskell to be as fast as the ML dialects in every situation, and this means that laziness should be optional in every place. this would also allow efficient emulation of C++ virtual methods
SM> Strict types are interesting, and I know several people (including me) SM> who have put some thought into how they would work. It turns out to be SM> surprisingly difficult, though.

SM> You might look into doing something simple: suppose you could declare a SM> type to be unlifted:

SM> data !T = ..

SM> The type T doesn't have a bottom element. Its kind is !, not *, and it SM> can't be used to instantiate a polymorphic type variable, just like SM> GHC's primitive types.

SM> This is quite restrictive, but it's simple and pretty easy to implement SM> in GHC as it stands. It gives you a nice guarantee: expressions of type SM> T are never thunks. case expressions deconstructing T never need to SM> evaluate it first.

SM> The next generalisation would be to allow ! as an arbitrary type SM> operator, but still with the kind restriction. Dropping the kind SM> restriction is where things start to get interesting...

i know another way - allow compile-time polymorphism, just as in C++ with templates:

instance Num Int# where
    (+) = (+#)
    ...

mult :: (Num a) => a -> a
mult a = a+a

instance Ord Int# where
    (>) = (>#)
    ...

main = do print (mult 2 > 3)
          print (mult 2# > 3#)

we just can't mix strict types with run-time polymorphism; all other things are compilable

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
* the C++ calling stack should be used as much as possible
* parameters are passed in the C++ way: "factorial(int n, int r)"
SM> This is unworkable, I'm afraid. Go and read Simon's paper on the STG:
SM> http://citeseer.ist.psu.edu/peytonjones92implementing.html
SM> there are two main reasons we don't use the C stack: garbage collection SM> and tail calls. What do you plan to do about GC?
the minimalistic solution i see is just to use two stacks. unboxed values are passed using the C conventions, and the C compiler will be able to optimize the use of this stack. boxed values are passed using the current Sp[] machinery, and therefore the GC will not need to be changed
How do you propose to do thread switching? Incidentally, GHC had two stacks a long time ago (before 4.00). I rewrote the runtime and code generator to replace them with one stack. It was about 6 months' work, if I remember correctly. I think you misused the words "just" and "minimalistic" :-)

I know gcc does (some) tail-call optimisation these days. You don't get any *guarantee* that a call will be implemented as a tail-call, though, and the conditions are quite complex. The gcc folks treat it as an optimisation, rather than a fundamental part of the operational semantics, which is what you need for Haskell.
SM> This is called semi-tagging, it was tried a long time ago. Certainly SM> worth trying again, but I very much doubt the gains would be anything SM> like you claim, and it increases code size.
we would replace 2 unpredictable jumps with one that will not even be executed if the value is in WHNF. at least, i'm very displeased by the slowness of lazy value evaluation in GHC. for example, my binary serialization package writes 60 mb/s when working with unboxed arrays but only 20 mb/s when working with lists
You need to *try* these things and measure the difference. Quite often I'm surprised by the actual results I get from an optimisation that looks on paper to be worthwhile. Today's processors with 3-level caches are complicated beasts. Semi-tagging increases code size, so unless you can make some significant gains to make up for it, you've lost already.

Let me give you another example: GHC puts info tables next to the entry code for a closure. On paper this looks like it might be a bad idea: it pollutes the instruction cache with data that is rarely accessed during execution. It's hard to arrange (this is one thing the mangler does), so we'd like to avoid it if possible. However, doing this eliminates an indirection when entering a closure, reduces the size of info tables by one word, and reduces code size. These effects together are more beneficial than the code-cache-pollution effect - last time I measured it the difference was ~10%. (Turn off TABLES_NEXT_TO_CODE and try it yourself.)
SM> There are other schemes, including this one that we particularly like: SM> use a spare bit in a pointer to indicate "points to an evaluated value". SM> Since there are two spare bits, you can go further and use the 4 SM> values to mean:
SM> 0: possibly unevaluated
SM> 1: evaluated, tag 0
SM> 2: evaluated, tag 1
SM> 3: evaluated, tag 2
SM> I believe Robert Ennals tried this as part of his spec eval SM> implementation, but IIRC he didn't get separate measurements for it (or SM> it didn't improve things much). Note that this increases code size too.
to be exact, we have only 2^30-1 valid pointers; all other values are ready to use! :)
True, but I'm assuming you often want to store a pointer in addition to the tag (eg. for a cons cell). For enumerations, you're quite right that the whole word can be used for the tag as long as it is distinguished from a pointer. Cheers, Simon

bulat.ziganshin@gmail.com wrote:
...
as you can see here, gcc hasn't unrolled the loop and has retained all the stack<->register moves. why? the short answer is that gcc just doesn't know that the data pointed to by the Sp pointer is non-volatile - it can't be changed in the middle of the program by some async computation or exception, and isn't required to be updated immediately. maybe there is some way to tell gcc this? this would immediately SUBSTANTIALLY improve all ghc code generation
Would the C99 restrict keyword help in this situation?

I had to switch rather urgently to a new machine (an Intel-based Mac OS X), on which I need to get lhs2TeX running. I have installed the ghc version from the page:

  http://cvs.haskell.org/trac/ghc/wiki/X86OSXGhc

but when I try to compile lhs2TeX I get the following error message:

  /usr/local/bin/ghc -O -package lang --make -o lhs2TeX Main.lhs TeXCommands.lhs TeXParser.lhs Typewriter.lhs Math.lhs MathPoly.lhs MathCommon.lhs NewCode.lhs Directives.lhs HsLexer.lhs FileNameUtils.lhs Parser.lhs FiniteMap.lhs Auxiliaries.lhs StateT.lhs Document.lhs Verbatim.lhs Value.lhs Version.lhs
  ghc-6.5.20060608: unknown package: lang

which refers to a package called "lang". I looked in the list of packages at:

  http://www.haskell.org/ghc/dist/stable/docs/libraries/index.html

which does not mention a package by that name. Also at http://hackage.haskell.org/trac/ghc/wiki/GhcDarcs this package seems to be unknown. The list of packages that apparently was installed with the compiler is:

  /usr/local/lib/ghc-6.5.20060608/package.conf: ALUT-2.0, Cabal-1.1.4, GLUT-2.0, HGL-3.1, HUnit-1.1, OpenGL-2.1, QuickCheck-1.0, X11-1.1, base-1.0, fgl-5.2, (ghc-6.5.20060608), haskell-src-1.0, haskell98-1.0, mtl-1.0, network-1.0, parsec-2.0, readline-1.0, rts-1.0, stm-1.0, template-haskell-1.0, time-0.3.1, unix-1.0

So my questions are:
- what did I do wrong (I apologize for not being an able installer)
- where can I find a package by the name "lang"
- what is the next step to take

Doaitse

Hello Doaitse, Friday, July 7, 2006, 10:24:38 PM, you wrote:
which refers to a package called "lang".
it was a part of ghc 6.4, but is gone in 6.5. try to remove this package dependency from the .cabal file and see which functions are not found at compilation. after that you can search the existing packages to find which package contains these functions now.

you can download the 6.5 sources from http://www.haskell.org/ghc/dist/current/dist and look at the directory "libraries", which contains one subdir for each package

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello Bulat, Saturday, July 8, 2006, 8:00:52 AM, you wrote:
which refers to a package called "lang".
it was a part of ghc 6.4, but is gone in 6.5. try to remove this package
i've found this package's sources. all these modules are now part of the 'base' package; they just have different names, say ArrayBase -> Data.Array.Base. you should just omit this package from the .cabal file and change the import statements to mention the new hierarchical names of the same modules
I looked in the list of packages at: http://www.haskell.org/ghc/dist/stable/docs/libraries/index.html
look at http://www.haskell.org/ghc/dist/stable/docs/hslibs/index.html, which lists the old H98-style packages that use plain module names. look at the info about each module - it says which new module from the base package supersedes it

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Doaitse Swierstra
but when i try to compile lhs2TeX I get the following error message: /usr/local/bin/ghc -O -package lang --make -o lhs2TeX Main.lhs [..]
Did you try without the '-package lang'? I think 'ghc --make' is rather effective at automatically determining which packages to include. (Also, I'm not sure why you include all the source files together with --make, since --make tracks down dependent modules and compiles them automatically.) -k -- If I haven't seen further, it is by standing in the footprints of giants
participants (10)
- Ben Rudiak-Gould
- Bulat Ziganshin
- bulat.ziganshin@gmail.com
- Claus Reinke
- Doaitse Swierstra
- Jean-Philippe Bernardy
- Ken Whitegoat
- Ketil Malde
- Kevin Glynn
- Simon Marlow