inside the GHC code generator

The following link gives reasons for not generating via C: http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=4zp8kn7xe.fsf_-_%40beta.franz.com

Naturally a number of these are Common Lisp specific. However, I think that Haskell and C are quite semantically different, so going via GCC might prevent a lot of optimizations that aren't possible in C but would be in GHC.

The ocamlopt backend is supposed to produce i386 assembly that is competitive with GCC. Maybe this could be ported to GHC?

Rene.

Hello Rene,

Thursday, February 23, 2006, 4:19:15 PM, you wrote:

RdV> The following link gives reasons for not generating via C
RdV> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=4zp8kn7xe.fsf_-_%40beta.franz.com

I've read it now.

RdV> Naturally a number of these are Common Lisp specific. However, I think that
RdV> Haskell and C are quite semantically different, so going via GCC might prevent
RdV> a lot of optimizations that aren't possible in C but would be in GHC.

It seems that you don't understand the situation. GHC compiles Haskell to a language called "Core", does almost all of its optimizations at the level of that language, then translates the final result into the "STG" language, from which the C-- code is generated. Changing the translation of STG can't prevent ANY GHC optimization, although it is not so easy, because GHC code generation and the RTS are closely tied together.

RdV> The ocamlopt backend is supposed to produce i386 assembly that is
RdV> competitive with GCC. Maybe this could be ported to GHC?

Can you please show the code it generates for the accumulating factorial function?

factorial n r = if n == 1 then r else factorial (n - 1) (n * r)

I think that OCaml can't generate code better than GCC, and especially ICC (the Intel C/C++ compiler), but maybe I'm wrong ;)

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

It seems that you don't understand the situation. GHC compiles Haskell to a language called "Core", does almost all of its optimizations at the level of that language, then translates the final result into the "STG" language, from which the C-- code is generated. Changing the translation of STG can't prevent ANY GHC optimization, although it is not so easy, because GHC code generation and the RTS are closely tied together.
I should have been a bit clearer here. I meant that the optimizations available going STG -> assembler are better than those available going STG -> C -> assembler. GHC currently doesn't do most of the optimizations I am thinking of:

- Bit tagging to reduce pointer chasing and speed up pattern matching. Given memory latency and speed, it is quicker to do bit masking than memory reads.
- Parameter-passing and register-usage optimizations that rely on the structure of the RTS.
- Multiple stacks with custom frame layout.
- Dynamic code optimization, etc.
- Taking advantage of special assembler instructions and flags.

Though I have also seen comments that you can do a lot of these with GCC if you do your own stack and parameter management, i.e. don't use the C stack at all.

Your suggestions are probably better than nothing, which is probably what the alternative is (for instance, I don't have sufficient time to work on these things).

Note that I didn't say that the assembly generation of OCaml was better than GCC's, just that it was comparable.

Rene.

Hello Rene, Thursday, February 23, 2006, 5:32:21 PM, you wrote:
It seems that you don't understand the situation. GHC compiles Haskell to a language called "Core", does almost all of its optimizations at the level of that language, then translates the final result into the "STG" language, from which the C-- code is generated. Changing the translation of STG can't prevent ANY GHC optimization, although it is not so easy, because GHC code generation and the RTS are closely tied together.
RdV> I should have been a bit clearer here. I meant that the optimizations
RdV> available going STG -> assembler are better than those available going STG -> C -> assembler.

Theoretically, yes. In practice, it is hard to compete with GCC. GHC has waited 15 years for such code generation (waiting for the mountain to come to us :) and I think that the REAL way to reach GCC-level optimization is to compile to idiomatic C, instead of continuing to wait for this theoretically possible optimization to arrive.

RdV> GHC currently doesn't do most of the optimizations I am thinking of:
RdV> - Bit tagging to reduce pointer chasing and speed up pattern matching.
RdV> - Parameter-passing and register-usage optimizations that rely on the structure of the RTS.
RdV> - Multiple stacks with custom frame layout.
RdV> - Dynamic code optimization, etc.
RdV> - Taking advantage of special assembler instructions and flags.

I think that all of these things can be done the "idiomatic C" way.

RdV> Though I have also seen comments that you can do a lot of these with GCC if
RdV> you do your own stack and parameter management, i.e. don't use the C stack at all.

As I see in the current code generation, attempts to do our own stack management lead to GCC just not understanding such code and not optimizing it as well as it could. Please compare the code generated for the fac() function with all the other variants.

RdV> Your suggestions are probably better than nothing, which is probably
RdV> what the alternative is (for instance, I don't have sufficient time to work on these things).

Moreover, I'm sure that you can't compete with GCC :)

RdV> Note that I didn't say that the assembly generation of OCaml was better than
RdV> GCC's, just that it was comparable.

What does "comparable" mean? And what would we have to do to reuse this code generation in GHC? At least I know what to do to reuse GCC's code generation. Can you propose a similar plan for reusing OCaml's code generation?

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello kyra, Thursday, February 23, 2006, 5:38:54 PM, you wrote: k> Bulat Ziganshin wrote:
I think that OCaml can't generate code better than GCC, and especially ICC (the Intel C/C++ compiler), but maybe I'm wrong ;)
k> I didn't try factorial, but exponential fib in OCaml is *FASTER* than both
k> GCC and Intel C/C++ at their highest optimization levels.
I'd prefer to see the asm code. This may be because of better high-level optimization strategies (reusing fib values). The scheme I'm talking about would combine the advantages of both worlds.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
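To make "reusing fib values" concrete: such a high-level strategy would amount to memoization at the source level. A minimal Haskell sketch, not code from the thread:

fibs :: [Integer]
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)   -- each value computed once, then shared

fib :: Int -> Integer
fib n = fibs !! n

As kyra clarifies below, no such strategy was in play in the benchmark.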

Bulat Ziganshin wrote:
I'd prefer to see the asm code. This may be because of better high-level optimization strategies (reusing fib values). The scheme I'm talking about would combine the advantages of both worlds.

No strategies, just the plain exponential algorithm.
OCaml:

_camlFibo__fib_57:
	sub	esp, 8
L101:
	cmp	eax, 5
	jge	L100
	mov	eax, 3
	add	esp, 8
	ret
L100:
	mov	DWORD PTR 0[esp], eax
	add	eax, -4
	call	_camlFibo__fib_57
L102:
	mov	DWORD PTR 4[esp], eax
	mov	eax, DWORD PTR 0[esp]
	add	eax, -2
	call	_camlFibo__fib_57
L103:
	mov	ebx, DWORD PTR 4[esp]
	add	eax, ebx
	dec	eax
	add	esp, 8
	ret

Visual C++ 7.1 (next fastest):

_fib	PROC NEAR
	push	esi
	mov	esi, DWORD PTR 8[esp]
	cmp	esi, 2
	jge	SHORT $L945
	mov	eax, 1
	pop	esi
	ret	0
$L945:
	lea	eax, DWORD PTR [esi-2]
	push	edi
	push	eax
	call	_fib
	dec	esi
	push	esi
	mov	edi, eax
	call	_fib
	add	esp, 8
	add	eax, edi
	pop	edi
	pop	esi
	ret	0
_fib	ENDP

Also, Clean is *EXACTLY* in line with OCaml. This is interesting, because Clean is so similar to Haskell.
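For reference, the doubly recursive function both listings implement is presumably the standard exponential fib; in Haskell it would read:

fib :: Int -> Int
fib n = if n < 2 then 1 else fib (n - 1) + fib (n - 2)

(Reading the OCaml listing with its 2n+1 tagged integers, 5 encodes 2 and 3 encodes 1, which matches this base case.)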

Hello kyra, Friday, February 24, 2006, 12:37:02 AM, you wrote:
I'd prefer to see the asm code. This may be because of better high-level optimization strategies (reusing fib values). The scheme I'm talking about would combine the advantages of both worlds.

k> No strategies, just the plain exponential algorithm.
Yes, the OCaml compiler works better with the stack. But I'm sure that in most cases GCC will outperform OCaml, because it has a large number of optimizations which are not easy to implement (unrolling, instruction scheduling and so on).

k> Also, Clean is *EXACTLY* in line with OCaml. This is interesting,
k> because Clean is so similar to Haskell.

Clean differs from Haskell in its support for unique types and strictness annotations. The latter is slowly migrating into GHC in the form of bang patterns, but I think that is a half-solution. I mentioned in my original letter my proposals to add strictness annotations to function type declarations and to declare strict data structures, such as "![Int]".

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
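For readers following along, the "bang patterns" Bulat mentions are GHC's strictness annotation on patterns. A minimal sketch, reusing the accumulating factorial from earlier in the thread:

{-# LANGUAGE BangPatterns #-}

-- the ! forces the accumulator on every call, so no chain of
-- (n * r) thunks is built up
factorial :: Int -> Int -> Int
factorial n !r = if n == 1 then r else factorial (n - 1) (n * r)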

On 2/24/06, Bulat Ziganshin wrote:
Hello kyra,
Friday, February 24, 2006, 12:37:02 AM, you wrote:
I'd prefer to see the asm code. This may be because of better high-level optimization strategies (reusing fib values). The scheme I'm talking about would combine the advantages of both worlds.

k> No strategies, just the plain exponential algorithm.
Yes, the OCaml compiler works better with the stack. But I'm sure that in most cases GCC will outperform OCaml, because it has a large number of optimizations which are not easy to implement (unrolling, instruction scheduling and so on).
k> Also, Clean is *EXACTLY* in line with OCaml. This is interesting,
k> because Clean is so similar to Haskell.
Clean differs from Haskell in its support for unique types and strictness annotations. The latter is slowly migrating into GHC in the form of bang patterns, but I think that is a half-solution. I mentioned in my original letter my proposals to add strictness annotations to function type declarations and to declare strict data structures, such as "![Int]".
As I understand it, Clean's strictness annotations are a bit of a hack which only works on certain built-in types. Am I mistaken here?

-- Friendly, Lemmih

Hello Lemmih, Friday, February 24, 2006, 1:15:51 PM, you wrote:
Clean differs from Haskell in its support for unique types and strictness annotations. The latter is slowly migrating into GHC in the form of bang patterns, but I think that is a half-solution. I mentioned in my original letter my proposals to add strictness annotations to function type declarations and to declare strict data structures, such as "![Int]".
L> As I understand it, Clean's strictness annotations are a bit of a
L> hack which only works on certain built-in types. Am I mistaken here?

I don't know Clean very well, although I've heard rumors that it supports strict data structures. After all, this doesn't need changes in the language itself; it's just syntax sugar:

data [a] = [] | a:[a]
x :: ![Int]

translates to

data StrictList a = Nil | Cons a !(StrictList a)
x :: !(StrictList a)

... I found the following in the 6 Feb letter in the cafe by Brian Hulley:

Bulat Ziganshin wrote:
Yes, I remember SPJ's question :) "[!a]" means that the list elements are strict; it's the same as defining a new list type with strict elements and using it here. "![a]" means a "strict list"; it is the same as defining a list with the "next" field strict:
data List1 a = Nil1 | List1 !a (List1 a)
data List2 a = Nil2 | List2 a !(List2 a)
data List3 a = Nil3 | List3 !a !(List3 a)
Clean allows (AFAIK) several distinctions to be made:

1) ![a] means that the list of a's is a strict argument, just like writing !b
2) [!a] means that the list is head strict (List1 a)
3) [a!] means that the list is tail strict (List2 a)
4) [!a!] means that the list is head and tail strict (List3 a)
5) ![!a!] means that the head-and-tail-strict list argument is itself strict!!!

I think also (though I'm not entirely sure) that these distinctions are generalized to other data types by talking about element strictness and spine strictness. One motivation seems to be that in the absence of whole-program optimization, the strictness annotations on a function's type can allow the compiler to avoid creating thunks at the call site for cross-module calls, whereas using seq in the function body itself means that the thunk still has to be created at the call site, because the compiler can't possibly know that it's going to be immediately evaluated by seq.

and my own letter earlier on the 6 Feb:
foo :: !Int -> !Int
KM> (Is the second ! actually meaningful?)

Yes! It means that the function is strict in its result, i.e. it can't return an undefined value when strict arguments are given. This sort of knowledge should help a compiler "propagate" strictness and figure out the parts of a program that can be compiled as strict code. Really, I think GHC is able to infer functions with strict results just as it is able to infer strict function arguments.

KM> Personally, I think this is much nicer than sprinkling seq's around, and
KM> generally sufficient. However, there could perhaps be ambiguities?

By the way, this has just been implemented in GHC HEAD.

KM> Last time this came up, I think examples resembling these were brought up:

KM> foo :: [!a] -> ![a] -> a

Yes, I remember SPJ's question :) "[!a]" means that the list elements are strict; it's the same as defining a new list type with strict elements and using it here. "![a]" means a "strict list"; it is the same as defining a list with the "next" field strict:

data List1 a = Nil1 | List1 !a (List1 a)
data List2 a = Nil2 | List2 a !(List2 a)
data List3 a = Nil3 | List3 !a !(List3 a)

The type List3 is a simple strict list, like in any strict programming language.

foo :: [!a] -> ![a] -> ![!a] -> a

translates to

foo :: List1 a -> List2 a -> List3 a -> a

KM> foo' :: Map !Int String -> Int -> String

That means that the keys in this map are stored as strict values. For example, the following definition

type Map a b = [(a,b)]

will be instantiated to

Map !Int String ==> [(!Int, String)]

KM> Anyway, if a reasonable semantics can be formulated, I think
KM> strictness type annotations would be a great, useful, and
KM> relatively non-intrusive (AFAICT, entirely backwards compatible)
KM> addition to Haskell'.

Such a proposal already exists, and is backed by an implementation in GHC HEAD.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 2/24/06, Bulat Ziganshin wrote:
Hello Lemmih,
Friday, February 24, 2006, 1:15:51 PM, you wrote:
Clean differs from Haskell in its support for unique types and strictness annotations. The latter is slowly migrating into GHC in the form of bang patterns, but I think that is a half-solution. I mentioned in my original letter my proposals to add strictness annotations to function type declarations and to declare strict data structures, such as "![Int]".
L> As I understand it, Clean's strictness annotations are a bit of a
L> hack which only works on certain built-in types. Am I mistaken here?
I don't know Clean very well, although I've heard rumors that it supports strict data structures. After all, this doesn't need changes in the language itself; it's just syntax sugar:
data [a] = [] | a:[a]
x :: ![Int]
translates to
data StrictList a = Nil | Cons a !(StrictList a)
x :: !(StrictList a)
Let's try this:

x :: ![Int] -> Int

It would translate to something like this:

mkStrictList :: [a] -> StrictList a

x = xStrict . mkStrictList
xStrict = ...

Wouldn't it be very expensive to strictify the list?

-- Friendly, Lemmih
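A minimal sketch of the strictification step, assuming the StrictList type from above (mkStrictList itself is hypothetical, not an existing library function):

data StrictList a = Nil | Cons a !(StrictList a)

-- building the outermost Cons forces the whole spine, because every
-- tail field is strict; the entire input list is traversed up front
mkStrictList :: [a] -> StrictList a
mkStrictList = foldr Cons Nil

That up-front traversal is exactly the cost being asked about.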

Hello Lemmih, Friday, February 24, 2006, 8:55:37 PM, you wrote:
x :: ![Int]
translates to
data StrictList a = Nil | Cons a !(StrictList a)
x :: !(StrictList a)
L> Let's try this:
L>
L> x :: ![Int] -> Int
L>
L> It would translate to something like this:
L>
L> mkStrictList :: [a] -> StrictList a
L> x = xStrict . mkStrictList
L> xStrict = ...
L>
L> Wouldn't it be very expensive to strictify the list?

Yes, it would. But evaluating the list as a lazy one on EACH repetition of some loop costs much more. Just as an example: I found that the loop

replicateM_ 6666666 (putStr (replicate 15 ' '))

runs several times slower than

replicateM_ (10^8) (putChar ' ')

If the argument of putStr were evaluated only once, much of the difference would be gone. Of course, that is just a benchmark. But there are cases where I prefer to work with strict data structures and specialize my functions accordingly; typically those are the cases where time/space really is critical.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello Rene,

Thursday, February 23, 2006, 4:19:15 PM, you wrote:

RdV> The following link gives reasons for not generating via C
RdV> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=4zp8kn7xe.fsf_-_%40beta.franz.com

I've finished reading it. My question: have YOU read it? The Lisp problems have almost nothing in common with Haskell.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

From: Bulat Ziganshin
Hello Rene,
I've finished reading it. My question: have YOU read it? The Lisp problems have almost nothing in common with Haskell.
It is a long time since I read it, but some things come to mind; they are listed below.

Maybe GHC should generate better C. I am just not sure whether this will bring the best global performance (as opposed to micro-optimizations). However, doing anything else might be difficult. Generating better C might also be much more difficult for idioms that don't exist in C, and might encourage people to write Haskell that looks like C (in order to get C-like performance). I prefer idiomatic Haskell. It would be nice if this could be fast. But can idiomatic Haskell be translated to efficient C? It might not have many direct loops, for example.

-- Other notes

Integer is about 30 times slower than it needs to be on GHC if over half of your values lie between -2^31 and 2^31. On i386 you can basically test the low bit for free (it runs in parallel with the integer add instruction). This would allow only values outside this range to require calling the long-integer code. Such an optimization is not easily done in C. This encourages Haskell programmers to always use Int, even if the value might get too big, because Integer is too slow.

Also, tail recursion is more general than looping (see the sketch below). A general tail-call optimization will bring better returns than a loop optimization in GHC (though I could be wrong here). This requires special stack handling. Also not so easy in C. If only simple loops are optimized, it will encourage people to always code loops in their Haskell rather than perhaps using more appropriate constructs.

Also take the Maybe data type, with Nothing and Just ..., or any other data type with zero- and one-argument constructors. These could be represented by special values for the zero-argument case and bit marking on the single-constructor values. This could lead to good optimizations on case expressions. Also not so easy in C. The lack of this feature encourages people to encode their data types as Ints to get speed. Also not good.

Whether we can communicate the aliasing and non-aliasing properties of GHC code to the C compiler, I am also not so sure.

Also, in GHC I am not sure whether stack-based locals are the best move. It might be better to have a fixed page as register spill in some cases.

If you wish to pass and return unboxed tuples without reboxing, you will probably require a special function interface with double entry points (I think this can be done in C, but it is a bit tricky).

Summary: if only C-like Haskell is fast, we end up with Haskell that looks like C. Yuck.

Rene.
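To illustrate the point that tail recursion is more general than looping: a pair of mutually tail-recursive functions has no single-loop counterpart in C without fusing the two functions by hand. A sketch (the functions are illustrative, not from the thread):

-- a general tail-call optimization compiles both calls as jumps,
-- so this pair runs in constant stack space
even' :: Int -> Bool
even' 0 = True
even' n = odd' (n - 1)

odd' :: Int -> Bool
odd' 0 = False
odd' n = even' (n - 1)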

Hello Rene,

Thursday, February 23, 2006, 10:17:40 PM, you wrote:

RdV> Maybe GHC should generate better C. I am just not sure whether this will
RdV> bring the best global performance (as opposed to micro-optimizations).

I answered that in the original letter (search for "Cray" :)

RdV> Generating better C might also be much more difficult for idioms that don't
RdV> exist in C,

I would be glad to see concrete examples.

RdV> and might encourage people to write Haskell that looks like C
RdV> (in order to get C-like performance).

Yes, they do just that now (see the shootout entries or my Streams library) but still don't get C performance. So I think we would end up in a better position :)

RdV> I prefer idiomatic Haskell. It would be nice if this could be fast. But can
RdV> idiomatic Haskell be translated to efficient C? It might not have many
RdV> direct loops, for example.

Sorry, I don't understand what you mean. In general, idiomatic Haskell has a big efficiency problem, and that problem is laziness. Strict Haskell code, like ML code, can be compiled efficiently.

RdV> Integer is about 30 times slower than it needs to be on GHC if over half of
RdV> your values lie between -2^31 and 2^31. On i386 you can basically test
RdV> the low bit for free (it runs in parallel with the integer add instruction).
RdV> This would allow only values outside this range to require calling the
RdV> long-integer code. Such an optimization is not easily done in C.
RdV> This encourages Haskell programmers to always use Int, even if the value
RdV> might get too big, because Integer is too slow.

This optimization has not been implemented in GHC up to now, and I think the amount of work required to implement it is bigger than the demand for faster Integer processing. Moreover, I hope a good optimizing C compiler can avoid the additional testing.

RdV> Also, tail recursion is more general than looping. A general tail-call
RdV> optimization will bring better returns than a loop optimization in GHC
RdV> (though I could be wrong here). This requires special stack handling. Also
RdV> not so easy in C.

It seems you haven't seen the attached files; tail calls are optimized in GCC.

RdV> If only simple loops are optimized, it will encourage people to always code
RdV> loops in their Haskell rather than perhaps using more appropriate
RdV> constructs.

Would you prefer that people have no option to write fast code at all? :) Also, I want to emphasize that C loops are Haskell recursion. We have always said that these two constructs are equivalent; now it's time to generate code with the same speed.

RdV> Also take the Maybe data type, with Nothing and Just ..., or any other
RdV> data type with zero- and one-argument constructors. These could be represented
RdV> by special values for the zero-argument case and bit marking on the
RdV> single-constructor values. This could lead to good optimizations on case
RdV> expressions. Also not so easy in C.

1) It is far from GHC's current optimization facilities. 2) I don't see a problem with using if().

RdV> The lack of this feature encourages people to encode their data types as
RdV> Ints to get speed. Also not good.

RdV> Whether we can communicate the aliasing and non-aliasing properties of GHC
RdV> code to the C compiler, I am also not so sure.

Concrete examples?

RdV> Also, in GHC I am not sure whether stack-based locals are the best move. It
RdV> might be better to have a fixed page as register spill in some cases.
This is an optimization that GCC can handle much better :) Just look at the code generated for the fac() function.

RdV> If you wish to pass and return unboxed tuples without reboxing, you will
RdV> probably require a special function interface with double entry points (I
RdV> think this can be done in C, but it is a bit tricky).

As I said, probably a pair can be used, but I haven't tested this yet.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

From: Bulat Ziganshin
I answered that in the original letter (search for "Cray" :)
Re-reading this, I see that you have well-defined goals that cover most of my points.
It seems you haven't seen the attached files; tail calls are optimized in GCC.
No, I don't see any attached files (on any of your emails). Best of luck with your optimisations; I look forward to using them. (Although I don't like the style of the shootout code, I find it very useful in helping me speed up my own code, and yes, I find it better to write Haskell in such a style than to use the FFI and call C.)

Rene.

Rene de Visser wrote:
Integer is about 30 times slower than it needs to be on GHC if over half of your values lie between -2^31 and 2^31. On i386 you can basically test the low bit for free (it runs in parallel with the integer add instruction). This would allow only values outside this range to require calling the long-integer code. Such an optimization is not easily done in C.
Currently GHC defines

data Integer = S# Int# | J# Int# ByteArray#

so it already avoids using GMP for small integers. There's a "will this multiplication overflow" primitive, but I'm not sure how it's implemented.

I think that changing this to use magic pointers would be pretty easy. You'd need to introduce a new primitive type, say Int31#, and then:

1. anytime you previously constructed a WHNF S# on the heap, make a magic pointer instead
2. anytime you dereference a pointer that might be an S#, check for a magic pointer first.

Even if a lot of code needs to be changed, it's straightforward, because the changes are local. You're just changing the meaning of a pointer such that there's a statically allocated S# n at address 2n+1.

It would also be worth trying this for Char#, which is already a 31-bit type, to see if it would speed up text-processing code.
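At the boxed level, the small/large dispatch looks something like the following sketch; SmartInt and addSmart are hypothetical illustrations (the real S#/J# code works on unboxed values, and with magic pointers the small case would not touch the heap at all):

-- hypothetical boxed model of the S#/J# split
data SmartInt = Small !Int      -- value fits in a machine word
              | Big   Integer   -- stand-in for the GMP-backed J# case

addSmart :: SmartInt -> SmartInt -> SmartInt
addSmart (Small a) (Small b)
  | big == toInteger small = Small small   -- no overflow: stay small
  | otherwise              = Big big       -- overflowed: fall back
  where
    small = a + b                          -- Int addition, wraps on overflow
    big   = toInteger a + toInteger b      -- exact
addSmart x y = Big (toBig x + toBig y)
  where
    toBig (Small n) = toInteger n
    toBig (Big n)   = n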
If only simple loops are optimized, it will encourage people to always code loops in their Haskell rather than perhaps using more appropriate constructs.
You could say that about any language. When your code needs to be fast it needs to be fast. I'd rather write ugly Haskell code than ugly C code, if it comes to that.
Also take the Maybe data type, with Nothing and Just ..., or any other data type with zero- and one-argument constructors. These could be represented by special values for the zero-argument case and bit marking on the single-constructor values.
I'm not sure this would help much. Nullary constructors are already allocated statically, so they're already represented by special values. You could check for those values instead of dereferencing the pointer, but in time-critical code the static object will presumably be in L1 cache anyway. I could be wrong of course, and it would be easy enough to extend the Int31# idea to nullary constructors (Int0#). -- Ben
participants (5)

- Ben Rudiak-Gould
- Bulat Ziganshin
- kyra
- Lemmih
- Rene de Visser