
If I understand it correctly, the GHC compiler either directly generates machine code, or it uses C as an intermediate language. I also read somewhere that C is not the most efficient intermediate representation for functional languages, and that one gets better performance when generating native machine code. However, from the discussions on this list, I gather that the low-level machine code optimizer of GHC is far from state-of-the-art compared to industry-standard C/C++ compilers.

I was wondering: why doesn't GHC use the GCC (or any other standard compiler) backend intermediate code? The backend of GCC generates highly optimized code, no? Or is the intermediate code format of GCC (or other compilers) not suitable for Haskell?

Another question regarding the backend: a cool feature of the Microsoft Visual C++ (MVC) compiler is its ability to perform "LTCG" (link-time code generation), i.e. whole-program optimization. Is something like this possible with Haskell (or via the GCC backend)? Would it be possible to feed all the generated C code of the GHC compiler into the MVC compiler, to generate one big MVC/LTCG-generated executable? It would be interesting to see how much the whole-program optimization approach (which can do analysis on the program as if it were a single module) would improve performance...

Cheers, Peter

Hi
If I understand it correctly, the GHC compiler either directly generates machine code, or it uses C as an intermediate language.
I also read somewhere that C is not the most efficient intermediate representation for functional languages, and that one gets better performance when generating native machine code.
However, from the discussions on this list, I gather that the low-level machine code optimizer of GHC is far from state-of-the-art compared to industry-standard C/C++ compilers.
These all seem fair comments, based on my understanding. One other thing to remember is that the GHC people know this, and much work has been done, and will continue to be done until the back end is up to the standard of the front end.
I was wondering: why doesn't GHC use the GCC (or any other standard compiler) backend intermediate code? The backend of GCC generates highly optimized code, no? Or is the intermediate code format of GCC (or other compilers) not suitable for Haskell?
GCC is optimised for dealing with code that comes from C, and the back end language is much like C. GCC is also not really set up to be used by bolting different front ends onto different back ends - part of this is a license issue: if the front and back ends were well split and available separately, you could write a commercial backend onto a GPL front end.
Another question regarding the backend: a cool feature of the Microsoft Visual C++ (MVC) compiler is its ability to perform "LTCG" (link-time code generation), i.e. whole-program optimization. Is something like this possible with Haskell (or via the GCC backend)? Would it be possible to feed all the generated C code of the GHC compiler into the MVC compiler, to generate one big MVC/LTCG-generated executable? It would be interesting to see how much the whole-program optimization approach (which can do analysis on the program as if it were a single module) would improve performance...
You would get much better improvements from whole-program optimisation at the Haskell level. By the time you get to the generated C you have obfuscated all the structure of the program, and are unlikely to benefit that much. Thanks, Neil

Neil Mitchell wrote:
GCC is optimised for dealing with code that comes from C, and the back end language is much like C. GCC is also not really set up to be used by bolting different front ends onto different back ends - part of this is a license issue: if the front and back ends were well split and available separately, you could write a commercial backend onto a GPL front end.
Well, I don't know about the licensing, but according to http://en.wikipedia.org/wiki/GNU_Compiler_Collection#Front_ends, a new cleaner intermediate language was created in 2005 for GCC, which might be more "general"?
You would get much better improvements from whole-program optimisation at the Haskell level. By the time you get to the generated C you have obfuscated all the structure of the program, and are unlikely to benefit that much.
Yes, that would be great! Or even at JIT time ;-) But, given the time GHC takes to optimize a single module, I guess it would not really be practical to let those optimizations take place on the full program? It would take ages, no?

I just wanted to see if it is *possible* to feed all the C code from GHC into a C compiler, and then generate a single executable from that C code (including the GHC runtime). For example, for the XBOX devkit, Microsoft shipped LTCG versions of all the core libraries, which gave a nice performance improvement for free (if I'm not mistaken, this gives 10% to 15% faster code). Actually this LTCG thingy was first supported by the Intel compiler, I believe. Thanks, Peter

Peter Verswyvelen wrote:
Well, I don't know about the licensing, but according to http://en.wikipedia.org/wiki/GNU_Compiler_Collection#Front_ends, a new cleaner intermediate language was created in 2005 for GCC, which might be more "general"?
It's still very difficult to work with GCC from the perspective of an external tool. Its current IR is still targeted towards the languages it currently supports and the needs of its back end, so it omits a fair bit of information that third-party tools would find useful.
I just wanted to see if it is *possible* to feed all the C code from GHC into a C compiler, and then generate a single executable from that C code (including the GHC runtime).
It would be a fair bit of work. My recollection is that GHC "knows" that it is talking to GCC when you go through -fvia-c, so it uses some gcc extensions to handle things like register reservation. A few compilers support some of those gcc extensions, so you might have some level of success with those. The Intel and PathScale compilers do, for example, and LLVM's clang front end has some level of gcc compatibility already. In the case of the PathScale compiler (which I used to work on), when invoked in whole-program mode it emits files that look like regular .o files, but are in fact just the serialised IR for a compilation unit. It's also open source, so it would be easier to tinker with than the Intel compiler.
Actually this LTCG thingy was first supported by the Intel compiler I believe.
No, that kind of whole-program optimisation long predates its inclusion in the Intel compiler. The earliest I saw it in deployment was in MIPS compilers circa 1992, and I don't recall it being a new idea at the time. Anyway, the point about GCC being an unsuitable vehicle for this sort of thing remains. You'd do better looking at LLVM, for which I've lately been working on nice Haskell bindings: darcs get http://darcs.serpentine.com/llvm

Peter Verswyvelen wrote:
Another question regarding the backend: a cool feature of the Microsoft Visual C++ (MVC) compiler is its ability to perform "LTCG" (link-time code generation), i.e. whole-program optimization. Is something like this possible with Haskell (or via the GCC backend)? Would it be possible to feed all the generated C code of the GHC compiler into the MVC compiler, to generate one big MVC/LTCG-generated executable? It would be interesting to see how much the whole-program optimization approach (which can do analysis on the program as if it were a single module) would improve performance...
jhc does that. Or rather, it doesn't compile Haskell modules separately but throws them together (presumably) in the first pass.

Regarding C, it's actually only a high-level assembly language, with compiler optimisations only kicking in heavily if you use the corresponding idioms... say inlining, loop unrolling. If your compiler backend outputs something that is already close to some generalisation of different assembly languages, containing no superfluous code or compile-time-evaluable expressions and looking generally like a mess of mov's and jmp's, nothing much more than optimising register allocation and pipeline usage can be done, which are both highly processor-specific bastards, and likely to be hard to write better than <insert favourite C compiler>. I figure it's being lazy in the right way(TM).

On Tue, Jan 01, 2008 at 06:44:46PM +0100, Achim Schneider wrote:
Peter Verswyvelen wrote:
Another question regarding the backend: a cool feature of the Microsoft Visual C++ (MVC) compiler is its ability to perform "LTCG" (link-time code generation), i.e. whole-program optimization. Is something like this possible with Haskell (or via the GCC backend)? Would it be possible to feed all the generated C code of the GHC compiler into the MVC compiler, to generate one big MVC/LTCG-generated executable? It would be interesting to see how much the whole-program optimization approach (which can do analysis on the program as if it were a single module) would improve performance...
jhc does that. Or rather, it doesn't compile Haskell modules separately but throws them together (presumably) in the first pass.
It optimizes everything separately to a level that is analogous to ghc core (which I call, oddly enough, jhc core). It then collects everything together and translates it into grin, which doesn't have an exact equivalent in ghc, but is equivalent to a first-order, strict, monadic functional language, and analogous enough to the SSA intermediate form of imperative compilers that one would find it quite comfy were one used to that.

I would like to move jhc to more of a 'link-time-code-generation' model though, if I understand what you mean; right now jhc does a full monolithic compilation, which is pretty resource intensive. I would like to switch to more of an 'incremental whole program compilation' (no, that isn't a contradiction), whereby individual files will compile to .o files, but then before the final linking a special pass will be done to generate the appropriate specialized "glue" for putting the .o files together. This should give a nice middle ground between fully monolithic compilation and the limitations of abiding by the standard C linker.

John
-- John Meacham - ⑆repetae.net⑆john⑈

John Meacham wrote:
I would like to move jhc to more of a 'link-time-code-generation' model though, if I understand what you mean;
Yes, LTCG is Microsoft's terminology. See http://msdn.microsoft.com/msdnmag/issues/02/05/Hood
right now jhc does a full monolithic compilation, which is pretty resource intensive.
Well, M$ LTCG is also resource intensive, but it's still a lot faster than the monolithic approach would be.
I would like to switch to more of an 'incremental whole program compilation' (no, that isn't a contradiction), whereby individual files will compile to .o files, but then before the final linking a special pass will be done to generate the appropriate specialized "glue" for putting the .o files together. This should give a nice middle ground between fully monolithic compilation and the limitations of abiding by the standard C linker.
That would be a bit like what http://llvm.org does for imperative languages? If your compiler (pretty amazing job btw) does whole program optimization, can it remove the dictionary (aka v-table in C/C++ parlance?) overhead of type classes? Because if I understand it correctly, unless one uses existential types, a dictionary can be compiled away since the type in question is known at compile time? Cheers, Peter
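As an aside, to make the "dictionary" terminology concrete, here is a minimal sketch of the usual dictionary-passing translation (the names ShowDict, showD and describe are invented for illustration; GHC's actual Core representation differs in the details):

    -- a class such as
    --     class Show a where show :: a -> String
    -- roughly becomes a record of its methods:
    data ShowDict a = ShowDict { showD :: a -> String }

    -- an instance becomes a value of that record type:
    showIntDict :: ShowDict Int
    showIntDict = ShowDict { showD = show }

    -- a constrained function receives the dictionary as a hidden extra argument:
    describe :: ShowDict a -> a -> String
    describe dict x = "value: " ++ showD dict x

    -- where the type is known at the call site, the compiler can pass the
    -- concrete dictionary and inline it, removing the indirection entirely:
    example :: String
    example = describe showIntDict (42 :: Int)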

On Monday 07 January 2008 20:27:17 Peter Verswyvelen wrote:
If your compiler (pretty amazing job btw) does whole program optimization, can it remove the dictionary (aka v-table in C/C++ parlance?) overhead of type classes? Because if I understand it correctly, unless one uses existential types, a dictionary can be compiled away since the type in question is known at compile time?
Yes. This is essentially what F# does. -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/products/?e

On Mon, 2008-01-07 at 20:26 +0000, Jon Harrop wrote:
On Monday 07 January 2008 20:27:17 Peter Verswyvelen wrote:
If your compiler (pretty amazing job btw) does whole program optimization, can it remove the dictionary (aka v-table in C/C++ parlance?) overhead of type classes? Because if I understand it correctly, unless one uses existential types, a dictionary can be compiled away since the type in question is known at compile time?
Yes. This is essentially what F# does.
(This reply is aimed primarily at Peter.) Even in Haskell 98 it is not quite possible to do this. You can get rid of the vast majority of them, but the type is not always known at compile-time. There is one situation where the types depend on values. This occurs when you utilize polymorphic recursion. Polymorphic recursion is one of the things the Haskell standard wasn't conservative about. I'm not sure if any widely used language supports it besides Haskell. (Maybe Clean?) This turned out to be rather fortunate, as nested data types effectively require it. A toy example which illustrates this problem is:

    f :: Show a => Int -> a -> String   -- type signature required
    f 0 x = show x
    f n x = f (n-1) (x,x)

The dictionary to use depends on the input. There are (well, switching to Integer) an infinite number of potential types, so being able to statically resolve them would not help. Some kind of run-time witness will need to be passed around.
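To see why no amount of static specialisation helps here, consider how the type at which show is needed grows with the run-time value of the first argument (a quick illustration, assuming the definition of f above):

    -- f 1 'x' needs Show (Char, Char),
    -- f 2 'x' needs Show ((Char, Char), (Char, Char)),
    -- f 3 'x' needs Show of a pair of pairs of pairs, and so on:
    -- no finite set of specialisations covers every possible call.
    main :: IO ()
    main = putStrLn (f 2 'x')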

Jon Harrop wrote:
On Monday 07 January 2008 20:27:17 Peter Verswyvelen wrote:
If your compiler (pretty amazing job btw) does whole program optimization, can it remove the dictionary (aka v-table in C/C++ parlance?) overhead of type classes? Because if I understand it correctly, unless one uses existential types, a dictionary can be compiled away since the type in question is known at compile time?
Yes. This is essentially what F# does.
Hang on. When I asked a similar question in the F# forum, it seemed that F# simply does not have type classes (but it does support .NET interfaces), and root functions are only fully polymorphic when you mark them *inline*. For example, the following will not work in the current version of F#:

    let add x y = x + y
    let t1 = add 1 2
    let t2 = add 1.0 2.0

The first usage of add forces it to work on integers, and then its type is settled once and for all. If I understood it correctly, this is because the "root polymorphic" add function can only work on specific types because it gets compiled into an object file. You can force it to work on any type on which + can be applied by forcing the function to be inline, like

    let inline add x y = x + y
    let t1 = add 1 2
    let t2 = add 1.0 2.0

Then I guess the code will not be part of the object file, but it will be inserted in place whenever it gets used (which could result in code bloat?). Can the same trick be used in GHC?

Regarding Derek Elkins' excellent example

    f :: Show a => Int -> a -> String   -- type signature required
    f 0 x = show x
    f n x = f (n-1) (x,x)

this indeed cannot be inlined or optimized away, since you generally don't know the value of the first parameter at compile time. But this is a very specific case, no? According to what I interpret from John Meacham's email, this case is an exception, and in many cases type class dictionary overhead gets compiled away? Thanks, Peter
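Regarding the question above of whether the same trick can be used in GHC: GHC keeps add fully polymorphic and passes a Num dictionary at run time, and you can ask for dictionary-free copies at particular types with SPECIALIZE, or for call-site expansion with INLINE. A minimal sketch (the particular types chosen are just illustrative):

    add :: Num a => a -> a -> a
    add x y = x + y
    {-# SPECIALIZE add :: Int -> Int -> Int #-}
    {-# SPECIALIZE add :: Double -> Double -> Double #-}

    -- alternatively, {-# INLINE add #-} asks GHC to expand add at its call
    -- sites, much like F#'s 'inline' (with similar code-size trade-offs).

    -- unlike the F# example, both uses below compile without any annotation,
    -- because add stays polymorphic; the pragmas only affect performance:
    t1 :: Int
    t1 = add 1 2

    t2 :: Double
    t2 = add 1.0 2.0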

On Tue, 2008-01-08 at 11:33 +0100, Peter Verswyvelen wrote:
Jon Harrop wrote:
On Monday 07 January 2008 20:27:17 Peter Verswyvelen wrote:
If your compiler (pretty amazing job btw) does whole program optimization, can it remove the dictionary (aka v-table in C/C++ parlance?) overhead of type classes? Because if I understand it correctly, unless one uses existential types, a dictionary can be compiled away since the type in question is known at compile time?
Yes. This is essentially what F# does.
Hang on.
When I asked a similar question in the F# forum, it seemed that F# simply does not have type classes (but it does support .NET interfaces), and root functions are only fully polymorphic when you mark them *inline*.
For example, the following will not work in the current version of F#:
    let add x y = x + y
    let t1 = add 1 2
    let t2 = add 1.0 2.0
The first usage of add forces it to work on integers, and then its type is settled once and for all. If I understood it correctly, this is because the "root polymorphic" add function can only work on specific types because it gets compiled into an object file.
You can force it to work on any type on which + can be applied by forcing the function to be inline, like
    let inline add x y = x + y
    let t1 = add 1 2
    let t2 = add 1.0 2.0
Then I guess the code will not be part of the object file, but it will be inserted in place whenever it gets used (which could result in code bloat?)
Can the same trick be used in GHC?
Regarding Derek Elkins' excellent example
    f :: Show a => Int -> a -> String   -- type signature required
    f 0 x = show x
    f n x = f (n-1) (x,x)
this indeed cannot be inlined or optimized away, since you generally don't know the value of the first parameter at compile time. But this is a very specific case, no? According to what I interpret from John Meacham's email, this case is an exception, and in many cases type class dictionary overhead gets compiled away?
Quoting my previous email with added emphasis, "Even in Haskell 98 it is not quite possible to do this. <<You can get rid of the vast majority of them>>, but the type is not always known at compile-time. There is one situation where the types depend on values. This occurs when you utilize polymorphic recursion." The point of my email was that it is not possible to -completely- eliminate dictionaries (or some other run-time witness) in general.

On Mon, Jan 07, 2008 at 09:27:17PM +0100, Peter Verswyvelen wrote:
If your compiler (pretty amazing job btw) does whole program optimization, can it remove the dictionary (aka v-table in C/C++ parlance?) overhead of type classes? Because if I understand it correctly, unless one uses existential types, a dictionary can be compiled away since the type in question is known at compile time?
Both jhc and ghc can completely remove the dictionary parameter whenever the type is known at compile time, because an implicit SPECIALIZATION rule is generated that gets triggered whenever an overloaded function is called on a known type.

Jhc can do a bit more though when the type is not fully known at compile time. It has a fairly interesting implementation of type classes: rather than turning each type class parameter into a dictionary of functions, it passes a reification of the type itself as a single argument whenever type class arguments are present (and used). This means that the generated code looks something like

    (+) :: (a::*) -> a -> a -> a
    (+) t x y = case t of
        Int     -> primIntAdd x y
        Integer -> primIntegerAdd x y
        ...

Standard optimizations can then eliminate and coalesce the type scrutinizations.

John
-- John Meacham - ⑆repetae.net⑆john⑈
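In source-level Haskell terms the idea is roughly the following (an illustrative sketch only, using GADTs and invented names such as TypeWitness; jhc's actual core and primitives differ):

    {-# LANGUAGE GADTs #-}

    -- the single extra argument is a run-time reification of the type:
    data TypeWitness a where
      IntW     :: TypeWitness Int
      IntegerW :: TypeWitness Integer

    -- the overloaded function scrutinises the witness to select the method,
    -- instead of projecting a function out of a dictionary record:
    plus :: TypeWitness a -> a -> a -> a
    plus IntW     x y = x + y   -- stands in for primIntAdd
    plus IntegerW x y = x + y   -- stands in for primIntegerAdd

    -- once the witness is statically known (here IntW), ordinary
    -- optimisations can resolve the case and remove the indirection:
    example :: Int
    example = plus IntW 2 3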

Peter Verswyvelen wrote: ...
I was wondering: why doesn't GHC use the GCC (or any other standard compiler) backend intermediate code? The backend of GCC generates highly optimized code, no? Or is the intermediate code format of GCC (or other compilers) not suitable for Haskell? ...
My guess is that there is scope for producing assembly code that out-performs GCC. I speak as someone who has no actual knowledge of how GCC works :)

I have a theory that the back-ends of most compilers, including GCC, are an accretion of many hand-coded, hard-coded functions that perform very specific optimisations for particular instructions on particular processors. These have been developed over the years by relatively ad-hoc timings of the precise assembly code the individual who wrote the function wanted to improve.

So what's wrong with this approach? When a new processor is introduced it needs a different set of optimisations, which take time to develop, which means the compiler is sub-optimal for a while on new processors. Also, the number of processors we would like the compiler optimised for is large if we're really trying to get the last few percent from every processor. Modern processors aren't simple things like the old 6502 or Z80, which had definite, easily predictable timings. Think different cache sizes, multiple pipelines, branch prediction etc. See "The microarchitecture of Intel and AMD CPU’s: An optimization guide for assembly programmers and compiler makers" here: http://www.agner.org/optimize/

It seems unlikely, therefore, that the code generation for every modern processor variant will be hand-optimised to account for all the small details. Even in GCC, which no doubt has a huge amount of continuous effort applied to it by many clever people, they are never going to hand-code every complexity of every processor.

Take a small example: integer multiplication. Looking at the table in the conclusion of the microarchitecture PDF, we see that different x86 processors have different relative speeds of adding and multiplying. Something like this:

    Execution latencies, typical   AMD64   P4E   PM   Core2
    integer addition                   1     1    1       1
    integer multiplication             3    10    4       3

We can compute 4 * 371 faster on a P4E by doing 371 + 371 + 371 + 371, but on Core2 and AMD64 we should do 4 * 371, and on PM it might depend on other factors, I don't know.

Now, supposing we allow powers of two by shifting as well as integer addition, can we immediately say the best way of multiplying two integers on all these processors? Maybe roughly, but do we always get the absolute fastest code? For example, it will depend on the values of the integers. If we know one of them in advance, as I've assumed above, or if we know they frequently have particular values, we might handle those cases first, etc. This is just one small optimisation out of many. And the optimisations might interact.

So the inevitable conclusion is that the precise assembly code optimisations must not be hand-coded directly into the back-end of the compiler, as I expect they currently are in most compilers, but that they need to be expressed at a higher level ("repeated addition is equivalent to multiplication") and adapted down to particular processors and situations based on extensive, genetic-algorithm-guided timing runs.

Anyway, that's my wild idea thrown in the pot. Might be a nice research paper in there for someone :)

Richard.
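As a tiny Haskell-level illustration of the "repeated addition (or shifting) is equivalent to multiplication" rewrite discussed above (which version wins depends on the latencies in the table; the real choice would be made by the code generator per target processor, not written in source code):

    import Data.Bits (shiftL)

    -- three equivalent ways to compute 4 * n:
    timesFourMul, timesFourShift, timesFourAdd :: Int -> Int
    timesFourMul   n = 4 * n            -- one integer multiplication
    timesFourShift n = n `shiftL` 2     -- one shift
    timesFourAdd   n = n + n + n + n    -- three additions (cheaper on a P4E, say)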
participants (8):
- Achim Schneider
- Bryan O'Sullivan
- Derek Elkins
- John Meacham
- Jon Harrop
- Neil Mitchell
- Peter Verswyvelen
- Richard Kelsall