
Peter Verswyvelen wrote: ...
I was wondering, why doesn't GHC use the GCC (or any other standard compiler) backend intermediate code? The backend of GCC generates highly optimized code no? Or is the intermediate code format of GCC (or other compilers) not suitable for Haskell? ...
My guess is that there is scope for producing assembly code that out-performs GCC. I speak as someone who has no actual knowledge of how GCC works :)

I have a theory that the back-ends of most compilers, including GCC, are an accretion of many hand-coded, hard-wired functions that perform very specific optimisations for particular instructions on particular processors. These have been developed over the years through relatively ad-hoc timings of the precise assembly code that the individual who wrote each function wanted to improve.

So what's wrong with this approach? When a new processor is introduced it needs a different set of optimisations, which take time to develop, so the compiler is sub-optimal for a while on new processors. Also, the number of processors we would like the compiler optimised for is large if we're really trying to get the last few percent out of every processor. Modern processors aren't simple things like the old 6502 or Z80, which had definite, easily predictable timings. Think different cache sizes, multiple pipelines, branch prediction etc. See "The microarchitecture of Intel and AMD CPU’s: An optimization guide for assembly programmers and compiler makers" here: http://www.agner.org/optimize/

It seems unlikely, therefore, that the code generation for every modern processor variant will be hand-optimised to account for all the small details. Even in GCC, which no doubt has a huge amount of continuous effort applied to it by many clever people, they are never going to hand-code every complexity of every processor.

Take a small example: integer multiplication. Looking at the table in the conclusion of the microarchitecture PDF, we see that different x86 processors have different relative speeds of adding and multiplying.
Something like this:

Execution latencies, typical   AMD64   P4E   PM   Core2
integer addition                 1      1     1     1
integer multiplication           3     10     4     3

We can compute 4 * 371 faster on P4E by doing 371 + 371 + 371 + 371, but on Core2 and AMD64 we should do 4 * 371, and on PM it might depend on other factors, I don't know.

Now, supposing we also allow multiplication by powers of two via shifts as well as integer addition, can we immediately say the best way of multiplying two integers on all these processors? Maybe roughly, but do we always get the absolute fastest code? For example, it will depend on the values of the integers. If we know one of them in advance, as I've assumed above, or if we know they frequently take particular values, we might handle those cases first, etc. This is just one small optimisation out of many, and the optimisations might interact.

So the inevitable conclusion is that the precise assembly code optimisations must not be hand-coded directly into the back-end of the compiler, as I expect they currently are in most compilers, but that they need to be expressed at a higher level ("repeated addition is equivalent to multiplication") and adapted down to particular processors and situations based on extensive, genetic-algorithm-guided timing runs.

Anyway, that's my wild idea thrown in the pot. Might be a nice research paper in there for someone :)

Richard.
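
P.S. Here is a minimal Haskell sketch of what I mean by expressing the rule at a higher level: the equivalence "k * x = x added k times" is stated once, and a per-processor cost table (latencies from the table above) decides which form to emit. All the names and the cost model are mine and purely illustrative; a real back-end would be far more involved.

```haskell
-- Toy cost-driven choice between a hardware multiply and repeated
-- addition, for multiplying by a known small constant k.
-- Latencies (cycles) are the "typical" figures quoted above;
-- everything else here is a hypothetical sketch, not GHC's back-end.

data Costs = Costs { cpuName :: String, addLat :: Int, mulLat :: Int }

amd64, p4e, core2 :: Costs
amd64 = Costs "AMD64" 1 3
p4e   = Costs "P4E"   1 10
core2 = Costs "Core2" 1 3

data Strategy = Multiply | RepeatedAdd deriving (Eq, Show)

-- k * x computed as repeated addition needs (k - 1) additions,
-- so prefer that form whenever it beats one multiply.
chooseStrategy :: Costs -> Int -> Strategy
chooseStrategy c k
  | (k - 1) * addLat c < mulLat c = RepeatedAdd
  | otherwise                     = Multiply

main :: IO ()
main = mapM_ report [amd64, p4e, core2]
  where
    report c = putStrLn (cpuName c ++ ": compute 4 * x via "
                         ++ show (chooseStrategy c 4))
```

For 4 * 371 this reproduces the conclusion above: on P4E three 1-cycle additions beat a 10-cycle multiply, while on AMD64 and Core2 the single multiply wins. The point is that only the cost table is processor-specific; the transformation itself is written once, and the numbers could just as well come from automated timing runs.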