
its loop unroller on this guy haven't succeeded. -funroll-loops and -funroll-all-loops don't touch it.
That's because the C produced by GHC doesn't look like a loop to GCC. This could be fixed, but given that we are moving away from -fvia-C anyway, it probably isn't worth doing.
That was one of my questions in the optimization and rewrite rules thread: shouldn't -fvia-C be supported (as a non-default option) for at least as long as the alternative isn't a clear win in all cases? If there are small changes that could make GHC-generated code more palatable to GCC's optimizer, wouldn't that be worth doing? Once -fvia-C is allowed to bitrot to the level of unoptimized bootstraps only, we might never get the good performance route back, so why not keep it in good shape as long as it offers real benefits?
The problem with low-level loop optimisations is that in general, they should be done at a low level. Core is much too early for this. To find out whether and how much to unroll a particular loop, you must take things like register pressure and instruction scheduling into account. IMO, the backend is the only reasonable place to do these optimisations.
[1] is one example reference for this argument (late unrolling). And since most compiler textbooks are oddly quiet about optimizations and their interactions, the survey [2] might also be helpful (and [3] has some pointers to more recent work).

However, I'd like to note that Core is rather different from conventional source-level code, so I would expect benefits from source-level "unrolling", too: Core retains much more of the high-level semantics, so both identifying loops and applying library-specific optimizations after unrolling are much easier here than at the basic-block level in the backend. After all, it is just the normal unfolding/inlining that forms the starting point for so many of GHC's optimizations, which just happens to be blind to recursive definitions at the moment. Recursive definitions are quite widely used in Haskell code, so this blind spot can't be good for the overall effectiveness of GHC's optimizer. If one could mark recursive bindings with a counter, to limit unfoldings according to a compiler option, generalised loop unrolling would just be a consequence of what GHC does anyway, right?

That doesn't change the point that, at the lower level, loop unrolling interacts strongly with the target architecture, and that some relevant information is not available at Core level. But it might be better to do both: Core-level unfolding (to enable further Core2Core optimizations, independent of platform, that might no longer be visible at backend level) and backend-level unrolling and re-rolling (to massage the low-level flow graph into a shape suited for the target architecture, about which the Core level has no information).
One might also expect that Core-level transformations are affected by compiler flags which users select according to their target architecture (common practice at the moment), so Core2Core isn't entirely out of that loop, either ;-) It is worth noting that there is a third level of optimizations, even after the backend, in modern hardware, as these notes [4] for an Intel compiler's unrolling option document. And since I'm collecting references, there's also Ian's TH [5] for doing the unrolling even before Core.

Claus

[1] An Aggressive Approach to Loop Unrolling, 1995
    http://citeseer.ist.psu.edu/old/620489.html
[2] Compiler Transformations for High-Performance Computing, ACM Computing Surveys, 1994
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.4885
[3] http://en.wikipedia.org/wiki/Loop_transformation
[4] http://www.intel.com/software/products/compilers/flin/docs/main_for/mergedpr...
[5] Unrolling and simplifying expressions with Template Haskell, Ian Lynagh, 2003
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.5.9813