Understanding core2core optimisation pipeline

Devs,

I'm trying to understand how the core2core pipeline works. Sadly, we don't have a wiki page about this, so the only sources of information are the papers and the source code. The papers give plenty of detail about each transformation separately, but none of them gives a comprehensive and up-to-date overview of how the whole pipeline is structured. My questions are based on reading the user documentation and the following papers:

[1] "The Glasgow Haskell Compiler", in The Architecture of Open Source Applications, vol. 2
[2] "Compilation by Transformation in Non-Strict Functional Languages", PhD thesis by Santos
[3] "Secrets of the Glasgow Haskell Compiler inliner"
[4] "A transformation-based optimiser for Haskell"
[5] "Modular, Higher-Order Cardinality Analysis in Theory and Practice"
[6] "Let-floating: moving bindings to give faster programs"
[7] "Playing by the Rules: Rewriting as a practical optimisation technique in GHC"

I know there are several papers missing from this list, e.g. "Constructed Product Result Analysis for Haskell" or "Call-pattern specialisation for Haskell programs". The reason is that these optimisations are beyond the scope of what I'm doing at the moment (or so I believe).

This mail basically asks just one question: what is the order of optimisations performed on Core? Since this question is pretty big and general, I've separated it into smaller questions that arose from reading the above papers, the documentation, and experimenting with GHC. Now the detailed questions:

1. What is the difference between a "simplifier iteration" and a "simplifier phase"? Section 7.20.6.5 of the user guide mentions phases, but I believe that iterations are not explained anywhere. My best guess, expressed in pseudo-code, is this (sorry about the imperative style):

    foreach (i in iterations) {
        // some optimisations here?
        foreach (p in phases) {
            // ...optimisations here
        }
        // some optimisations here?
    }

1a. What is the default maximum iteration count? The user documentation does not specify that.

2. How can I observe the effects of `-ddump-simpl-phases`? I tried compiling several different programs and this flag seems to have no effect (i.e. nothing gets printed).

3. Cardinality analysis and inlining: cardinality analysis can determine that a let binding is used exactly once. Can the inliner reuse this information from the cardinality analysis, or does it recompute it per [3], section 3.1?

4. I've compiled a sample program using `-dverbose-core2core` and got the following phases:

- Desugar (after optimization)
- Simplifier (Phase = InitialPhase [Gentle])
- Specialise
- Levels added
- Float out
- Float inwards
- Simplifier (Phase = 2 [main])
- Simplifier (Phase = 1 [main])
- Simplifier (Phase = 0 [main])
- Demand analysis
- Worker Wrapper binds
- Simplifier (Phase = 0 [post-worker-wrapper])
- Levels added
- Float out
- Common sub-expression
- Float inwards
- Simplifier (Phase = 0 [final])
- Tidy Core
- CorePrep

This raises lots of questions:

4a. The first phase is "Desugar (after optimization)". What optimisations are performed during desugaring?

4b. I'm not sure whether I'm looking at a single iteration of the core2core transformation or at multiple ones. Some passes are performed several times (Float out, Float inwards), which suggests that there might be many iterations here. On the other hand, simplifier phases are decreasing towards 0, which looks as if it was one core2core iteration. My assumption here is that every time a new core2core iteration starts, the simplifier phases are counted anew from 2 towards 0. Is that correct?

4c. Why are there several 0 phases of the Simplifier? I find it confusing.

4d. I understand that some passes can be enabled or disabled using command-line options. Can the decision to run some passes be made dynamically by the compiler (e.g. to run extra simplifier passes)?

4e. Are there more phases that could appear here, i.e. ones that were omitted with -O?

4f. The "Levels added" pass before the "Float out" pass: my guess is that this is preparation for the full laziness transform. So, is full laziness performed as part of the "Float out" pass?

A general note is that I am confused by many Simplifier passes being interleaved with other passes. I expected that simplifier phases would be grouped into a single pass, as speculated in question 1.

5. What optimisations *exactly* are performed by the Simplifier? I assume most of what's described in chapter 3 of [2]: beta reduction, let elimination, case elimination, case floating, constant folding, and eta expansion. I'm not sure about floating lets outwards and inwards - [1], pg. 7, says these are in a pass separate from the simplifier. `-dverbose-core2core` seems to confirm that, since it reveals separate "Float out" and "Float inwards" passes.

6. [4], pg. 31, mentions the Deforestation optimisation. Is everything described in that "Deforestation" section subsumed by cardinality analysis ([5], end of section 2.1 and section 7.1)? If not, when is deforestation performed?

7. [5], section 6.1 says: "We run the analysis twice, once in the middle of the optimisation pipeline and once near the end". When exactly in the middle of the pipeline? Between which passes? This does not show up with `-dverbose-core2core` (or at least it is not explicitly named).

8. How does the rules rewriting fit into the picture? Section 7.20.6.5 of the User Guide and section 4.1 of [7] explain the interaction between rules and inlining, and my guess is that both are performed by the Simplifier. Again, the "simplifier phases/iterations" distinction puzzles me as to what exactly is happening when. Within a single phase, does the inlining happen before rewriting or vice versa?

I know that all of the above questions can be answered by looking at the source code for sufficiently long. This is actually what I'm planning to do next, but if anyone could help me by answering some of these questions, it would certainly save me some time. My plan is to gather up the answers on a wiki page.

Janek

| My plan is to gather up the answers on a wiki page.

Excellent -- please do that! My replies are below.

Simon

| This mail basically asks just one question: what is the order of
| optimisations performed on Core?

It's entirely defined by

    SimplCore.getCoreToDo :: DynFlags -> [CoreToDo]

The code there should be reasonably self-explanatory. The type signature is very descriptive.

| 1. What is the difference between a "simplifier iteration" and a
| "simplifier phase"?

Roughly, a complete run of the simplifier means "run the simplifier repeatedly until nothing further happens". The iterations are the successive iterations of this loop. Currently there's a (rather arbitrary) limit of four such iterations before we give up and declare victory.

| 1a. What is the default maximum iteration count? The user documentation
| does not specify that.

Four.

| 2. How can I observe the effects of `-ddump-simpl-phases`? I tried
| compiling several different programs and this flag seems to have no
| effect (i.e. nothing gets printed).

I had to read the source code. I think you say "-ddump-simpl-phases=A,B,C" to dump the output of phases called A, B, C. But no, it seems that it only affects the output of simplifier statistics (see simplifyPgmIO). I have never used this flag. Maybe it can go. Looks strange to me.

| 3. Cardinality analysis and inlining: cardinality analysis can determine
| that a let binding is used exactly once. Can the inliner reuse this
| information from the cardinality analysis, or does it recompute it per
| [3], section 3.1?

Cardinality analysis determines that something is *demanded* once. The occurrence analyser determines when it *occurs* once. For example, in

    if .. then x else x+1

x occurs twice, but is demanded once. So they are different. The inliner uses occurrence information.

| 4a. The first phase is "Desugar (after optimization)". What
| optimisations are performed during desugaring?

Just a few basic ones; see CoreSubst.simpleOptPgm. It implements the "Very Simple Optimiser", which is only a page or two of code. Reading it and writing a Note that enumerates what optimisations it does would be a Good Thing.

| 4b. I'm not sure whether I'm looking at a single iteration of the
| core2core transformation or at multiple ones. Some passes are performed
| several times (Float out, Float inwards), which suggests that there
| might be many iterations here. On the other hand, simplifier phases are
| decreasing towards 0, which looks as if it was one core2core iteration.
| My assumption here is that every time a new core2core iteration starts,
| the simplifier phases are counted anew from 2 towards 0. Is that
| correct?

No. getCoreToDo produces a list of CoreToDos. Each specifies a stage in the pipeline. One such stage is a run of the simplifier. Such a run has a "phase" number, which is set in getCoreToDo. This phase number is used (only) to control INLINE pragmas and RULES (see the extensive documentation in the user manual, e.g. http://www.haskell.org/ghc/docs/latest/html/users_guide/pragmas.html#phase-c...).

| 4c. Why are there several 0 phases of the Simplifier? I find it
| confusing.

We need to run the simplifier several times to propagate the effects of (say) strictness analysis or let-floating. But by that stage the need for staging RULES and INLINE pragmas is over.

| 4d. I understand that some passes can be enabled or disabled using
| command-line options. Can the decision to run some passes be made
| dynamically by the compiler (e.g. to run extra simplifier passes)?

Yes. Plug-ins do precisely that, by manipulating the [CoreToDo] pipeline list.
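
For concreteness, here is a minimal sketch of such a plugin, modelled on the example in the GHC user guide; the module name, pass name, and message are invented for illustration:

    module MyPlugin (plugin) where

    import GhcPlugins

    plugin :: Plugin
    plugin = defaultPlugin { installCoreToDos = install }

    -- The plugin receives the default [CoreToDo] pipeline and may
    -- reorder, drop, or extend it; here we just append one extra pass.
    install :: [CommandLineOption] -> [CoreToDo] -> CoreM [CoreToDo]
    install _opts todos = do
      reinitializeGlobals   -- required boilerplate in this era's plugin API
      return (todos ++ [CoreDoPluginPass "Say name" pass])

    -- A do-nothing pass: print a message, return the module unchanged.
    pass :: ModGuts -> CoreM ModGuts
    pass guts = do
      putMsgS "Extra pass running"
      return guts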

| 4e. Are there more phases that could appear here, i.e. ones that were
| omitted with -O?

Could appear where?

| 4f. The "Levels added" pass before the "Float out" pass: my guess is
| that this is preparation for the full laziness transform. So, is full
| laziness performed as part of the "Float out" pass?

The full laziness pass is simply a combination of set-levels (an analysis) followed by float-out (which transforms the program).

| A general note is that I am confused by many Simplifier passes being
| interleaved with other passes. I expected that simplifier phases would
| be grouped into a single pass, as speculated in question 1.

Many passes produce output that can readily be simplified. Rather than require them to perform those simplifications, we delegate it to the simplifier.

| 5. What optimisations *exactly* are performed by the Simplifier? I
| assume most of what's described in chapter 3 of [2]: beta reduction,
| let elimination, case elimination, case floating, constant folding,
| and eta expansion. I'm not sure about floating lets outwards and
| inwards - [1], pg. 7, says these are in a pass separate from the
| simplifier. `-dverbose-core2core` seems to confirm that, since it
| reveals separate "Float out" and "Float inwards" passes.

I don't have an exhaustive list, I'm afraid. The general rule is: if it can be done with local knowledge, the simplifier should do it. Making a full list would be another good exercise. If you start, I could add more. Examples:

- constant folding
- applying RULES
- inlining (a really important one)
- case of case
- case of known constructor
- eta expansion and eta reduction
- combining adjacent casts
- pushing a cast out of the way of an application,
  e.g. (f |> k) a ==> f (a |> k1) |> k2 for suitable k1, k2

and there are probably a lot more. Look for SimplifierTicks; every time the simplifier does something significant it bumps a tick count (or should do so).

| 6. [4], pg. 31, mentions the Deforestation optimisation. Is everything
| described in that "Deforestation" section subsumed by cardinality
| analysis ([5], end of section 2.1 and section 7.1)? If not, when is
| deforestation performed?

Deforestation is simply the application of rewrite RULES; that's done by the simplifier.

| 7. [5], section 6.1 says: "We run the analysis twice, once in the
| middle of the optimisation pipeline and once near the end". When
| exactly in the middle of the pipeline? Between which passes? This does
| not show up with `-dverbose-core2core` (or at least it is not
| explicitly named).

I'm not actually certain that cardinality analysis *is* run twice in HEAD. Maybe it was only in the version for the paper. Ilya can tell us.

| 8. How does the rules rewriting fit into the picture?

It's done by the simplifier.
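
To make one of those local rewrites concrete, here is a tiny, hypothetical source fragment and the effect of case-of-known-constructor on it (a sketch of what the simplifier does, not actual dump output):

    -- Before: the scrutinee is an application of a known constructor.
    f :: Int -> Int
    f y = case Just y of
            Nothing -> 0
            Just x  -> x + 1

    -- After case-of-known-constructor (plus trivial let elimination),
    -- the function is, in effect:
    --     f y = y + 1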
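
And since questions 4b, 6, and 8 all meet in the RULES machinery, here is a small self-contained illustration of a deforestation-style rule staged with a phase-numbered pragma; `myMap` is a made-up stand-in for a list producer/consumer, and the rule is the classic map/map example from [7]:

    module Fuse (myMap) where

    -- Deforestation here is just a rewrite RULE; the simplifier applies
    -- it whenever it spots the left-hand side.
    {-# RULES
    "myMap/myMap"  forall f g xs.  myMap f (myMap g xs) = myMap (f . g) xs
      #-}

    myMap :: (a -> b) -> [a] -> [b]
    myMap _ []     = []
    myMap f (x:xs) = f x : myMap f xs

    -- Phase control: NOINLINE [0] keeps myMap from inlining in phases 2
    -- and 1, so calls to myMap survive long enough for the rule to fire;
    -- from phase 0 onwards it may be inlined as usual.
    {-# NOINLINE [0] myMap #-}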

The wiki page just went live: https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/Core2CorePipeline It's not yet perfect but it should be a good start.
| Roughly, a complete run of the simplifier means "run the simplifier
| repeatedly until nothing further happens". The iterations are the
| successive iterations of this loop. Currently there's a (rather
| arbitrary) limit of four such iterations before we give up and declare
| victory.

A limit or a default value for that limit?
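
(It appears to be a default for a configurable limit: the user guide documents `-fmax-simplifier-iterations=N`, which, being a dynamic flag, should also be settable per module, e.g.:

    {-# OPTIONS_GHC -fmax-simplifier-iterations=10 #-}

)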

To Ilya: if you grep for the "late_dmd_anal" option variable in compiler/simplCore/SimplCore.lhs, you'll see that it triggers a phase close to the end of getCoreToDo's tasks, which contains, in particular, the "CoreDoStrictness" pass. This is the "late" phase. The paper said that the late pass is run to detect single-entry thunks, and the reason why it is run late in the pipeline is that if it were run earlier this information could be invalidated by the transformations. But in the source code I see that this late pass is followed by the simplifier, which can invalidate the information. Also, the documentation for -flate-dmd-anal says: "We found some opportunities for discovering strictness that were not visible earlier; and optimisations like -fspec-constr can create functions with unused arguments which are eliminated by late demand analysis". This says nothing about single-entry thunks. So, is the single-entry thunk optimisation performed by GHC?

Janek