 
            #10397: Compiler performance regression 7.6 -> 7.8 in elimCommonBlocks -------------------------------------+------------------------------------- Reporter: | Owner: TobyGoodwin | Status: new Type: bug | Milestone: Priority: normal | Version: 7.8.4 Component: Compiler | Operating System: Unknown/Multiple Keywords: | Type of failure: None/Unknown performance | Blocked By: Architecture: | Related Tickets: Unknown/Multiple | Test Case: see ticket | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- I've been studying a compiler performance problem in ghc-7.8.4 which is not seen in ghc-7.6.3. Profiling shows the compiler spends 74.3% of its time in `elimCommonBlocks` (in `compiler/cmm/CmmCommonBlockElim.hs`). I came across the problem in real code, and rwbarton on #ghc was able to turn it into a sensible test case, which can be found at http://static.tobold.org/ghc-prof/code/ It's not really a regression, since `elimCommonBlocks` is not enabled by default in 7.6. It can be turned on with `-fnew-codegen`, and that demonstrates a similar slowdown. However 7.8 is worse than 7.6, and 7.10 slightly worse still, for reasons that I haven't explored. Compile times for `RegBig.hs` are: {{{ 7.6.3 -O2 17s 7.6.3 -O2 -fnew-codegen 177s 7.8.4 -O2 230s 7.10.1 -O2 241s }}} The problem obviously stems from the applicative expression with lots of branches `Foo <$> a <*> b <*> c <*> ...` It seems that some level of inlining occurs which means that each extra term in this expression doubles the size of the code. Incidentally, this blowup only occurs when both `RegBig.hs` ''and'' `Form.hs` are compiled with `-O2`. I haven't looked to see why the blowup occurs; although I do see that a similar blowup happens with 7.6. {{{ ... *** Simplifier: Result size of Simplifier iteration=1 = {terms: 16,525, types: 31,503, coercions: 73} Result size of Simplifier iteration=2 = {terms: 29,070, types: 45,079, coercions: 73} Result size of Simplifier iteration=3 = {terms: 50,139, types: 62,057, coercions: 73} Result size of Simplifier iteration=4 = {terms: 84,172, types: 83,805, coercions: 73} Result size of Simplifier = {terms: 84,172, types: 83,805, coercions: 73} ... }}} I've looked in some detail at what happens next, tinkering with `elimCommonBlocks` to try to understand why it runs so slowly in this case. There are two major problems. First, `elimCommonBlocks` is worse than O(n^2^), and the inlining blowup mentioned above results in a very large input: 51695 blocks in this example. (Across the project of 50 files or so where this problem originated, I see the input to `elimCommonBlocks` is normally a few hundred blocks at most.) Secondly, there is an effort made in `elimCommonBlocks` to avoid comparing whole blocks all the time, by computing a hash value of some of the block. This fails to produce sufficient unique hash values for this case. The hash must obviously omit the unique block label, and in practice all labels, and various other things, are omitted. In this case, presumably because the code is blown up from the tiny input, many many blocks are similar, and the differences are elided by the hash function. For example, 8177 blocks share the hash value 1094. These 8177 similar blocks all end up in a normal list, which is searched with Data.List.find and the full-block comparison function! Eventually this process concludes that the vast majority are in fact different: for all its hard work, `elimCommonBlocks` only finds about a dozen common blocks in total. Two blocks which both hash to 1094 are these: {{{ cCHc: _se0q::P64 = R1; _c1grM::P64 = _se0q::P64 & 7; if (_c1grM::P64 >= 2) goto c1grR; else goto c1grS; cWK7: _sig3::P64 = R1; _c1iXk::P64 = _sig3::P64 & 7; if (_c1iXk::P64 >= 2) goto c1iXp; else goto c1iXq; }}} I suggest the following steps need to be taken. == 1. Mitigation == Since this is a real problem in the 7.8 and 7.10 series, it needs a simple but effective mitigation. I suggest a test that the input to `elimCommonBlocks` is not "too big", where 1000 seems a reasonable cutoff. Another simple mitigation would be to disable common block elimination altogether (after all, `-O2` is usually understood to mean "make my code go faster even if it gets bigger"). I've tried both these possibilities with 7.8.4. The trivial patches are http://static.tobold.org/ghc-prof/0001-cbe-cutoff.patch and http://static.tobold.org/ghc-prof/0001-no-cbe.patch Compilation times for the test case `RegBig.hs` are as follows, and there is no difference in size in the `.o` file. {{{ 7.8.4 -O2 230s cbe only if input < 1000 blocks 88s no common block elimination 86s }}} == 2. Investigation == I focused on `elimCommonBlocks` thinking that the inline blowup seen earlier in the compilation is reasonable. On reflection, this is probably not so, and I think someone better acquainted with ghc internals should investigate this. == 3. Optimization == `-- TODO: Use optimization fuel` it says in `CmmCommonBlockElim.hs` Since `elimCommonBlocks` is essentially doing a `nub`, it will never perform better than O(n^2^). In fact, it's currently much worse than that as the comparison function changes as duplicate blocks are found, which in turn may create new duplicates, so we have to iterate the nub function till no more dups are found. I did experiment with improving the hash function. It's possible to include target labels, at the expense of needing to recompute block hash values on each iteration. (And currently some ''highly'' ugly code to convert labels to integers.) This considerably reduces the length of some of the hash bucket lists, and so offers a significant speedup: {{{ ghc-7.8.4 -O2 230s "hack" 170s }}} Patch for this is http://static.tobold.org/ghc-prof/0001-cbe-hacks.patch This looks like a promising approach. Including target labels doesn't eliminate all the large hash buckets, so there is scope for further improvements along these lines. I don't see any reason why the hash function should not be as discerning as the equality function, with just occasional accidental clashes. I've also pondered whether more radical rewriting of `elimCommonBlocks` might pay dividends. It's a commonplace that, if you don't need to preserve order, `map head . group . sort` is a faster `nub`. Is the order of blocks significant? I'm not sure - they all seem to end with either `goto` or `call`. I'd ''assume'' that `call` then falls through to the next block but I'm not sure. Anyway, presumably we could `zip [0..]` first, and sort again afterwards, and that should still be asymptotically faster than `nub`. Of course, this presupposes that blocks could be (sensibly? efficiently?) made an instance of `Ord`. (If they were, hash buckets could be stored in an O(1) data structure and binary search employed.) I may be able to find time to pursue some of these avenues, but as a very novice ghc hacker, I'd need some strategic direction here. Any thoughts? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10397 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler