
#10397: Compiler performance regression 7.6 -> 7.8 in elimCommonBlocks
-------------------------------------+-------------------------------------
        Reporter:  TobyGoodwin       |             Owner:
            Type:  bug               |            Status:  new
        Priority:  normal            |         Milestone:
       Component:  Compiler          |           Version:  7.8.4
        Keywords:  performance       |  Operating System:  Unknown/Multiple
    Architecture:  Unknown/Multiple  |   Type of failure:  None/Unknown
       Test Case:  see ticket        |        Blocked By:
        Blocking:                    |   Related Tickets:
Differential Revisions:              |
-------------------------------------+-------------------------------------
I've been studying a compiler performance problem in ghc-7.8.4 that is not
seen in ghc-7.6.3. Profiling shows the compiler spends 74.3% of its time in
`elimCommonBlocks` (in `compiler/cmm/CmmCommonBlockElim.hs`).

I came across the problem in real code, and rwbarton on #ghc was able to
turn it into a sensible test case, which can be found at
http://static.tobold.org/ghc-prof/code/

It's not really a regression, since `elimCommonBlocks` is not enabled by
default in 7.6. It can be turned on with `-fnew-codegen`, and doing so
demonstrates a similar slowdown. However, 7.8 is worse than 7.6, and 7.10
slightly worse still, for reasons I haven't explored. Compile times for
`RegBig.hs` are:

{{{
7.6.3 -O2                  17s
7.6.3 -O2 -fnew-codegen   177s
7.8.4 -O2                 230s
7.10.1 -O2                241s
}}}

The problem evidently stems from the applicative expression with many
branches, `Foo <$> a <*> b <*> c <*> ...`. Some level of inlining occurs
such that each extra term in this expression doubles the size of the
generated code. Incidentally, this blowup only occurs when both `RegBig.hs`
''and'' `Form.hs` are compiled with `-O2`. I haven't looked into why the
blowup occurs, although I do see that a similar blowup happens with 7.6.

{{{
...
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 16,525, types: 31,503, coercions: 73}
Result size of Simplifier iteration=2
  = {terms: 29,070, types: 45,079, coercions: 73}
Result size of Simplifier iteration=3
  = {terms: 50,139, types: 62,057, coercions: 73}
Result size of Simplifier iteration=4
  = {terms: 84,172, types: 83,805, coercions: 73}
Result size of Simplifier
  = {terms: 84,172, types: 83,805, coercions: 73}
...
}}}

I've looked in some detail at what happens next, tinkering with
`elimCommonBlocks` to try to understand why it runs so slowly in this case.
There are two major problems.

First, `elimCommonBlocks` is worse than O(n^2^), and the inlining blowup
mentioned above results in a very large input: 51695 blocks in this
example. (Across the project of 50 or so files where this problem
originated, the input to `elimCommonBlocks` is normally a few hundred
blocks at most.)

Secondly, `elimCommonBlocks` makes an effort to avoid comparing whole
blocks all the time by computing a hash over part of each block. In this
case, the hash fails to produce sufficiently distinct values. The hash must
obviously omit the unique block label, and in practice all labels, and
various other things, are omitted. Here, presumably because the code is
blown up from a tiny input, a great many blocks are similar, and their
differences are exactly the parts elided by the hash function. For example,
8177 blocks share the hash value 1094. These 8177 similar blocks all end up
in a plain list, which is searched with `Data.List.find` and the full-block
comparison function! Eventually this process concludes that the vast
majority are in fact different: for all its hard work, `elimCommonBlocks`
finds only about a dozen common blocks in total.
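To make the failure mode concrete, here is a minimal sketch of the
hash-then-compare strategy described above. This is ''not'' GHC's actual
code — `Hash` and `nubByHash` are hypothetical simplifications — but it has
the same shape: bucket elements by a cheap hash, and fall back to the
expensive full equality only within a bucket.

```haskell
import qualified Data.Map.Strict as M
import Data.List (find)

type Hash = Int

-- Deduplicate a list, using a cheap hash to narrow down candidates
-- before applying the expensive full comparison (here just (==)).
nubByHash :: Eq a => (a -> Hash) -> [a] -> [a]
nubByHash h = go M.empty
  where
    go _ [] = []
    go buckets (x:xs) =
      let bucket = M.findWithDefault [] (h x) buckets
      in case find (== x) bucket of
           Just _  -> go buckets xs        -- duplicate: drop it
           Nothing ->                      -- new: keep it, extend bucket
             x : go (M.insertWith (++) (h x) [x] buckets) xs
```

With a discerning hash each bucket stays short and the linear `find` is
cheap. With a weak hash — in the extreme, `const 1094` — every element
lands in one bucket and the search degenerates into exactly the O(n^2^)
full-block scan seen in this ticket.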
Two blocks which both hash to 1094 are these:

{{{
cCHc:
    _se0q::P64 = R1;
    _c1grM::P64 = _se0q::P64 & 7;
    if (_c1grM::P64 >= 2) goto c1grR; else goto c1grS;

cWK7:
    _sig3::P64 = R1;
    _c1iXk::P64 = _sig3::P64 & 7;
    if (_c1iXk::P64 >= 2) goto c1iXp; else goto c1iXq;
}}}

I suggest the following steps need to be taken.

== 1. Mitigation ==

Since this is a real problem in the 7.8 and 7.10 series, it needs a simple
but effective mitigation. I suggest a test that the input to
`elimCommonBlocks` is not "too big", where 1000 blocks seems a reasonable
cutoff. Another simple mitigation would be to disable common block
elimination altogether (after all, `-O2` is usually understood to mean
"make my code go faster even if it gets bigger"). I've tried both
possibilities with 7.8.4. The trivial patches are
http://static.tobold.org/ghc-prof/0001-cbe-cutoff.patch and
http://static.tobold.org/ghc-prof/0001-no-cbe.patch

Compilation times for the test case `RegBig.hs` are as follows, with no
difference in the size of the `.o` file:

{{{
7.8.4 -O2                          230s
cbe only if input < 1000 blocks     88s
no common block elimination         86s
}}}

== 2. Investigation ==

I focused on `elimCommonBlocks` on the assumption that the inlining blowup
seen earlier in the compilation is reasonable. On reflection, it probably
is not, and I think someone better acquainted with GHC internals should
investigate it.

== 3. Optimization ==

`-- TODO: Use optimization fuel`, it says in `CmmCommonBlockElim.hs`.

Since `elimCommonBlocks` is essentially doing a `nub`, it will never
perform better than O(n^2^). In fact, it's currently much worse than that:
the comparison function changes as duplicate blocks are found, which in
turn may create new duplicates, so the nub has to be iterated until no more
duplicates are found.

I did experiment with improving the hash function. It's possible to include
target labels, at the expense of needing to recompute block hash values on
each iteration.
(And, currently, at the expense of some ''highly'' ugly code to convert
labels to integers.) This considerably reduces the length of some of the
hash bucket lists, and so offers a significant speedup:

{{{
ghc-7.8.4 -O2   230s
"hack"          170s
}}}

The patch for this is http://static.tobold.org/ghc-prof/0001-cbe-hacks.patch

This looks like a promising approach. Including target labels doesn't
eliminate all the large hash buckets, so there is scope for further
improvement along these lines. I don't see any reason why the hash function
should not be as discerning as the equality function, with only occasional
accidental clashes.

I've also pondered whether a more radical rewrite of `elimCommonBlocks`
might pay dividends. It's a commonplace that, if you don't need to preserve
order, `map head . group . sort` is a faster `nub`. Is the order of blocks
significant? I'm not sure - they all seem to end with either `goto` or
`call`. I'd ''assume'' that `call` then falls through to the next block,
but I'm not sure. In any case, presumably we could `zip [0..]` first and
sort back to the original order afterwards, and that should still be
asymptotically faster than `nub`. Of course, this presupposes that blocks
could be (sensibly? efficiently?) made an instance of `Ord`. (If they were,
hash buckets could be stored in an O(1) data structure and binary search
employed.)

I may be able to find time to pursue some of these avenues, but as a very
novice GHC hacker, I'd need some strategic direction here. Any thoughts?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10397
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
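The `zip [0..]` idea above can be written down directly. The following is a
hypothetical illustration (the name `ordNub` and the whole function are
mine, not GHC code): tag each element with its position, sort by value so
that duplicates become adjacent, keep the first occurrence of each run,
then sort back by position. This costs O(n log n) rather than `nub`'s
O(n^2^), but, as noted, it presupposes an `Ord` instance.

```haskell
import Data.List (sortOn, groupBy)
import Data.Function (on)

-- Order-preserving deduplication in O(n log n), assuming Ord.
ordNub :: Ord a => [a] -> [a]
ordNub = map snd
       . sortOn fst                -- sort back to original positions
       . map head                  -- first occurrence in each run
                                   -- (groups are never empty)
       . groupBy ((==) `on` snd)   -- runs of equal values
       . sortOn snd                -- make duplicates adjacent
       . zip [0 :: Int ..]         -- remember each value's position
```

Because `sortOn` is stable, equal values within a run stay in position
order, so `head` really does pick the earliest occurrence; e.g.
`ordNub [3,1,3,2,1]` gives `[3,1,2]`, matching `nub`.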