Re: [GHC] #7258: Compiling DynFlags is jolly slow

13 Nov 2017

      #7258: Compiling DynFlags is jolly slow
-------------------------------------+-------------------------------------
        Reporter:  simonpj           |                Owner:  simonpj
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  7.6.1
      Resolution:                    |             Keywords:  deriving-perf
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Compile-time      |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by tdammers):

 Replying to [comment:88 simonpj]:
...
Hmm. The `Cmm` for `g10` is actually a lot bigger than that for `f10`,
 and looked to me as if it too would scale quadratically.
It would, but for different reasons, and I believe I can rewrite it to not
 scale quadratically (but haven't gotten around to it yet).
...
Are `f10` and `g10` accurate reflections of the "scale well" and "scale
 badly" versions of Read?
No; we don't actually have a "scale well" version of `Read` yet. All the
 implementations of `Read` that I have produced so far are in the "scale
 badly" category, although the degree to which they scale badly differs
 somewhat. We have two kinds of "scale well" examples:

 1. Trivial ones, where rather than nest code such that chains of closures
 appear, we just call `read` or something else N times, but in benign ways;
 these examples do not have much of an explanatory value, and I merely
 included them to establish a performance baseline
 2. Well-behaved nested / chained examples, such as the `getline-appl` one;
 however, none of these do anything equivalent to `read`.
...
I can see why things might scale badly, if we have a lot of
{{{
p1 >>= \x ->
p2 >>= \y ->
p3 >>= \z ->
return (K x y z)
}}}
And `f10` has that structure, albeit with data constructors rather than
functions.  BUT: if that's it, why doesn't `(>>=)` inline, so we get a
 case expression instead??

 I believe the `(>>=)` does in fact inline; `f10` is the kind of shape you
 get when you inline the `(>>=)` from `ReadP`. `(>>=)` from other monadic
 types does not introduce this kind of shape, which would explain why
 `Read` misbehaves but `getLine` over `IO` does not.
...
So if `f10` is reasonably accurate, I can see two ways forward:
1 Do something in the back end to generate less code.
2 Generate different code.  I'm still intrigued (and a bit disbelieving)
why a different exposition of Read generates code that scales better.

 See above; `Read` always misbehaves here, so your disbelief is entirely
 justified. The bad behavior seems to be related to the actual
 implementation of `(>>=)`, not so much whether the `(>>=)` gets inlined.
 Or, more accurately, inlining `(>>=)` actually *is* the problem; if we
 didn't inline it, then `read-appl` would compile to something equivalent
 to the generic applicative example, which behaves nicely.
...
Concerning (1) the kind of thing I had in mind was more like this
{{{
f10 =
  A (\i0 ->
  A (\i1 ->
  A (\i2 ->
  A (\i3 ->
  A (\i4 -> let t = (i0,i1,i2,i3,i4) in
  A (\i5 ->
  A (\i6 ->
  A (\i7 ->
  A (\i8 ->
  A (\i9 ->
  case t of (i0,i1,i2,i3,i4) ->
     N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9)
}}}
Now no function closure gets more than 5 free variables.  I imagine one
 could do this as a STG-to-STG transformation if we decided that this was
 really the core issue.
Further experimentation shows that this doesn't actually achieve anything;
 the resulting Core is *exactly the same*, and if you put both functions
 into the same compilation unit, they get optimized into expressing one in
 terms of the other (I'm guessing here that CSE is smart enough to figure
 out that they are equivalent). I have written 3 versions of `f10`, one as-
 is, one (`g10`) using the suggested 5-tuple optimization, and one (`h10`)
 using a nested-tuple optimization, and I end up with Core that has
 literally these lines:

 {{{
 f10 = A f1

 g10 = f10

 h10 = f10
 }}}
...
A tantalising point is this.  We have something like this:
{{{
   let f4 = [i0,i1,i2,i3] \i4 ->
            let f5 = [i0,i1,i2,i3, i4] \i5 -> ...blah...
}}}
The list before the `\` are the free vars of the closure.
Inside the code for `f4` we have `f4`'s function closure in hand, with
 `i0...` in it.  Rather than capturing `i0, i1...` as the free vars of f5,
 we could just store a pointer to `f4`'s function closure (it's a kind of
 tuple, after all) and fetch its free vars from there.  It's as if we
 wanted to say
{{{
   let f4 = [i0,i1,i2,i3] \i4 ->
            let f5 = [f4, i4] \i5 -> ...blah...
}}}
with `f4` and `i4` being the free vars of `f5`'s closure.  This would be
 a deeper change to the code generator, but it could make a lot of sense in
 cases where ''all'' of `f4`'s free variables are also free in `f5`.  (If
 not all where, you might get a space leak by closing over `f4`.)
-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:89
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler