Re: [GHC] #7258: Compiling DynFlags is jolly slow

8 Nov 2017

      #7258: Compiling DynFlags is jolly slow
-------------------------------------+-------------------------------------
        Reporter:  simonpj           |                Owner:  simonpj
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  7.6.1
      Resolution:                    |             Keywords:  deriving-perf
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Compile-time      |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by simonpj):

 Hmm. The `Cmm` for `g10` is actually a lot bigger than that for `f10`, and
 looked to me as if it too would scale quadratically.  Are `f10` and `g10`
 accurate reflections of the "scale well" and "scale badly" versions of
 Read?

 I can see why things might scale badly, if we have a lot of
 {{{
 p1 >>= \x ->
 p2 >>= \y ->
 p3 >>= \z ->
 return (K x y z)
 }}}
 And `f10` has that structure, albeit with data constructors rather than
 functions.  BUT: if that's it, why doesn't `(>>=)` inline, so we get a
 case expression instead??

 So if `f10` is reasonably accurate, I can see two ways forward:

 1 Do something in the back end to generate less code.

 2 Generate different code.  I'm still intrigued (and a bit disbelieving)
 why a different exposition of Read generates code that scales better.

 Concerning (1) the kind of thing I had in mind was more like this
 {{{
 f10 =
   A (\i0 ->
   A (\i1 ->
   A (\i2 ->
   A (\i3 ->
   A (\i4 -> let t = (i0,i1,i2,i3,i4) in
   A (\i5 ->
   A (\i6 ->
   A (\i7 ->
   A (\i8 ->
   A (\i9 ->
   case t of (i0,i1,i2,i3,i4) ->
      N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9)
 }}}
 Now no function closure gets more than 5 free variables.  I imagine one
 could do this as a STG-to-STG transformation if we decided that this was
 really the core issue.

 A tantalising point is this.  We have something like this:
 {{{
    let f4 = [i0,i1,i2,i3] \i4 ->
             let f5 = [i0,i1,i2,i3, i4] \i5 -> ...blah...
 }}}
 The list before the `\` are the free vars of the closure.

 Inside the code for `f4` we have `f4`'s function closure in hand, with
 `i0...` in it.  Rather than capturing `i0, i1...` as the free vars of f5,
 we could just store a pointer to `f4`'s function closure (it's a kind of
 tuple, after all) and fetch its free vars from there.  It's as if we
 wanted to say
 {{{
    let f4 = [i0,i1,i2,i3] \i4 ->
             let f5 = [f4, i4] \i5 -> ...blah...
 }}}
 with `f4` and `i4` being the free vars of `f5`'s closure.  This would be a
 deeper change to the code generator, but it could make a lot of sense in
 cases where ''all'' of `f4`'s free variables are also free in `f5`.  (If
 not all where, you might get a space leak by closing over `f4`.)

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:88
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler