
#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:88 simonpj]:
Hmm. The `Cmm` for `g10` is actually a lot bigger than that for `f10`, and looked to me as if it too would scale quadratically.
It would, but for different reasons, and I believe I can rewrite it to not scale quadratically (but haven't gotten around to it yet).
Are `f10` and `g10` accurate reflections of the "scale well" and "scale badly" versions of Read?
No; we don't actually have a "scale well" version of `Read` yet. All the implementations of `Read` that I have produced so far are in the "scale badly" category, although the degree to which they scale badly differs somewhat. We have two kinds of "scale well" examples: 1. Trivial ones, where rather than nest code such that chains of closures appear, we just call `read` or something else N times, but in benign ways; these examples do not have much of an explanatory value, and I merely included them to establish a performance baseline 2. Well-behaved nested / chained examples, such as the `getline-appl` one; however, none of these do anything equivalent to `read`.
I can see why things might scale badly, if we have a lot of {{{ p1 >>= \x -> p2 >>= \y -> p3 >>= \z -> return (K x y z) }}} And `f10` has that structure, albeit with data constructors rather than
functions. BUT: if that's it, why doesn't `(>>=)` inline, so we get a case expression instead?? I believe the `(>>=)` does in fact inline; `f10` is the kind of shape you get when you inline the `(>>=)` from `ReadP`. `(>>=)` from other monadic types does not introduce this kind of shape, which would explain why `Read` misbehaves but `getLine` over `IO` does not.
So if `f10` is reasonably accurate, I can see two ways forward:
1 Do something in the back end to generate less code.
2 Generate different code. I'm still intrigued (and a bit disbelieving)
why a different exposition of Read generates code that scales better. See above; `Read` always misbehaves here, so your disbelief is entirely justified. The bad behavior seems to be related to the actual implementation of `(>>=)`, not so much whether the `(>>=)` gets inlined. Or, more accurately, inlining `(>>=)` actually *is* the problem; if we didn't inline it, then `read-appl` would compile to something equivalent to the generic applicative example, which behaves nicely.
Concerning (1) the kind of thing I had in mind was more like this {{{ f10 = A (\i0 -> A (\i1 -> A (\i2 -> A (\i3 -> A (\i4 -> let t = (i0,i1,i2,i3,i4) in A (\i5 -> A (\i6 -> A (\i7 -> A (\i8 -> A (\i9 -> case t of (i0,i1,i2,i3,i4) -> N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9) }}} Now no function closure gets more than 5 free variables. I imagine one could do this as a STG-to-STG transformation if we decided that this was really the core issue.
Further experimentation shows that this doesn't actually achieve anything; the resulting Core is *exactly the same*, and if you put both functions into the same compilation unit, they get optimized into expressing one in terms of the other (I'm guessing here that CSE is smart enough to figure out that they are equivalent). I have written 3 versions of `f10`, one as- is, one (`g10`) using the suggested 5-tuple optimization, and one (`h10`) using a nested-tuple optimization, and I end up with Core that has literally these lines: {{{ f10 = A f1 g10 = f10 h10 = f10 }}}
A tantalising point is this. We have something like this: {{{ let f4 = [i0,i1,i2,i3] \i4 -> let f5 = [i0,i1,i2,i3, i4] \i5 -> ...blah... }}} The list before the `\` are the free vars of the closure.
Inside the code for `f4` we have `f4`'s function closure in hand, with `i0...` in it. Rather than capturing `i0, i1...` as the free vars of f5, we could just store a pointer to `f4`'s function closure (it's a kind of tuple, after all) and fetch its free vars from there. It's as if we wanted to say {{{ let f4 = [i0,i1,i2,i3] \i4 -> let f5 = [f4, i4] \i5 -> ...blah... }}} with `f4` and `i4` being the free vars of `f5`'s closure. This would be a deeper change to the code generator, but it could make a lot of sense in cases where ''all'' of `f4`'s free variables are also free in `f5`. (If not all where, you might get a space leak by closing over `f4`.)
-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:89 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler