Re: performance regressions

19 Dec 2014

      I've gained a little more insight here, but only a little.

INLINE doesn't work because zipWithAndUnzipM is recursive. Oddly, using worker/wrapper made allocations go *up*, so that's somehow not the answer.

When I use `inline`, GHC will inline the function once; recursive calls remain. This seems to the local minimum in allocations.
Using `inline` gives roughly a 7% allocation level change on this pathological case.

To reproduce, just compile HEAD's TcFlatten with and without the call to `inline`. You can see, using -ddump-simpl, what GHC is doing. Then, you can check the allocation numbers by compiling perf/compiler/T9872a.hs. My tests were all using the default build settings, with an unmodified build.mk, as that seemed to be the most performant setting.

In other news, a slight change to the algorithm (see #9872, comment 23) makes a 50% improvement. Will hopefully commit tonight, after a full local validation run.

Richard

On Dec 17, 2014, at 10:59 AM, Simon Peyton Jones  wrote:
...
I still would like to understand why INLINE does not make it inline. That's weird.
Eg way to reproduce.
Simion
|  -----Original Message-----
|  From: Richard Eisenberg [mailto:eir@cis.upenn.edu]
|  Sent: 17 December 2014 15:56
|  To: Simon Peyton Jones
|  Cc: Joachim Breitner; ghc-devs@haskell.org
|  Subject: Re: performance regressions
|  
|  By unsubstantiated guess is that INLINEABLE would have the same effect
|  as INLINE here, as GHC doesn't see fit to actually inline the
|  function, even with INLINE -- the big improvement seen between (1) and
|  (2) is actually specialization, not inlining. The jump from (2) to (3)
|  is actual inlining. Thus, it seems that GHC's heuristics for inlining
|  aren't working out for the best here.
|  
|  I've pushed my changes, though I agree with Simon that more research
|  may uncover even more improvements here. I didn't focus on the number
|  of calls because that number didn't regress. Will look into this soon.
|  
|  Richard
|  
|  On Dec 17, 2014, at 4:15 AM, Simon Peyton Jones
|   wrote:
|  
|  > If you use INLINEABLE, that should make the function specialisable
|  to a particular monad, even if it's in a different module. You
|  shouldn't need INLINE for that.
|  >
|  > I don't understand the difference between cases (2) and (3).
|  >
|  > I am still suspicious of why there are so many calls to this one
|  function that it, alone, is allocating a significant proportion of
|  compilation of the entire run of GHC.  Are you sure there isn't an
|  algorithmic improvement to be had, to simply reduce the number of
|  calls?
|  >
|  > Simon
|  >
|  > |  -----Original Message-----
|  > |  From: ghc-devs [mailto:ghc-devs-bounces@haskell.org] On Behalf Of
|  > | Richard Eisenberg
|  > |  Sent: 16 December 2014 21:46
|  > |  To: Joachim Breitner
|  > |  Cc: ghc-devs@haskell.org
|  > |  Subject: Re: performance regressions
|  > |
|  > |  I've learned several very interesting things in this analysis.
|  > |
|  > |  - Inlining polymorphic methods is very important. Here are some
|  > | data  points to back up that claim:
|  > |     * Original implementation using zipWithAndUnzipM:
|  8,472,613,440
|  > |  bytes allocated in the heap
|  > |     * Adding {-# INLINE #-} to the definition thereof:
|  6,639,253,488
|  > |  bytes allocated in the heap
|  > |     * Using `inline` at call site to force inlining:
|  6,281,539,792
|  > |  bytes allocated in the heap
|  > |
|  > |  The middle step above allowed GHC to specialize zipWithAndUnzipM
|  to
|  > | my  particular monad, but GHC didn't see fit to actually inline
|  the
|  > | function. Using `inline` forced it, to good effect. (I did not
|  > | collect  data on code sizes, but it wouldn't be hard to.)
|  > |
|  > |  By comparison:
|  > |     * Hand-written recursion:    6,587,809,112 bytes allocated in
|  the
|  > |  heap
|  > |  Interestingly, this is *not* the best result!
|  > |
|  > |  Conclusion: We should probably add INLINE pragmas to Util and
|  > | MonadUtils.
|  > |
|  > |
|  > |  - I then looked at rejiggering the algorithm to keep the common
|  > | case  fast. This had a side effect of changing the
|  zipWithAndUnzipM
|  > | to  mapAndUnzipM, from Control.Monad. To my surprise, this brought
|  > | disaster!
|  > |     * Using `inline` and mapAndUnzipM:        7,463,047,432 bytes
|  > |  allocated in the heap
|  > |     * Hand-written recursion:                 5,848,602,848 bytes
|  > |  allocated in the heap
|  > |
|  > |  That last number is better than the numbers above because of the
|  > | algorithm streamlining. But, the inadequacy of mapAndUnzipM
|  > | surprised  me -- it already has an INLINE pragma in Control.Monad
|  of course.
|  > |  Looking at -ddump-simpl, it seems that mapAndUnzipM was indeed
|  > | getting  inlined, but a call to `map` remained, perhaps causing
|  > | extra  allocation.
|  > |
|  > |  Conclusion: We should examine the implementation of mapAndUnzipM
|  > | (and  similar functions) in Control.Monad. Is it as fast as
|  possible?
|  > |
|  > |
|  > |
|  > |  In the end, I was unable to bring the allocation numbers down to
|  > | where  they were before my work. This is because the flattener now
|  > | deals in  roles. Most of its behavior is the same between nominal
|  > | and  representational roles, so it seems silly (though very
|  > | possible) to  specialize the code to nominal to keep that path
|  fast.
|  > | Instead, I  identified one key spot and made that go fast.
|  > |
|  > |  Thus, there is a 7% bump to memory usage on very-type-family-
|  heavy
|  > | code, compared to before my commit on Friday. (On more ordinary
|  > | code,  there is no noticeable change.)
|  > |
|  > |  Validating my patch locally now; will push when that's done.
|  > |
|  > |  Thanks,
|  > |  Richard
|  > |
|  > |  On Dec 16, 2014, at 10:41 AM, Joachim Breitner  | breitner.de> wrote:
|  > |
|  > |  > Hi,
|  > |  >
|  > |  >
|  > |  > Am Dienstag, den 16.12.2014, 09:59 -0500 schrieb Richard
|  Eisenberg:
|  > |  >> On Dec 16, 2014, at 4:01 AM, Joachim Breitner  | breitner.de> wrote:
|  > |  >>
|  > |  >>> another guess (without looking at the code, sorry): Are they
|  in
|  > | the  >>> same module? I.e., can GHC specialize the code to your
|  > | particular  Monad?
|  > |  >
|  > |  >> No, they're not in the same module. I could also try moving
|  the
|  > | >> zipWithAndUnzipM function to the same module, and even
|  > | specializing  >> it by hand to the right monad.
|  > |  >
|  > |  > I did mean zipWithAndUnzipM, so maybe yes: Try that.
|  > |  >
|  > |  > (I find it hard to believe that any polymorphic monadic code
|  > | should  > perform well, with those many calls to an unknown (>>=)
|  > | with a  > function parameter, but maybe I'm too pessimistic here.)
|  > | >  >  >> Could that be preventing the fusing?
|  > |  >
|  > |  > There is not going to be any fusing here, at least not list
|  > | fusion;  > that would require your code to be written in terms of
|  > | functions  with  > fusion rules.
|  > |  >
|  > |  > Greetings,
|  > |  > Joachim
|  > |  >
|  > |  > --
|  > |  > Joachim "nomeata" Breitner
|  > |  >  mail@joachim-breitner.de * http://www.joachim-breitner.de/  >
|  > | Jabber: nomeata@joachim-breitner.de  * GPG-Key: 0xF0FBF51F  Debian
|  > | > Developer: nomeata@debian.org  >  >
|  > | _______________________________________________
|  > |  > ghc-devs mailing list
|  > |  > ghc-devs@haskell.org
|  > |  > http://www.haskell.org/mailman/listinfo/ghc-devs
|  > |
|  > |  _______________________________________________
|  > |  ghc-devs mailing list
|  > |  ghc-devs@haskell.org
|  > |  http://www.haskell.org/mailman/listinfo/ghc-devs
|  >