
Ben, David, Reid

I have been working for months (on and off, mostly off, but very ON for the last week or two) on a very simple idea: the simplifier should inline things even in the "gentle" phase.

It seems so simple. And it is: the key patch is tiny.

But it stressed corners of the optimiser that were not stressed before; and digging into it showed opportunities I did not know about before.

So I have ended up with a whole series of patches, which are on wip/spj-early-inline branch

7f14d15c0e5fc2c9a81db3d0f0b01d85857b1d87 Error message wibbles accumulated from the preceding patches
0499c65d9fa45e7879e1e1264fdaa15274adcba6 Improve SetLevels for join points
3b2fc0827ff6cafa34836c2d9dc710b628c990b6 Change -ddump-tc-trace output in TcErrors, slightly
9ffdf62b0ca72c4f35579f9d6f31a9beebf23025 Improve pretty-printing of types
3f346eac06399a79adf48425018ee949cee245bf Add VarSet.anyDVarSet, allDVarSet
912e71eb3b4ec91e805ecf2236d1033e55e2933a The Early Inline Patch
7188cd13f8e54efa764d52ca016b87b3669b29f5 Small changes to expression sizing in CoreUnfold
bfc6fa3f377d11bdfcdbf82b65bf2f39cb00b90c Fix SetLevels for makeStaticPtr
8b1cfea089faacb5b95ffcc3511e05faeabb8076 Extend CSE to handle recursive bindings
50411995641802568bb27c867afe804f91e0524c Combine identical case alterantives in CSE
2e077ccc736a0b2a622b7f42b7929966bddb4ded Inline data constructor wrappers in phase 2 only
b868de53dd19f639c1070089ecff21948ff33e0d Make Specialise work with casts
c767ae5f04a09ef71dcb8f67a17225a52c2cc5d2 Stop uniques ending up in SPEC rule names
b49ed1f0102f93ca7f62632c436b41bd240b501f Occurrence-analyse the result of rule firings
607a735dfb99bb8f0edf466ccb01e732218c42ec Add -fspec-constr-keen
67a0c1872c0515f1f12ea68097a84e02da92f45b Refactor floating of bindings (fiBind)
e90f4d7c6d3003039fa1647a3da3dafcaa75527b More tracing in SpecConstr

Much to my surprise, we get some jolly nice improvements in compiler perf:

3% perf/compiler/T5837.run T5837 [stat too good] (normal)
7% perf/compiler/parsing001.run parsing001 [stat too good] (normal)
9% perf/compiler/T12234.run T12234 [stat too good] (optasm)
35% perf/compiler/T9020.run T9020 [stat too good] (optasm)
9% perf/compiler/T3064.run T3064 [stat too good] (normal)
13% perf/compiler/T9961.run T9961 [stat too good] (normal)
20% perf/compiler/T13056.run T13056 [stat too good] (optasm)
5% perf/compiler/T9872d.run T9872d [stat too good] (normal)
5% perf/compiler/T9872c.run T9872c [stat too good] (normal)
5% perf/compiler/T9872b.run T9872b [stat too good] (normal)
7% perf/compiler/T9872a.run T9872a [stat too good] (normal)
5% perf/compiler/T783.run T783 [stat too good] (normal)
35% perf/compiler/T12227.run T12227 [stat too good] (normal)
20% perf/compiler/T1969.run T1969 [stat too good] (normal)
5% perf/should_run/lazy-bs-alloc.run lazy-bs-alloc [stat too good] (normal)
5% perf/compiler/T12707.run T12707 [stat too good] (normal)
4% perf/compiler/T3294.run T3294 [stat too good] (normal)
1.5% perf/space_leaks/T4029.run T4029 [stat too good] (ghci)

So what is left? I have sunk so much time into this and am still not QUITE out of the woods. I was left with

Unexpected failures:
codeGen/should_compile/debug.run debug [bad stdout] (normal)
concurrent/should_run/T4030.run T4030 [bad exit code] (normal)

I'm re-validating having pulled from HEAD, but I THINK that's all.

Now
* I don't know how to Phab these individually
* I have not sweated through which patch is responsible for which perf improvements. Maybe Gipeda can tell?
* I have not put each error message change with the correct patch. I don't know how much that matters.

So this is to say: anything you guys can do to help get this actually Done would be really helpful. I'm out of time till Monday at least.

It would be great to collect those performance improvements!

Thanks!

Simon

Yay! Is that related to the following ("I also want to investigate
making INLINE pragmas fire in the "gentle" phase, on the grounds
that that's what the programmer said.")?
https://ghc.haskell.org/trac/ghc/ticket/12603#comment:30
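
For readers less familiar with the pragma being discussed, here is a minimal sketch of the two INLINE forms that matter for this idea. The function names are made up for illustration. The phase numbering (the simplifier counts down through phases 2, 1, 0 after an initial "gentle" phase) is how GHC documents it; the comments about what would happen in the gentle phase describe the intent stated in the thread, not verified behaviour of any particular GHC version.

```haskell
-- Hypothetical example functions; only the pragma syntax is standard GHC.

-- Plain INLINE: the programmer asks for inlining with no phase restriction.
-- Under the "early inline" idea, this is the kind of function that could
-- already be inlined in the gentle phase.
{-# INLINE square #-}
square :: Int -> Int
square x = x * x

-- Phase-controlled INLINE: do not inline before phase 1, be keen from
-- phase 1 onwards. A function marked like this should still wait,
-- whatever the gentle phase does.
{-# INLINE [1] double #-}
double :: Int -> Int
double x = x + x

-- A caller whose body contains both kinds of call site.
combined :: Int -> Int
combined n = square n + double n
```

Loading this in GHCi and evaluating `combined 3` exercises both functions; compiling with `-ddump-simpl` is the usual way to see where the calls actually ended up being inlined.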

| Yay! Is that related to the following ("I also want to investigate making
| INLINE pragmas fire in the "gentle" phase, on the grounds that that's
| what the programmer said.")?
|
Yes, precisely
Simon
| -----Original Message-----
| From: Mikolaj Konarski [mailto:mikolaj@well-typed.com]
| Sent: 17 February 2017 17:06
| To: Simon Peyton Jones

Yes, we definitely want these. Do you want each of these submitted as a separate differential *in order*? Or do you want a more complex mix-and-match? Also, are there any commits you think should be squashed?

I can see that
- it'd be nice to associate the perf improvements with the right patch
- it'd be nice to associate the error-message wibbles with the right patch
- it'd be nice to Phab them all so others can comment
But life is short, so I'd be perfectly happy if we were able to just commit them, provided they validate collectively. It's up to you guys.
There may be some more error message wibbles when you do a full run (I didn't have time to do that before leaving).
Don't squash them: each patch does something separate; it's not a stream of successive fixes to the same thing. I've already done the squashing.
The SetLevels changes strictly subsume everything in the separate patch I sent Ben (cc ghc-devs) fixing #13255, and will conflict with it. If so, ignore the latter.
Simon
-----Original Message-----
| From: David Feuer [mailto:david@well-typed.com]
| Sent: 17 February 2017 18:33
| To: ghc-devs@haskell.org; Simon Peyton Jones

Hi,

On Friday, 17.02.2017, at 16:41 +0000, Simon Peyton Jones via ghc-devs wrote:
| · I have not sweated through which patch is responsible for which perf improvements. Maybe Gipeda can tell?

Yes, it can! It does not draw nice graphs for branches yet, but it will (try to) build all the commits on the branch. Once that is done (which can take a while), the branch will show up under “Branches” on https://perf.haskell.org/ghc/

Clicking on the hash next to the branch will show you the latest commit on that branch, together with its performance changes. That page also has a “parent” link that you can click to look at the previous patches in sequence.

I can have a look once the patches are built.

Greetings,
Joachim

-- 
Joachim “nomeata” Breitner
  mail@joachim-breitner.de • https://www.joachim-breitner.de/
  XMPP: nomeata@joachim-breitner.de • OpenPGP-Key: 0xF0FBF51F
  Debian Developer: nomeata@debian.org

Hi,

perf.haskell.org has built all but the last patch in this sequence, so I can now see what it has to say about where the performance changes came from:

On Friday, 17.02.2017, at 16:41 +0000, Simon Peyton Jones via ghc-devs wrote:
| So I have ended up with a whole series of patches, which are on wip/spj-early-inline branch
| 7f14d15c0e5fc2c9a81db3d0f0b01d85857b1d87 Error message wibbles accumulated from the preceding patches

Not built yet, but probably not interesting.

| 0499c65d9fa45e7879e1e1264fdaa15274adcba6 Improve SetLevels for join points

nofib/time/hidden 0.376 + 5.59% 0.397 seconds

| 3b2fc0827ff6cafa34836c2d9dc710b628c990b6 Change -ddump-tc-trace output in TcErrors, slightly

no change

| 9ffdf62b0ca72c4f35579f9d6f31a9beebf23025 Improve pretty-printing of types

no change

| 3f346eac06399a79adf48425018ee949cee245bf Add VarSet.anyDVarSet, allDVarSet

no change

| 912e71eb3b4ec91e805ecf2236d1033e55e2933a The Early Inline Patch
| 7188cd13f8e54efa764d52ca016b87b3669b29f5 Small changes to expression sizing in CoreUnfold
| bfc6fa3f377d11bdfcdbf82b65bf2f39cb00b90c Fix SetLevels for makeStaticPtr
| 8b1cfea089faacb5b95ffcc3511e05faeabb8076 Extend CSE to handle recursive bindings
| 50411995641802568bb27c867afe804f91e0524c Combine identical case alterantives in CSE
| 2e077ccc736a0b2a622b7f42b7929966bddb4ded Inline data constructor wrappers in phase 2 only
| b868de53dd19f639c1070089ecff21948ff33e0d Make Specialise work with casts
| c767ae5f04a09ef71dcb8f67a17225a52c2cc5d2 Stop uniques ending up in SPEC rule names
| b49ed1f0102f93ca7f62632c436b41bd240b501f Occurrence-analyse the result of rule firings
| 607a735dfb99bb8f0edf466ccb01e732218c42ec Add -fspec-constr-keen
| 67a0c1872c0515f1f12ea68097a84e02da92f45b Refactor floating of bindings (fiBind)

These patches cannot be distinguished because all but the last one failed to build:

compiler/simplCore/SimplCore.hs:435:48: error:
    • Couldn't match type ‘CoreM ModGuts’
                     with ‘CoreProgram -> CoreProgram’
      Expected type: DynFlags -> CoreProgram -> CoreProgram
        Actual type: ModGuts -> CoreM ModGuts
    • In the first argument of ‘doPassD’, namely ‘floatInwards’
      In the expression: doPassD floatInwards
      In the expression:
        {-# SCC "FloatInwards" #-} (doPassD floatInwards)

https://github.com/nomeata/ghc-speed-logs/blob/ae1b6dcd32fd2c8578ef3eee4c6f8...

The overall effect of this patch was (as you already know):

benchmark                      before      change       after
nofib/time/binary-trees         0.751     - 4.79%       0.715 seconds
nofib/time/fannkuch-redux       4.751     - 3.85%       4.568 seconds
nofib/time/integer              1.276    + 19.04%       1.519 seconds

All sizes increase by 3 or 4%.

benchmark                      before      change       after
tests/alloc/T10547           32406096     - 4.48%    30953160 bytes
tests/alloc/T10858          259699544     - 4.94%   246866000 bytes
tests/alloc/T12227         1654153320    - 35.87%  1060777528 bytes
tests/alloc/T12234           75197448     - 7.02%    69918192 bytes
tests/alloc/T12707         1309049328     - 5.06%  1242803272 bytes
tests/alloc/T13035           90082344     - 4.04%    86438544 bytes
tests/alloc/T13056          512447048    - 20.21%   408873760 bytes
tests/alloc/T1969           756392264       - 19%   612713624 bytes
tests/alloc/T3064           287429088      - 8.9%   261860968 bytes
tests/alloc/T3294          2715661784     - 3.51%  2620404344 bytes
tests/alloc/T4801           412672008     - 5.77%   388841920 bytes
tests/alloc/T5321FD         470413728     - 3.67%   453148744 bytes
tests/alloc/T5321Fun        500839840     - 3.11%   485276616 bytes
tests/alloc/T5642           836251056     - 5.19%   792875648 bytes
tests/alloc/T5837            51684016     - 3.97%    49631216 bytes
tests/alloc/T6048            98489944      + 3.4%   101835168 bytes
tests/alloc/T783            462334328     - 5.21%   438237272 bytes
tests/alloc/T9020           775878448    - 35.27%   502248184 bytes
tests/alloc/T9872a         3136944168     - 6.81%  2923428352 bytes
tests/alloc/T9872b         3964092608     - 5.85%  3732226832 bytes
tests/alloc/T9872c         3603773864     - 5.49%  3405843000 bytes
tests/alloc/T9872d          466420232      - 5.1%   442644168 bytes
tests/alloc/T9961           575612760    - 13.15%   499917080 bytes
tests/alloc/lazy-bs-all        436680     - 3.77%      420224 bytes
tests/alloc/parsing001      499038992     - 6.77%   465237088 bytes

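
The change column in these figures is the relative difference between the before and after values. As a small sketch (the function name `percentChange` is made up here, and rounding to two decimals is an assumption about how the numbers were displayed), the arithmetic can be reproduced like this:

```haskell
-- Relative change from `before` to `after`, in percent, rounded to two
-- decimal places. Negative means the metric went down (an improvement
-- for allocation and run time).
percentChange :: Double -> Double -> Double
percentChange before after = roundTo2 ((after - before) / before * 100)
  where
    -- Round to two decimals by scaling, rounding to an integer, scaling back.
    roundTo2 :: Double -> Double
    roundTo2 x = fromIntegral (round (x * 100) :: Integer) / 100

-- Two rows taken from the figures above: T12227 allocation (bytes)
-- and nofib/time/integer (seconds).
t12227Change :: Double
t12227Change = percentChange 1654153320 1060777528

nofibIntegerChange :: Double
nofibIntegerChange = percentChange 1.276 1.519
```

Evaluating `t12227Change` and `nofibIntegerChange` in GHCi should agree with the 35.87% reduction and the 19.04% increase reported for those two rows.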
| e90f4d7c6d3003039fa1647a3da3dafcaa75527b More tracing in SpecConstr

no changes.

Well, less helpful than expected, but hard to do better given a patch series where not every patch builds.

Greetings,
Joachim
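
The build error quoted above says that `doPassD` wants a pure function of type `DynFlags -> CoreProgram -> CoreProgram`, while `floatInwards` at that commit has the monadic type `ModGuts -> CoreM ModGuts`. The following is only a toy model of that shape mismatch: every type here is a stand-in invented for illustration (`CoreM` is simply `IO`, `CoreProgram` is a list of strings), and the `doPassD` body is a guess at its general shape, not GHC's definition. Why the intermediate commits ended up in this state is not stated in the thread.

```haskell
-- Stand-in types, NOT the real GHC definitions.
newtype DynFlags = DynFlags { optLevel :: Int }
type CoreProgram = [String]
newtype ModGuts = ModGuts { mgBinds :: CoreProgram }
type CoreM = IO

-- Lifts a pure, flags-dependent program transformation to a whole-module pass.
doPassD :: (DynFlags -> CoreProgram -> CoreProgram) -> ModGuts -> CoreM ModGuts
doPassD pass guts = pure guts { mgBinds = pass (DynFlags 1) (mgBinds guts) }

-- A pass in the pure style that doPassD expects.
floatInwardsPure :: DynFlags -> CoreProgram -> CoreProgram
floatInwardsPure _ = map ("floated: " ++)

-- The same pass in the monadic style reported as the "actual type".
floatInwardsM :: ModGuts -> CoreM ModGuts
floatInwardsM guts = pure guts { mgBinds = map ("floated: " ++) (mgBinds guts) }

-- Call site matching the pure pass: type checks.
oldCallSite :: ModGuts -> CoreM ModGuts
oldCallSite = doPassD floatInwardsPure

-- Call site matching the monadic pass: no wrapper needed, it already
-- has the type that doPassD would have produced.
newCallSite :: ModGuts -> CoreM ModGuts
newCallSite = floatInwardsM

-- Mixing the two is rejected by the type checker, in the same way as
-- the error in the build log:
-- brokenCallSite = doPassD floatInwardsM
```

In this model the pass and its call site have to change together; a commit that changes only one of them cannot build, which is the kind of situation that prevents per-commit measurements.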
participants (4)
- David Feuer
- Joachim Breitner
- Mikolaj Konarski
- Simon Peyton Jones