
#14610: newtype wrapping of a monadic stack kills performance
-------------------------------------+-------------------------------------
        Reporter:  mrkkrp            |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  8.2.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Description changed by mrkkrp:

Old description:
I have a project where, in one module (A), I decided to build something like a minimal framework for another, bigger module (B) to use. One of the main points of the framework is that the monadic stacks are hidden behind `newtype`s, and in those monads you can only use the functions that module A provides.
The module A can be found here:
https://github.com/mrkkrp/mmark/blob/master/Text/MMark/Parser/Internal.hs
There are two monadic stacks wrapped in `newtype`s: `BParser` and `IParser`.
The module B is this:
https://github.com/mrkkrp/mmark/blob/master/Text/MMark/Parser.hs
But it's not really relevant.
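For reference, the `newtype` versions wrap exactly the same stacks; a sketch (omitting the instances and helper functions that module A defines around them):

{{{
-- The wrappers, abridged: module A does not export the constructors and
-- instead provides its own set of functions for working in these monads.
newtype BParser a = BParser (ParsecT MMarkErr Text (State BlockState) a)
newtype IParser a = IParser (StateT InlineState (Parsec MMarkErr Text) a)
}}}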
Now if I change `newtype`s to `type` synonyms like so:
{{{
type BParser a = ParsecT MMarkErr Text (State BlockState) a
type IParser a = StateT InlineState (Parsec MMarkErr Text) a
}}}
and make the corresponding minor adjustments, I get roughly half the allocations and almost 2x faster code. Before:
{{{
Case                                            Allocated  GCs     Max
with file: data/bench-yaml-block.md               119,080    0  11,088
with file: data/bench-thematic-break.md            74,368    0  10,224
with file: data/bench-heading.md                  901,928    0   9,432
with file: data/bench-fenced-code-block.md        145,744    0   9,368
with file: data/bench-indented-code-block.md      124,312    0   9,368
with file: data/bench-unordered-list.md         2,010,496    1  10,784
with file: data/bench-ordered-list.md           2,025,016    1  10,728
with file: data/bench-blockquote.md             1,961,672    1  42,648
with file: data/bench-paragraph.md              2,084,104    1  42,648
with file: data/comprehensive.md               25,899,496   24  44,200

benchmarking with file: data/comprehensive.md
time                 3.590 ms   (3.578 ms .. 3.601 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.555 ms   (3.546 ms .. 3.565 ms)
std dev              31.07 μs   (24.63 μs .. 39.90 μs)
}}}
After:
{{{
Case                                            Allocated  GCs     Max
with file: data/bench-yaml-block.md               116,864    0  11,088
with file: data/bench-thematic-break.md            64,776    0  10,392
with file: data/bench-heading.md                  615,672    0   9,432
with file: data/bench-fenced-code-block.md        144,736    0   9,368
with file: data/bench-indented-code-block.md      123,352    0   9,368
with file: data/bench-unordered-list.md           795,072    0  41,568
with file: data/bench-ordered-list.md             808,808    0  41,512
with file: data/bench-blockquote.md               826,392    0  41,568
with file: data/bench-paragraph.md                881,432    0  41,568
with file: data/comprehensive.md               10,945,104   10  44,440

benchmarking with file: data/comprehensive.md
time                 2.451 ms   (2.448 ms .. 2.456 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.432 ms   (2.427 ms .. 2.437 ms)
std dev              15.86 μs   (13.16 μs .. 19.01 μs)
}}}
I'm not great at reading non-trivial Core, but I gave it a shot and dumped some Core. One thing I noticed is that the Core for module A is the same in both cases, so the problem is an inter-module problem. It is probably similar to what happens when definitions are not put into interface files with `INLINEABLE` and you end up without specialization, but here it perhaps has to do with the fact that B has no information about the internals of the monadic stacks from A. I'm not sure.
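To make the interface-file effect I have in mind concrete, here is a generic, hypothetical example (module and function names are made up, this is not code from mmark):

{{{
{-# LANGUAGE BangPatterns #-}
module CountExample (countM) where

-- A generic polymorphic function, nothing to do with mmark itself.
countM :: Monad m => (a -> m Bool) -> [a] -> m Int
countM p = go 0
  where
    go !acc []     = pure acc
    go !acc (x:xs) = do
      keep <- p x
      go (if keep then acc + 1 else acc) xs

-- Without this pragma the unfolding may not end up in the interface
-- file, so call sites in other modules cannot be specialised to their
-- concrete monad and keep passing the Monad dictionary.
{-# INLINABLE countM #-}
}}}

A caller in another module that uses `countM` at a concrete monad can only be specialised there if the unfolding is available in the interface file.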
Here are the beginnings of the Core dumps (before):
{{{
==================== Tidy Core ====================
2017-12-23 04:55:17.967944505 UTC

Result size of Tidy Core
  = {terms: 210,169, types: 209,675, coercions: 82,492, joins: 973/3,753}
}}}
After:
{{{
==================== Tidy Core ====================
2017-12-23 04:58:46.386265108 UTC

Result size of Tidy Core
  = {terms: 301,256, types: 263,104, coercions: 28,560, joins: 1,726/5,393}
}}}
So it looks like `newtype` wrapping reduces GHC's ability to create the join points SPJ talked about recently: the `newtype` version has far more coercions (82,492 vs 28,560) and far fewer join points (973/3,753 vs 1,726/5,393).
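For context, a join point is roughly a local function that is only ever tail-called, which GHC can compile as a plain jump instead of an allocated closure. A tiny, unrelated illustration of the kind of loop that is a candidate (whether it actually becomes a join point depends on what the optimiser sees around the calls):

{{{
{-# LANGUAGE BangPatterns #-}
module JoinPointExample (sumTo) where

-- 'go' is only ever tail-called, so GHC can turn it into a join point
-- (a jump) rather than a heap-allocated function closure.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)
}}}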
I do understand that you expect a minimal reproducing example, but I could not produce one even though I spent several hours in futile attempts. I started by creating a small project, defining a similar stack with a `newtype` wrapper and using it in a simple parser in another module. Then I benchmarked the parser. There is no difference between the `newtype`d code and the code that uses just type synonyms. I tried different variations; no difference.
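The attempt looked roughly like the sketch below (hypothetical module and function names, written against megaparsec 6, contemporary with the ticket); it showed no difference between the two variants, so it is only here to document what was tried:

{{{
-- File Stack.hs: the stack hidden behind a newtype (constructor not
-- exported), in the spirit of module A.
{-# LANGUAGE FlexibleInstances          #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE MultiParamTypeClasses      #-}
module Stack (P, runP) where

import Control.Applicative (Alternative)
import Control.Monad (MonadPlus)
import Control.Monad.State.Strict
import Data.Void (Void)
import Text.Megaparsec

newtype P a = P (ParsecT Void String (State Int) a)
  deriving ( Functor, Applicative, Alternative
           , Monad, MonadPlus, MonadParsec Void String )

runP :: P a -> String -> Either (ParseError Char Void) a
runP (P p) input = evalState (runParserT p "" input) 0

-- File Use.hs: a simple parser in another module, benchmarked with
-- weigh and criterion like the mmark benchmarks.
module Use (pWords) where

import Control.Applicative (many, some)
import Stack (P)
import Text.Megaparsec.Char (letterChar, space)

pWords :: P [String]
pWords = many (some letterChar <* space)
}}}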
The Core output is too big for me to analyze (it's about 25 MB and 33 MB), and I have no idea what's going on in there.
You're welcome to experiment with the repo; there are benchmarks for memory usage and criterion benchmarks for execution speed:
https://github.com/mrkkrp/mmark
If you like, I can create a separate branch with the newtypes replaced by type synonyms, so you can compare the two approaches more easily.
I'm submitting this because my friend convinced me that it's better to let you know (even though I could not create a minimal reproducing example on my own) than to let it go completely unnoticed.
New description:

I have a project where, in one module (A), I decided to build something like a minimal framework for another, bigger module (B) to use. One of the main points of the framework is that the monadic stacks are hidden behind `newtype`s, and in those monads you can only use the functions that module A provides.

The module A can be found here:

https://github.com/mrkkrp/mmark/blob/ghc-bug-newtypes/Text/MMark/Parser/Internal.hs

There are two monadic stacks wrapped in `newtype`s: `BParser` and `IParser`.

The module B is this:

https://github.com/mrkkrp/mmark/blob/ghc-bug-newtypes/Text/MMark/Parser.hs

But it's not really relevant.

Now if I change `newtype`s to `type` synonyms like so:

{{{
type BParser a = ParsecT MMarkErr Text (State BlockState) a
type IParser a = StateT InlineState (Parsec MMarkErr Text) a
}}}

and make the corresponding minor adjustments, I get roughly half the allocations and almost 2x faster code. Before:

{{{
Case                                            Allocated  GCs     Max
with file: data/bench-yaml-block.md               119,080    0  11,088
with file: data/bench-thematic-break.md            74,368    0  10,224
with file: data/bench-heading.md                  901,928    0   9,432
with file: data/bench-fenced-code-block.md        145,744    0   9,368
with file: data/bench-indented-code-block.md      124,312    0   9,368
with file: data/bench-unordered-list.md         2,010,496    1  10,784
with file: data/bench-ordered-list.md           2,025,016    1  10,728
with file: data/bench-blockquote.md             1,961,672    1  42,648
with file: data/bench-paragraph.md              2,084,104    1  42,648
with file: data/comprehensive.md               25,899,496   24  44,200

benchmarking with file: data/comprehensive.md
time                 3.590 ms   (3.578 ms .. 3.601 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.555 ms   (3.546 ms .. 3.565 ms)
std dev              31.07 μs   (24.63 μs .. 39.90 μs)
}}}

After:

{{{
Case                                            Allocated  GCs     Max
with file: data/bench-yaml-block.md               116,864    0  11,088
with file: data/bench-thematic-break.md            64,776    0  10,392
with file: data/bench-heading.md                  615,672    0   9,432
with file: data/bench-fenced-code-block.md        144,736    0   9,368
with file: data/bench-indented-code-block.md      123,352    0   9,368
with file: data/bench-unordered-list.md           795,072    0  41,568
with file: data/bench-ordered-list.md             808,808    0  41,512
with file: data/bench-blockquote.md               826,392    0  41,568
with file: data/bench-paragraph.md                881,432    0  41,568
with file: data/comprehensive.md               10,945,104   10  44,440

benchmarking with file: data/comprehensive.md
time                 2.451 ms   (2.448 ms .. 2.456 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.432 ms   (2.427 ms .. 2.437 ms)
std dev              15.86 μs   (13.16 μs .. 19.01 μs)
}}}

I'm not great at reading non-trivial Core, but I gave it a shot and dumped some Core. One thing I noticed is that the Core for module A is the same in both cases, so the problem is an inter-module problem. It is probably similar to what happens when definitions are not put into interface files with `INLINEABLE` and you end up without specialization, but here it perhaps has to do with the fact that B has no information about the internals of the monadic stacks from A. I'm not sure.

Here are the beginnings of the Core dumps (before):

{{{
==================== Tidy Core ====================
2017-12-23 04:55:17.967944505 UTC

Result size of Tidy Core
  = {terms: 210,169, types: 209,675, coercions: 82,492, joins: 973/3,753}
}}}

After:

{{{
==================== Tidy Core ====================
2017-12-23 04:58:46.386265108 UTC

Result size of Tidy Core
  = {terms: 301,256, types: 263,104, coercions: 28,560, joins: 1,726/5,393}
}}}

So it looks like `newtype` wrapping reduces GHC's ability to create the join points SPJ talked about recently.

I do understand that you expect a minimal reproducing example, but I could not produce one even though I spent several hours in futile attempts. I started by creating a small project, defining a similar stack with a `newtype` wrapper and using it in a simple parser in another module. Then I benchmarked the parser. There is no difference between the `newtype`d code and the code that uses just type synonyms. I tried different variations; no difference.

The Core output is too big for me to analyze (it's about 25 MB and 33 MB), and I have no idea what's going on in there.

You're welcome to experiment with the repo; there are benchmarks for memory usage and criterion benchmarks for execution speed:

https://github.com/mrkkrp/mmark

I have created two branches I won't touch, one with newtypes and another one with type synonyms:

* https://github.com/mrkkrp/mmark/tree/ghc-bug-newtypes
* https://github.com/mrkkrp/mmark/tree/ghc-bug-type-synonyms

Just check out one of these and run `stack bench`. This is the commit that changes newtypes to type synonyms:

https://github.com/mrkkrp/mmark/commit/759d8d4aa52dd57a393299c63e8c9b70d0d43...

I'm submitting this because my friend convinced me that it's better to let you know (even though I could not create a minimal reproducing example on my own) than to let it go completely unnoticed.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14610#comment:1
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler