Re: [GHC] #7258: Compiling DynFlags is jolly slow

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "ghc-large-record-types.png" removed. Profile samples for W2.hs-style record types, comparing GHC performance for just the raw record type, derived Read, derived Show, and a hand- written getLine-based applicative record construction function. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler | Version: 7.6.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Changes (by thomie): * failure: None/Unknown => Compile-time performance bug -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler | Version: 7.6.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Changes (by bgamari): * cc: bgamari (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler | Version: 7.6.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Changes (by asr): * cc: asr (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler | Version: 7.6.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Changes (by gidyn): * cc: gidyn (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler | Version: 7.6.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by bgamari): I strongly suspect that some of the fixes that I've made while working on #7450 will help this although I have yet to test this. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:21 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler | Version: 7.6.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by bgamari): ticket:9669#comment:13 contains a summary of the performance impact of Phab:D1012 and Phab:D1040 on this and potentially-related ticket #7450. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:22 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by ezyang): * keywords: => deriving-perf -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:24 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by HairyDude): * cc: HairyDude (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:26 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by RyanGlScott): * cc: RyanGlScott (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:27 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by trommler): * cc: trommler (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:28 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Tobaias found that the bottlenecks on W2 were * Pretty-printing of assembly code * Register allocation He has some improvements in progress. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:29 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #13426 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by dfeuer): * related: => #13426 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:30 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by simonpj): * related: #13426 => Comment: Apparently [https://github.com/jwaldmann/pretty-test this] is a example of non-linearity in the pretty printer library. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:31 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by heisenbug): * cc: heisenbug (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:32 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): This commit: http://git.haskell.org/ghc.git/commitdiff/19a2ba3ea436b60466fc4022f53a6631e4... ...reduces complexity in register allocation spilling from quadratic to logarithmic. The old version would run a carthesian product on the list of allocated registers (`assig`) and the list of registers to `keep`; the new version uses set operations instead, based on `UniqSet` / `UniqFW`. I've also moved things around a little to pre-filter the list of allocations to only include the interesting entries in the first place, reducing the number of list items from linear to (as far as I can tell) constant. I believe there are further optimization opportunities here, such as: - changing the current list-based code to also use set/map operations - moving the register class check into the `candidates` part - precalculating the list of candidates (not sure about this one though) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:33 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): http://git.haskell.org/ghc.git/commitdiff/375d4bd0fc2afd72617bc827bf63b5eeb2... This one makes crucial parts of the pretty-printing code stricter; the modified functions score better in profiling, but the overall effect on compilation times seems to be minimal. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:34 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by heisenbug): Replying to [comment:33 tdammers]:
This commit:
http://git.haskell.org/ghc.git/commitdiff/19a2ba3ea436b60466fc4022f53a6631e4...
...reduces complexity in register allocation spilling from quadratic to
logarithmic. The old version would run a carthesian product on the list of allocated registers (`assig`) and the list of registers to `keep`; the new version uses set operations instead, based on `UniqSet` / `UniqFW`. I've also moved things around a little to pre-filter the list of allocations to only include the interesting entries in the first place, reducing the number of list items from linear to (as far as I can tell) constant. Hmmm, what do you think about having the cartesian product for a smaller cut-off size? Manipulating the sets surely also has a (large) constant factor built-in.
I believe there are further optimization opportunities here, such as:
- changing the current list-based code to also use set/map operations - moving the register class check into the `candidates` part - precalculating the list of candidates (not sure about this one though)
-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:35 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:35 heisenbug]:
Replying to [comment:33 tdammers]:
This commit:
http://git.haskell.org/ghc.git/commitdiff/19a2ba3ea436b60466fc4022f53a6631e4...
...reduces complexity in register allocation spilling from quadratic
to logarithmic. The old version would run a carthesian product on the list of allocated registers (`assig`) and the list of registers to `keep`; the new version uses set operations instead, based on `UniqSet` / `UniqFW`. I've also moved things around a little to pre-filter the list of allocations to only include the interesting entries in the first place, reducing the number of list items from linear to (as far as I can tell) constant.
Hmmm, what do you think about having the cartesian product for a smaller cut-off size? Manipulating the sets surely also has a (large) constant factor built-in.
Could be worth it, but I'm a bit skeptical - the additional overhead of unpacking the sets into lists could easily cancel out the performance benefit, and it would make the code more complicated. I'll give it a try though. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:36 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): While chatting with tdammers about this I had a peek at the core; it seems that 5000 terms of the 7000 terms in the simplified core of `W2` are in `creadPrec_rdaO`. Moreover, much of this is repetition. I've proposed an approach for dealing with this in #14364. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:37 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by dterei): * cc: dterei (removed) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:38 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari):
Hmmm, what do you think about having the cartesian product for a smaller cut-off size? Manipulating the sets surely also has a (large) constant factor built-in.
I actually suspected this in the past, but dfeuer did some measurements in the context of `ListSetUtils` and found that sets aren't measurably different. Don't let that stop you from quickly trying though. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:39 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:37 bgamari]:
While chatting with tdammers about this I had a peek at the core; it seems that 5000 terms of the 7000 terms in the simplified core of `W2` are in `creadPrec_rdaO`. Moreover, much of this is repetition. I've proposed an approach for dealing with this in #14364.
I've just done a bit more profiling, and it turns out that removing the derived `Read` instance from the `W3` example (`W2` extended to 500 fields) cuts overall compilation time down from 25.5 seconds to 3.6. So it seems that the edge case isn't just "constructors with many fields", but rather, "derived `Read` instances for constructors with many fields". -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:40 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): On further thought, #14364 is a bit of a stop-gap measure; generating more efficient `Read` instances is of course a good thing, but the register allocator edge case remains, and hand-written code (or TH-generated code) that follows a similar pattern will still hit it. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:41 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): A few things I have tried that do not seem to make a difference: - Pre-filtering the `assig` list to retain only the "interesting" entries (`InReg` and `InBoth`), and passing the filtered list around separately; while in theory this would get rid of unnecessary list filtering passes, in practice it seems that the overhead of maintaining a second list in parallel cancels out anything we may have won - Collecting the `InReg` and `InBoth` candidate lists in a single pass. The idea behind this was that we would visit each list entry just once, putting it in one bin or the other, rather than traversing the list of candidates twice. In practice however there is no measurable performance benefit. - Merging the filtering predicates (checking whether a candidate is `InReg` or `InBoth`, and checking whether the register class matches); the idea here was that doing both checks in one pass might help, but it doesn't, the performance is exactly the same. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:42 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "ghc-large-record-types.png" added. Profile samples for W2.hs-style record types, comparing GHC performance for just the raw record type, derived Read, derived Show, and a hand- written getLine-based applicative record construction function. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "profiles.csv" added. Sample GHC profile data for W2.hs style record type, comparing just the record type, derived Read, derived Show, and a hand-written getLine-based applicative record construction function -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "ghc-large-record-types.png" added. Performance profiling data for large record types. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "examples.zip" added. Code examples for the large records performance plot -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Latest profiling results: - graph plot: https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc- large-record-types.png - raw results: https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/profiles.csv - example code: https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/examples.zip The test module defines a record type of N fields, named `field0` through `field{N-1}`, called `DT`; the performance metric is the "total time" value from the profiler output for GHC compiling the module on its own. The codenames refer to different versions of the test module (all of them defining the same `DT` type though): - `nothing`: just a raw record type, no instances or functions. This is the baseline. - `show`: a derived Show instance (`deriving (Show)`) - `read`: a derived Read instance (`deriving (Read)`) - `read-appl`: a hand-written Read instance using applicative syntax (`DT <$> a1 <*> a2 <*> a3 <*> ...`) - `getline`: a hand-written function consisting of a series of N monadic binds followed by a `return` of an N-argument call to the `DT` constructor. This is the same shape as what the derived Read and Show instances produce internally. - `getline-appl`: a hand-written function that reads fields from stdin using `getLine` and applicatively combines them into a value of our record type: `DT <$> (read <$> getLine) <*> (read <$> getLine) <*> ...` - `getline-standalone`: a set of N functions performing one `getLine` call each: `getlD :: Int -> IO Int; getlD i = read <$> getLine` - `show-standalone`: a set of N functions, each constructing a string by combining an integer argument with the function's index The `-standalone` and `nothing` flavors were added as controls. Looking at the plot, the pattern is pretty obvious: the applicative versions show performance patterns very similar to the control, while the three versions that use many monadic binds all exhibit clearly nonlinear performance. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:43 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "examples.zip" added. Code examples for the large records performance plot -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): So apparently, good shape (linear-ish performance): {{{ D <$> a1 <*> a2 <*> a2 <*> ... <*> aN }}} Bad shape (exponential-ish performance): {{{ do v1 <- a1 v2 <- a2 v3 <- a3 ... vN <- aN return $ D v1 v2 v3 ... vN }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:44 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Core for monadic-bind style Read instance: {{{ [1 of 1] Compiling D ( examples/t-10-read.hs, examples/t-10-read.o ) ==================== Tidy Core ==================== Result size of Tidy Core = {terms: 373, types: 324, coercions: 0} -- RHS size: {terms: 269, types: 160, coercions: 0} $creadPrec_r1Hl $creadPrec_r1Hl = parens (prec (I# 11#) (>> $fMonadReadPrec (expectP (Ident (unpackCString# "DT"#))) (>> $fMonadReadPrec (expectP (Punc (unpackCString# "{"#))) (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field0"#))) (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#))) (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a1_a1tq -> >> $fMonadReadPrec (expectP (Punc (unpackCString# ","#))) (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field2"#))) (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#))) (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a2_a1tr -> >> $fMonadReadPrec (expectP (Punc (unpackCString# ","#))) (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field3"#))) (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#))) (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a3_a1ts -> >> $fMonadReadPrec (expectP (Punc (unpackCString# ","#))) (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field4"#))) (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#))) (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a4_a1tt -> >> $fMonadReadPrec (expectP (Punc (unpackCString# ","#))) (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field5"#))) (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#))) (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a5_a1tu ->
$fMonadReadPrec (expectP (Punc (unpackCString# ","#))) (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field6"#))) (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#))) (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a6_a1tv ->
$fMonadReadPrec (expectP (Punc (unpackCString# ","#))) (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field7"#))) (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#))) (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a7_a1tw ->
$fMonadReadPrec (expectP (Punc (unpackCString# ","#))) (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field8"#))) (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#))) (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a8_a1tx ->
$fMonadReadPrec (expectP (Punc (unpackCString# ","#))) (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field9"#))) (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#))) (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a9_a1ty ->
$fMonadReadPrec (expectP (Punc (unpackCString# ","#))) (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field10"#))) (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#))) (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a10_a1tz ->
$fMonadReadPrec (expectP (Punc (unpackCString# "}"#))) (return $fMonadReadPrec (DT a1_a1tq a2_a1tr a3_a1ts a4_a1tt a5_a1tu a6_a1tv a7_a1tw a8_a1tx a9_a1ty a10_a1tz)))))))))))))))))))))))))))))))))))))))))))))
Rec { -- RHS size: {terms: 5, types: 1, coercions: 0} $fReadDT $fReadDT = C:Read $creadsPrec_r1I5 $creadList_r1I6 $creadPrec_r1Hl $creadListPrec_r1I7 -- RHS size: {terms: 2, types: 1, coercions: 0} $creadsPrec_r1I5 $creadsPrec_r1I5 = $dmreadsPrec $fReadDT -- RHS size: {terms: 2, types: 1, coercions: 0} $creadList_r1I6 $creadList_r1I6 = readListDefault $fReadDT -- RHS size: {terms: 2, types: 1, coercions: 0} $creadListPrec_r1I7 $creadListPrec_r1I7 = readListPrecDefault $fReadDT end Rec } -- RHS size: {terms: 5, types: 12, coercions: 0} field9 field9 = \ ds_d1H8 -> case ds_d1H8 of _ { DT ds1_d1H9 ds2_d1Ha ds3_d1Hb ds4_d1Hc ds5_d1Hd ds6_d1He ds7_d1Hf ds8_d1Hg ds9_d1Hh ds10_d1Hi -> ds9_d1Hh } -- RHS size: {terms: 5, types: 12, coercions: 0} field8 field8 = \ ds_d1GX -> case ds_d1GX of _ { DT ds1_d1GY ds2_d1GZ ds3_d1H0 ds4_d1H1 ds5_d1H2 ds6_d1H3 ds7_d1H4 ds8_d1H5 ds9_d1H6 ds10_d1H7 -> ds8_d1H5 } -- RHS size: {terms: 5, types: 12, coercions: 0} field7 field7 = \ ds_d1GM -> case ds_d1GM of _ { DT ds1_d1GN ds2_d1GO ds3_d1GP ds4_d1GQ ds5_d1GR ds6_d1GS ds7_d1GT ds8_d1GU ds9_d1GV ds10_d1GW -> ds7_d1GT } -- RHS size: {terms: 5, types: 12, coercions: 0} field6 field6 = \ ds_d1GB -> case ds_d1GB of _ { DT ds1_d1GC ds2_d1GD ds3_d1GE ds4_d1GF ds5_d1GG ds6_d1GH ds7_d1GI ds8_d1GJ ds9_d1GK ds10_d1GL -> ds6_d1GH } -- RHS size: {terms: 5, types: 12, coercions: 0} field5 field5 = \ ds_d1Gq -> case ds_d1Gq of _ { DT ds1_d1Gr ds2_d1Gs ds3_d1Gt ds4_d1Gu ds5_d1Gv ds6_d1Gw ds7_d1Gx ds8_d1Gy ds9_d1Gz ds10_d1GA -> ds5_d1Gv } -- RHS size: {terms: 5, types: 12, coercions: 0} field4 field4 = \ ds_d1Gf -> case ds_d1Gf of _ { DT ds1_d1Gg ds2_d1Gh ds3_d1Gi ds4_d1Gj ds5_d1Gk ds6_d1Gl ds7_d1Gm ds8_d1Gn ds9_d1Go ds10_d1Gp -> ds4_d1Gj } -- RHS size: {terms: 5, types: 12, coercions: 0} field3 field3 = \ ds_d1G4 -> case ds_d1G4 of _ { DT ds1_d1G5 ds2_d1G6 ds3_d1G7 ds4_d1G8 ds5_d1G9 ds6_d1Ga ds7_d1Gb ds8_d1Gc ds9_d1Gd ds10_d1Ge -> ds3_d1G7 } -- RHS size: {terms: 5, types: 12, coercions: 0} field2 field2 = \ ds_d1FT -> case ds_d1FT of _ { DT ds1_d1FU ds2_d1FV ds3_d1FW ds4_d1FX ds5_d1FY ds6_d1FZ ds7_d1G0 ds8_d1G1 ds9_d1G2 ds10_d1G3 -> ds2_d1FV } -- RHS size: {terms: 5, types: 12, coercions: 0} field10 field10 = \ ds_d1FI -> case ds_d1FI of _ { DT ds1_d1FJ ds2_d1FK ds3_d1FL ds4_d1FM ds5_d1FN ds6_d1FO ds7_d1FP ds8_d1FQ ds9_d1FR ds10_d1FS -> ds10_d1FS } -- RHS size: {terms: 5, types: 12, coercions: 0} field0 field0 = \ ds_d1Fx -> case ds_d1Fx of _ { DT ds1_d1Fy ds2_d1Fz ds3_d1FA ds4_d1FB ds5_d1FC ds6_d1FD ds7_d1FE ds8_d1FF ds9_d1FG ds10_d1FH -> ds1_d1Fy } -- RHS size: {terms: 2, types: 0, coercions: 0} $trModule1_r1I8 $trModule1_r1I8 = TrNameS "main"# -- RHS size: {terms: 2, types: 0, coercions: 0} $trModule2_r1I9 $trModule2_r1I9 = TrNameS "D"# -- RHS size: {terms: 3, types: 0, coercions: 0} $trModule $trModule = Module $trModule1_r1I8 $trModule2_r1I9 -- RHS size: {terms: 2, types: 0, coercions: 0} $tc'DT1_r1Ia $tc'DT1_r1Ia = TrNameS "'DT"# -- RHS size: {terms: 5, types: 0, coercions: 0} $tc'DT $tc'DT = TyCon 9521127001609462311## 17424978011088396301## $trModule $tc'DT1_r1Ia -- RHS size: {terms: 2, types: 0, coercions: 0} $tcDT1_r1Ib $tcDT1_r1Ib = TrNameS "DT"# -- RHS size: {terms: 5, types: 0, coercions: 0} $tcDT $tcDT = TyCon 14693474152448962618## 5168028270650093369## $trModule $tcDT1_r1Ib }}} And for applicative-style Read implementation: {{{ [1 of 1] Compiling D ( examples/t-10-read-appl.hs, examples/t-10-read-appl.o ) ==================== Tidy Core ==================== Result size of Tidy Core = {terms: 272, types: 324, coercions: 0} -- RHS size: {terms: 168, types: 160, coercions: 0} $creadsPrec_r1Gi $creadsPrec_r1Gi = \ p_aNu -> readP_to_S (<* $fApplicativeReadP (*> $fApplicativeReadP (string (unpackCString# "DT{"#)) (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<$> $fFunctorReadP DT (*> $fApplicativeReadP (string (unpackCString# "field0="#)) (readS_to_P (readsPrec $fReadInt p_aNu)))) (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field1="#)) (readS_to_P (readsPrec $fReadInt p_aNu))))) (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field2="#)) (readS_to_P (readsPrec $fReadInt p_aNu))))) (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field3="#)) (readS_to_P (readsPrec $fReadInt p_aNu))))) (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field4="#)) (readS_to_P (readsPrec $fReadInt p_aNu))))) (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field5="#)) (readS_to_P (readsPrec $fReadInt p_aNu))))) (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field6="#)) (readS_to_P (readsPrec $fReadInt p_aNu))))) (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field7="#)) (readS_to_P (readsPrec $fReadInt p_aNu))))) (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field8="#)) (readS_to_P (readsPrec $fReadInt p_aNu))))) (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field9="#)) (readS_to_P (readsPrec $fReadInt p_aNu)))))) (string (unpackCString# "}"#))) Rec { -- RHS size: {terms: 5, types: 1, coercions: 0} $fReadDT $fReadDT = C:Read $creadsPrec_r1Gi $creadList_r1GT $creadPrec_r1GU $creadListPrec_r1GV -- RHS size: {terms: 2, types: 1, coercions: 0} $creadList_r1GT $creadList_r1GT = $dmreadList $fReadDT -- RHS size: {terms: 2, types: 1, coercions: 0} $creadPrec_r1GU $creadPrec_r1GU = $dmreadPrec $fReadDT -- RHS size: {terms: 2, types: 1, coercions: 0} $creadListPrec_r1GV $creadListPrec_r1GV = $dmreadListPrec $fReadDT end Rec } -- RHS size: {terms: 5, types: 12, coercions: 0} field9 field9 = \ ds_d1G5 -> case ds_d1G5 of _ { DT ds1_d1G6 ds2_d1G7 ds3_d1G8 ds4_d1G9 ds5_d1Ga ds6_d1Gb ds7_d1Gc ds8_d1Gd ds9_d1Ge ds10_d1Gf -> ds9_d1Ge } -- RHS size: {terms: 5, types: 12, coercions: 0} field8 field8 = \ ds_d1FU -> case ds_d1FU of _ { DT ds1_d1FV ds2_d1FW ds3_d1FX ds4_d1FY ds5_d1FZ ds6_d1G0 ds7_d1G1 ds8_d1G2 ds9_d1G3 ds10_d1G4 -> ds8_d1G2 } -- RHS size: {terms: 5, types: 12, coercions: 0} field7 field7 = \ ds_d1FJ -> case ds_d1FJ of _ { DT ds1_d1FK ds2_d1FL ds3_d1FM ds4_d1FN ds5_d1FO ds6_d1FP ds7_d1FQ ds8_d1FR ds9_d1FS ds10_d1FT -> ds7_d1FQ } -- RHS size: {terms: 5, types: 12, coercions: 0} field6 field6 = \ ds_d1Fy -> case ds_d1Fy of _ { DT ds1_d1Fz ds2_d1FA ds3_d1FB ds4_d1FC ds5_d1FD ds6_d1FE ds7_d1FF ds8_d1FG ds9_d1FH ds10_d1FI -> ds6_d1FE } -- RHS size: {terms: 5, types: 12, coercions: 0} field5 field5 = \ ds_d1Fn -> case ds_d1Fn of _ { DT ds1_d1Fo ds2_d1Fp ds3_d1Fq ds4_d1Fr ds5_d1Fs ds6_d1Ft ds7_d1Fu ds8_d1Fv ds9_d1Fw ds10_d1Fx -> ds5_d1Fs } -- RHS size: {terms: 5, types: 12, coercions: 0} field4 field4 = \ ds_d1Fc -> case ds_d1Fc of _ { DT ds1_d1Fd ds2_d1Fe ds3_d1Ff ds4_d1Fg ds5_d1Fh ds6_d1Fi ds7_d1Fj ds8_d1Fk ds9_d1Fl ds10_d1Fm -> ds4_d1Fg } -- RHS size: {terms: 5, types: 12, coercions: 0} field3 field3 = \ ds_d1F1 -> case ds_d1F1 of _ { DT ds1_d1F2 ds2_d1F3 ds3_d1F4 ds4_d1F5 ds5_d1F6 ds6_d1F7 ds7_d1F8 ds8_d1F9 ds9_d1Fa ds10_d1Fb -> ds3_d1F4 } -- RHS size: {terms: 5, types: 12, coercions: 0} field2 field2 = \ ds_d1EQ -> case ds_d1EQ of _ { DT ds1_d1ER ds2_d1ES ds3_d1ET ds4_d1EU ds5_d1EV ds6_d1EW ds7_d1EX ds8_d1EY ds9_d1EZ ds10_d1F0 -> ds2_d1ES } -- RHS size: {terms: 5, types: 12, coercions: 0} field10 field10 = \ ds_d1EF -> case ds_d1EF of _ { DT ds1_d1EG ds2_d1EH ds3_d1EI ds4_d1EJ ds5_d1EK ds6_d1EL ds7_d1EM ds8_d1EN ds9_d1EO ds10_d1EP -> ds10_d1EP } -- RHS size: {terms: 5, types: 12, coercions: 0} field0 field0 = \ ds_d1Eu -> case ds_d1Eu of _ { DT ds1_d1Ev ds2_d1Ew ds3_d1Ex ds4_d1Ey ds5_d1Ez ds6_d1EA ds7_d1EB ds8_d1EC ds9_d1ED ds10_d1EE -> ds1_d1Ev } -- RHS size: {terms: 2, types: 0, coercions: 0} $trModule1_r1GW $trModule1_r1GW = TrNameS "main"# -- RHS size: {terms: 2, types: 0, coercions: 0} $trModule2_r1GX $trModule2_r1GX = TrNameS "D"# -- RHS size: {terms: 3, types: 0, coercions: 0} $trModule $trModule = Module $trModule1_r1GW $trModule2_r1GX -- RHS size: {terms: 2, types: 0, coercions: 0} $tc'DT1_r1GY $tc'DT1_r1GY = TrNameS "'DT"# -- RHS size: {terms: 5, types: 0, coercions: 0} $tc'DT $tc'DT = TyCon 9521127001609462311## 17424978011088396301## $trModule $tc'DT1_r1GY -- RHS size: {terms: 2, types: 0, coercions: 0} $tcDT1_r1GZ $tcDT1_r1GZ = TrNameS "DT"# -- RHS size: {terms: 5, types: 0, coercions: 0} $tcDT $tcDT = TyCon 14693474152448962618## 5168028270650093369## $trModule $tcDT1_r1GZ }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:45 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): getLine examples: monadic-bind style: {{{ [1 of 1] Compiling D ( examples/t-10-getline.hs, examples/t-10-getline.o ) ==================== Tidy Core ==================== Result size of Tidy Core = {terms: 182, types: 231, coercions: 0} -- RHS size: {terms: 5, types: 12, coercions: 0} field10 field10 = \ ds_d1JC -> case ds_d1JC of _ { DT ds1_d1JD ds2_d1JE ds3_d1JF ds4_d1JG ds5_d1JH ds6_d1JI ds7_d1JJ ds8_d1JK ds9_d1JL ds10_d1JM -> ds10_d1JM } -- RHS size: {terms: 5, types: 12, coercions: 0} field9 field9 = \ ds_d1Jr -> case ds_d1Jr of _ { DT ds1_d1Js ds2_d1Jt ds3_d1Ju ds4_d1Jv ds5_d1Jw ds6_d1Jx ds7_d1Jy ds8_d1Jz ds9_d1JA ds10_d1JB -> ds9_d1JA } -- RHS size: {terms: 5, types: 12, coercions: 0} field8 field8 = \ ds_d1Jg -> case ds_d1Jg of _ { DT ds1_d1Jh ds2_d1Ji ds3_d1Jj ds4_d1Jk ds5_d1Jl ds6_d1Jm ds7_d1Jn ds8_d1Jo ds9_d1Jp ds10_d1Jq -> ds8_d1Jo } -- RHS size: {terms: 5, types: 12, coercions: 0} field7 field7 = \ ds_d1J5 -> case ds_d1J5 of _ { DT ds1_d1J6 ds2_d1J7 ds3_d1J8 ds4_d1J9 ds5_d1Ja ds6_d1Jb ds7_d1Jc ds8_d1Jd ds9_d1Je ds10_d1Jf -> ds7_d1Jc } -- RHS size: {terms: 5, types: 12, coercions: 0} field6 field6 = \ ds_d1IU -> case ds_d1IU of _ { DT ds1_d1IV ds2_d1IW ds3_d1IX ds4_d1IY ds5_d1IZ ds6_d1J0 ds7_d1J1 ds8_d1J2 ds9_d1J3 ds10_d1J4 -> ds6_d1J0 } -- RHS size: {terms: 5, types: 12, coercions: 0} field5 field5 = \ ds_d1IJ -> case ds_d1IJ of _ { DT ds1_d1IK ds2_d1IL ds3_d1IM ds4_d1IN ds5_d1IO ds6_d1IP ds7_d1IQ ds8_d1IR ds9_d1IS ds10_d1IT -> ds5_d1IO } -- RHS size: {terms: 5, types: 12, coercions: 0} field4 field4 = \ ds_d1Iy -> case ds_d1Iy of _ { DT ds1_d1Iz ds2_d1IA ds3_d1IB ds4_d1IC ds5_d1ID ds6_d1IE ds7_d1IF ds8_d1IG ds9_d1IH ds10_d1II -> ds4_d1IC } -- RHS size: {terms: 5, types: 12, coercions: 0} field3 field3 = \ ds_d1In -> case ds_d1In of _ { DT ds1_d1Io ds2_d1Ip ds3_d1Iq ds4_d1Ir ds5_d1Is ds6_d1It ds7_d1Iu ds8_d1Iv ds9_d1Iw ds10_d1Ix -> ds3_d1Iq } -- RHS size: {terms: 5, types: 12, coercions: 0} field2 field2 = \ ds_d1Ic -> case ds_d1Ic of _ { DT ds1_d1Id ds2_d1Ie ds3_d1If ds4_d1Ig ds5_d1Ih ds6_d1Ii ds7_d1Ij ds8_d1Ik ds9_d1Il ds10_d1Im -> ds2_d1Ie } -- RHS size: {terms: 5, types: 12, coercions: 0} field0 field0 = \ ds_d1I1 -> case ds_d1I1 of _ { DT ds1_d1I2 ds2_d1I3 ds3_d1I4 ds4_d1I5 ds5_d1I6 ds6_d1I7 ds7_d1I8 ds8_d1I9 ds9_d1Ia ds10_d1Ib -> ds1_d1I2 } -- RHS size: {terms: 2, types: 0, coercions: 0} $trModule1_r1EM $trModule1_r1EM = TrNameS "main"# -- RHS size: {terms: 2, types: 0, coercions: 0} $trModule2_r1Ke $trModule2_r1Ke = TrNameS "D"# -- RHS size: {terms: 3, types: 0, coercions: 0} $trModule $trModule = Module $trModule1_r1EM $trModule2_r1Ke -- RHS size: {terms: 2, types: 0, coercions: 0} $tc'DT1_r1Kf $tc'DT1_r1Kf = TrNameS "'DT"# -- RHS size: {terms: 5, types: 0, coercions: 0} $tc'DT $tc'DT = TyCon 9521127001609462311## 17424978011088396301## $trModule $tc'DT1_r1Kf -- RHS size: {terms: 2, types: 0, coercions: 0} $tcDT1_r1Kg $tcDT1_r1Kg = TrNameS "DT"# -- RHS size: {terms: 5, types: 0, coercions: 0} $tcDT $tcDT = TyCon 14693474152448962618## 5168028270650093369## $trModule $tcDT1_r1Kg -- RHS size: {terms: 93, types: 82, coercions: 0} getlD getlD =
= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field1_aGE -> >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field12_aGF -> >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field13_aGG -> >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field14_aGH -> >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field15_aGI -> >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field16_aGJ -> >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field17_aGK -> >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field18_aGL -> >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field19_aGM -> >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field20_aGN -> return $fMonadIO (DT field1_aGE field12_aGF field13_aGG field14_aGH field15_aGI field16_aGJ field17_aGK field18_aGL field19_aGM field20_aGN)))))))))))
}}} vs. applicative: {{{ [1 of 1] Compiling D ( examples/t-10-getline-appl.hs, examples/t-10-getline-appl.o ) ==================== Tidy Core ==================== Result size of Tidy Core = {terms: 160, types: 264, coercions: 0} -- RHS size: {terms: 5, types: 12, coercions: 0} field10 field10 = \ ds_d1Je -> case ds_d1Je of _ { DT ds1_d1Jf ds2_d1Jg ds3_d1Jh ds4_d1Ji ds5_d1Jj ds6_d1Jk ds7_d1Jl ds8_d1Jm ds9_d1Jn ds10_d1Jo -> ds10_d1Jo } -- RHS size: {terms: 5, types: 12, coercions: 0} field9 field9 = \ ds_d1J3 -> case ds_d1J3 of _ { DT ds1_d1J4 ds2_d1J5 ds3_d1J6 ds4_d1J7 ds5_d1J8 ds6_d1J9 ds7_d1Ja ds8_d1Jb ds9_d1Jc ds10_d1Jd -> ds9_d1Jc } -- RHS size: {terms: 5, types: 12, coercions: 0} field8 field8 = \ ds_d1IS -> case ds_d1IS of _ { DT ds1_d1IT ds2_d1IU ds3_d1IV ds4_d1IW ds5_d1IX ds6_d1IY ds7_d1IZ ds8_d1J0 ds9_d1J1 ds10_d1J2 -> ds8_d1J0 } -- RHS size: {terms: 5, types: 12, coercions: 0} field7 field7 = \ ds_d1IH -> case ds_d1IH of _ { DT ds1_d1II ds2_d1IJ ds3_d1IK ds4_d1IL ds5_d1IM ds6_d1IN ds7_d1IO ds8_d1IP ds9_d1IQ ds10_d1IR -> ds7_d1IO } -- RHS size: {terms: 5, types: 12, coercions: 0} field6 field6 = \ ds_d1Iw -> case ds_d1Iw of _ { DT ds1_d1Ix ds2_d1Iy ds3_d1Iz ds4_d1IA ds5_d1IB ds6_d1IC ds7_d1ID ds8_d1IE ds9_d1IF ds10_d1IG -> ds6_d1IC } -- RHS size: {terms: 5, types: 12, coercions: 0} field5 field5 = \ ds_d1Il -> case ds_d1Il of _ { DT ds1_d1Im ds2_d1In ds3_d1Io ds4_d1Ip ds5_d1Iq ds6_d1Ir ds7_d1Is ds8_d1It ds9_d1Iu ds10_d1Iv -> ds5_d1Iq } -- RHS size: {terms: 5, types: 12, coercions: 0} field4 field4 = \ ds_d1Ia -> case ds_d1Ia of _ { DT ds1_d1Ib ds2_d1Ic ds3_d1Id ds4_d1Ie ds5_d1If ds6_d1Ig ds7_d1Ih ds8_d1Ii ds9_d1Ij ds10_d1Ik -> ds4_d1Ie } -- RHS size: {terms: 5, types: 12, coercions: 0} field3 field3 = \ ds_d1HZ -> case ds_d1HZ of _ { DT ds1_d1I0 ds2_d1I1 ds3_d1I2 ds4_d1I3 ds5_d1I4 ds6_d1I5 ds7_d1I6 ds8_d1I7 ds9_d1I8 ds10_d1I9 -> ds3_d1I2 } -- RHS size: {terms: 5, types: 12, coercions: 0} field2 field2 = \ ds_d1HO -> case ds_d1HO of _ { DT ds1_d1HP ds2_d1HQ ds3_d1HR ds4_d1HS ds5_d1HT ds6_d1HU ds7_d1HV ds8_d1HW ds9_d1HX ds10_d1HY -> ds2_d1HQ } -- RHS size: {terms: 5, types: 12, coercions: 0} field0 field0 = \ ds_d1HD -> case ds_d1HD of _ { DT ds1_d1HE ds2_d1HF ds3_d1HG ds4_d1HH ds5_d1HI ds6_d1HJ ds7_d1HK ds8_d1HL ds9_d1HM ds10_d1HN -> ds1_d1HE } -- RHS size: {terms: 2, types: 0, coercions: 0} $trModule1_r1Ey $trModule1_r1Ey = TrNameS "main"# -- RHS size: {terms: 2, types: 0, coercions: 0} $trModule2_r1JQ $trModule2_r1JQ = TrNameS "D"# -- RHS size: {terms: 3, types: 0, coercions: 0} $trModule $trModule = Module $trModule1_r1Ey $trModule2_r1JQ -- RHS size: {terms: 2, types: 0, coercions: 0} $tc'DT1_r1JR $tc'DT1_r1JR = TrNameS "'DT"# -- RHS size: {terms: 5, types: 0, coercions: 0} $tc'DT $tc'DT = TyCon 9521127001609462311## 17424978011088396301## $trModule $tc'DT1_r1JR -- RHS size: {terms: 2, types: 0, coercions: 0} $tcDT1_r1JS $tcDT1_r1JS = TrNameS "DT"# -- RHS size: {terms: 5, types: 0, coercions: 0} $tcDT $tcDT = TyCon 14693474152448962618## 5168028270650093369## $trModule $tcDT1_r1JS -- RHS size: {terms: 71, types: 115, coercions: 0} getlD getlD = <*> $fApplicativeIO (<*> $fApplicativeIO (<*> $fApplicativeIO (<*> $fApplicativeIO (<*> $fApplicativeIO (<*> $fApplicativeIO (<*> $fApplicativeIO (<*> $fApplicativeIO (<*> $fApplicativeIO (<$> $fFunctorIO DT (<$> $fFunctorIO (read $fReadInt) getLine)) (<$> $fFunctorIO (read $fReadInt) getLine)) (<$> $fFunctorIO (read $fReadInt) getLine)) (<$> $fFunctorIO (read $fReadInt) getLine)) (<$> $fFunctorIO (read $fReadInt) getLine)) (<$> $fFunctorIO (read $fReadInt) getLine)) (<$> $fFunctorIO (read $fReadInt) getLine)) (<$> $fFunctorIO (read $fReadInt) getLine)) (<$> $fFunctorIO (read $fReadInt) getLine)) (<$> $fFunctorIO (read $fReadInt) getLine) }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:46 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "ghc-large-record-types-optimized.png" added. Performance profiling data for large record types, using -O2. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Ha. Reminds me of #13253. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:47 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Is the labelling in the graph after comment:46 correct? read-appl looks worst by a long way, which wasn't true bef.re Ditto geline-appl seems worse than getline (which is invisible). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:48 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Here's a mystery. * I took `t-10-getline.hs` and `t-10-getine-appl.hs` from `examples.zip` just before comment:44 * I compiled both with `-O -dshow-passes -ddump-simpl`. * Both produced ''identical'' Core. In the early stages `getline-appl` is a little bigger, but they become identical after the first run of the simplifier following specialise/float-out. (I.e. early) I am at a loss for how to reconcile these results with the dramatically worse compile times for `getline` compared with `getline-appl` reported in the graph plot in comment:43. Tobias, any ideas? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:49 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:48 simonpj]:
Is the labelling in the graph after comment:46 correct? read-appl looks worst by a long way, which wasn't true bef.re Ditto geline-appl seems worse than getline (which is invisible).
Yes, it is correct; the difference between this new graph and the previous one is that the new one uses `-O2`, while the old one doesn't use any optimizations. The profiler breakdown looks radically different between the optimized and unoptimized versions, for *all* examples - RegAlloc- linear and ppr, which are the most serious offenders in the unoptimized run, account for less than 1% of execution time in the optimized case. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:50 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Here's a mystery.
* I took `t-10-getline.hs` and `t-10-getine-appl.hs` from `examples.zip` just before comment:44
* I compiled both with `-O -dshow-passes -ddump-simpl`.
* Both produced ''identical'' Core. In the early stages `getline-appl` is a little bigger, but they become identical after the first run of the simplifier following specialise/float-out. (I.e. early)
I am at a loss for how to reconcile these results with the dramatically worse compile times for `getline` compared with `getline-appl` reported in
#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:49 simonpj]: the graph plot in comment:43.
Tobias, any ideas?
Hmm, maybe `-O` vs. `-O2`? I'll run a few trials myself. Other thought, though probably not really an explanation: the t-10-XXXX examples are maybe too small to produce useful metrics, the performance differences only start to get dramatic past 100 fields or so. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:51 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):
the t-10-XXXX examples are maybe too small to produce useful metrics,
I wasn't measuring performance; I was just eye-balling the (identical) code.
The profiler breakdown looks radically different between the optimized and unoptimized versions, for *all* examples
OK, so there are two perhaps-different problems * With -O compiling `DynFlags` is jolly slow. (The original bug report.) * Without -O we get non-linear stuff going on (comment:43) I think the -O case is the higher-priority item. (Though I accept that compiling without -O is supposed to be fast.) I would stick to -O for now, not -O2. The latter involves `SpecConstr` and `LiberateCase` both of which can generate a lot of code. Again let's not ignore -O2 but let's nail -O first. So what do the -O graphs look like? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:52 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): I don't have -O graphs yet, unfortunately it takes a good while to generate those (I aborted the -O2 one after running all night and only making it to 420 fields). I'll start up a job and report back. One-shot experiments do suggest though that pretty-printing and register allocation aren't going to be bottlenecks here either. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:53 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow
-------------------------------------+-------------------------------------
Reporter: simonpj | Owner: simonpj
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.6.1
Resolution: | Keywords: deriving-perf
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari

#7258: Compiling DynFlags is jolly slow
-------------------------------------+-------------------------------------
Reporter: simonpj | Owner: simonpj
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.6.1
Resolution: | Keywords: deriving-perf
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "ghc-large-record-types-optimized.png" added. Performance graph for various examples, compiling with -O -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "ghc-large-record-types-optimized-O2.png" added. Performance graph for various examples, compiling with -O2 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Updated performance graph: https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc-large-record- types-optimized.png (this one uses -O, not -O2) Key observation: read, read-appl and show "explode" while all the other examples, including getline-appl and getline, behave properly. A sample profile like the one for 400-field read, shows that the bottleneck is no longer ASM output or register allocation anymore once -O is involved: {{{ Thu Oct 26 01:25 2017 Time and Allocation Profiling Report (Final) ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well- typed/devel/ghc/inplace/lib -B/home/tobias/well- typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-400-read.hs total time = 128.62 secs (128624 ticks @ 1000 us, 1 processor) total alloc = 37,659,351,320 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc sink CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59) 77.8 23.6 SimplTopBinds SimplCore compiler/simplCore/SimplCore.hs:761:39-74 9.6 30.0 FloatOutwards SimplCore compiler/simplCore/SimplCore.hs:471:40-66 3.0 10.4 RegAlloc-linear AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55) 1.6 5.0 OccAnal SimplCore compiler/simplCore/SimplCore.hs:(739,22)-(740,67) 1.1 4.2 pprNativeCode AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65) 1.0 4.7 NewStranal SimplCore compiler/simplCore/SimplCore.hs:480:40-63 0.9 4.2 CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 0.7 2.9 StgCmm HscMain compiler/main/HscMain.hs:(1426,13)-(1427,62) 0.6 1.9 regLiveness AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52) 0.5 1.4 genMachCode AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62) 0.4 1.3 NativeCodeGen CodeOutput compiler/main/CodeOutput.hs:171:18-78 0.4 1.6 CommonSubExpr SimplCore compiler/simplCore/SimplCore.hs:462:40-56 0.2 1.1 deSugar HscMain compiler/main/HscMain.hs:511:7-44 0.2 1.1 }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:56 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):
read, read-appl and show "explode" while all the other examples, including getline-appl and getline, behave properly.
Very good! {{{ sink CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59) 77.8 23.6 SimplTopBinds SimplCore compiler/simplCore/SimplCore.hs:761:39-74 9.6 30.0 FloatOutwards SimplCore compiler/simplCore/SimplCore.hs:471:40-66 3.0 10.4 }}} What does `-dshow-passes` show about the program size? I.e. are we getting very large Core? So `sink` in C-- seems very slow. Nearly 4 bytes in 5 of all allocation is in tis pass alone. Is this a non-linear algorithmic effect. e.g if you double the size of the input, does the `sink` time go up non-linearly while `simplTopBinds` goes up linearly? That's my guess. (Related question: does `sink` show up a a big item when compiling "normal" programs?) Guessing: the "shape" of the C-- code makes `sink` behave very badly. Can you characterise what that shape is, and where the algorithmic badness comes from? This is good... fixing these performance bumps will likely help ALL programs, even ones that don't hit the really bad spots. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:57 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:57 simonpj]:
read, read-appl and show "explode" while all the other examples, including getline-appl and getline, behave properly.
Very good!
{{{ sink CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59) 77.8 23.6 SimplTopBinds SimplCore compiler/simplCore/SimplCore.hs:761:39-74 9.6 30.0 FloatOutwards SimplCore compiler/simplCore/SimplCore.hs:471:40-66 3.0 10.4 }}}
What does `-dshow-passes` show about the program size? I.e. are we getting very large Core?
For a 100-line read example: {{{ Glasgow Haskell Compiler, Version 8.3.20171016, stage 2 booted by GHC version 8.0.1 Using binary package database: /home/tobias/well- typed/devel/ghc/inplace/lib/package.conf.d/package.cache package flags [] loading package database /home/tobias/well- typed/devel/ghc/inplace/lib/package.conf.d wired-in package ghc-prim mapped to ghc-prim-0.5.2.0 wired-in package integer-gmp mapped to integer-gmp-1.0.1.0 wired-in package base mapped to base-4.11.0.0 wired-in package rts mapped to rts wired-in package template-haskell mapped to template-haskell-2.13.0.0 wired-in package ghc mapped to ghc-8.3 wired-in package dph-seq not found. wired-in package dph-par not found. *** Checking old interface for D (use -ddump-hi-diffs for more details): *** Parser [D]: !!! Parser [D]: finished in 3.22 milliseconds, allocated 3.170 megabytes *** Renamer/typechecker [D]: !!! Renamer/typechecker [D]: finished in 124.33 milliseconds, allocated 72.473 megabytes *** Desugar [D]: Result size of Desugar (after optimization) = {terms: 2,667, types: 11,534, coercions: 0, joins: 0/0} !!! Desugar [D]: finished in 174.52 milliseconds, allocated 59.456 megabytes *** Simplifier [D]: Result size of Simplifier iteration=1 = {terms: 7,375, types: 18,990, coercions: 6,881, joins: 0/913} Result size of Simplifier iteration=2 = {terms: 3,722, types: 13,604, coercions: 1,351, joins: 0/198} Result size of Simplifier iteration=3 = {terms: 3,319, types: 12,598, coercions: 1,342, joins: 0/1} Result size of Simplifier = {terms: 3,319, types: 12,598, coercions: 1,342, joins: 0/1} !!! Simplifier [D]: finished in 1306.06 milliseconds, allocated 749.648 megabytes *** Specialise [D]: Result size of Specialise = {terms: 3,319, types: 12,598, coercions: 1,342, joins: 0/1} !!! Specialise [D]: finished in 19.96 milliseconds, allocated 17.006 megabytes *** Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) [D]: Result size of Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) = {terms: 4,923, types: 15,504, coercions: 1,342, joins: 0/0} !!! Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) [D]: finished in 122.82 milliseconds, allocated 103.551 megabytes *** Simplifier [D]: Result size of Simplifier iteration=1 = {terms: 4,115, types: 13,888, coercions: 1,342, joins: 0/0} Result size of Simplifier = {terms: 4,115, types: 13,888, coercions: 1,342, joins: 0/0} !!! Simplifier [D]: finished in 216.40 milliseconds, allocated 122.592 megabytes *** Simplifier [D]: Result size of Simplifier iteration=1 = {terms: 4,115, types: 13,282, coercions: 1,342, joins: 0/0} Result size of Simplifier = {terms: 4,115, types: 13,282, coercions: 1,342, joins: 0/0} !!! Simplifier [D]: finished in 183.50 milliseconds, allocated 131.494 megabytes *** Simplifier [D]: Result size of Simplifier iteration=1 = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0} Result size of Simplifier = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0} !!! Simplifier [D]: finished in 183.89 milliseconds, allocated 126.382 megabytes *** Float inwards [D]: Result size of Float inwards = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0} !!! Float inwards [D]: finished in 8.67 milliseconds, allocated 4.238 megabytes *** Called arity analysis [D]: Result size of Called arity analysis = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0} !!! Called arity analysis [D]: finished in 5.66 milliseconds, allocated 5.974 megabytes *** Simplifier [D]: Result size of Simplifier = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0} !!! Simplifier [D]: finished in 67.09 milliseconds, allocated 62.910 megabytes *** Demand analysis [D]: Result size of Demand analysis = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0} !!! Demand analysis [D]: finished in 170.27 milliseconds, allocated 91.193 megabytes *** Worker Wrapper binds [D]: Result size of Worker Wrapper binds = {terms: 4,541, types: 14,225, coercions: 1,342, joins: 0/3} !!! Worker Wrapper binds [D]: finished in 5.09 milliseconds, allocated 3.614 megabytes *** Simplifier [D]: Result size of Simplifier iteration=1 = {terms: 4,530, types: 14,215, coercions: 1,342, joins: 0/0} Result size of Simplifier = {terms: 4,530, types: 14,215, coercions: 1,342, joins: 0/0} !!! Simplifier [D]: finished in 351.93 milliseconds, allocated 219.389 megabytes *** Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) [D]: Result size of Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) = {terms: 4,530, types: 14,215, coercions: 1,342, joins: 0/0} !!! Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) [D]: finished in 268.13 milliseconds, allocated 280.555 megabytes *** Common sub-expression [D]: Result size of Common sub-expression = {terms: 3,740, types: 13,522, coercions: 1,342, joins: 0/0} !!! Common sub-expression [D]: finished in 38.50 milliseconds, allocated 48.593 megabytes *** Float inwards [D]: Result size of Float inwards = {terms: 3,740, types: 13,522, coercions: 1,342, joins: 0/0} !!! Float inwards [D]: finished in 5.37 milliseconds, allocated 3.369 megabytes *** Simplifier [D]: Result size of Simplifier = {terms: 2,952, types: 12,337, coercions: 1,342, joins: 0/0} !!! Simplifier [D]: finished in 140.01 milliseconds, allocated 98.855 megabytes *** Demand analysis [D]: Result size of Demand analysis = {terms: 2,952, types: 12,337, coercions: 1,342, joins: 0/0} !!! Demand analysis [D]: finished in 77.63 milliseconds, allocated 86.076 megabytes *** CoreTidy [D]: Result size of Tidy Core = {terms: 2,952, types: 12,337, coercions: 1,342, joins: 0/0} !!! CoreTidy [D]: finished in 54.47 milliseconds, allocated 69.051 megabytes Created temporary directory: /tmp/ghc27776_0 *** CorePrep [D]: Result size of CorePrep = {terms: 3,569, types: 13,155, coercions: 1,342, joins: 0/204} !!! CorePrep [D]: finished in 25.73 milliseconds, allocated 32.097 megabytes *** Stg2Stg: *** CodeGen [D]: !!! CodeGen [D]: finished in 2109.26 milliseconds, allocated 1630.642 megabytes *** Assembler: *** Deleting temp files: Warning: deleting non-existent /tmp/ghc27776_0/ghc_2.c *** Deleting temp dirs: }}} At 200 lines, we're looking at: {{{ Glasgow Haskell Compiler, Version 8.3.20171016, stage 2 booted by GHC version 8.0.1 Using binary package database: /home/tobias/well- typed/devel/ghc/inplace/lib/package.conf.d/package.cache package flags [] loading package database /home/tobias/well- typed/devel/ghc/inplace/lib/package.conf.d wired-in package ghc-prim mapped to ghc-prim-0.5.2.0 wired-in package integer-gmp mapped to integer-gmp-1.0.1.0 wired-in package base mapped to base-4.11.0.0 wired-in package rts mapped to rts wired-in package template-haskell mapped to template-haskell-2.13.0.0 wired-in package ghc mapped to ghc-8.3 wired-in package dph-seq not found. wired-in package dph-par not found. *** Checking old interface for D (use -ddump-hi-diffs for more details): *** Parser [D]: !!! Parser [D]: finished in 3.80 milliseconds, allocated 6.207 megabytes *** Renamer/typechecker [D]: !!! Renamer/typechecker [D]: finished in 230.12 milliseconds, allocated 138.058 megabytes *** Desugar [D]: Result size of Desugar (after optimization) = {terms: 5,267, types: 43,034, coercions: 0, joins: 0/0} !!! Desugar [D]: finished in 289.66 milliseconds, allocated 201.767 megabytes *** Simplifier [D]: Result size of Simplifier iteration=1 = {terms: 14,575, types: 57,790, coercions: 13,681, joins: 0/1,813} Result size of Simplifier iteration=2 = {terms: 7,322, types: 47,104, coercions: 2,651, joins: 0/398} Result size of Simplifier iteration=3 = {terms: 6,519, types: 45,098, coercions: 2,642, joins: 0/1} Result size of Simplifier = {terms: 6,519, types: 45,098, coercions: 2,642, joins: 0/1} !!! Simplifier [D]: finished in 5652.80 milliseconds, allocated 2700.163 megabytes *** Specialise [D]: Result size of Specialise = {terms: 6,519, types: 45,098, coercions: 2,642, joins: 0/1} !!! Specialise [D]: finished in 62.79 milliseconds, allocated 58.261 megabytes *** Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) [D]: Result size of Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) = {terms: 9,723, types: 50,904, coercions: 2,642, joins: 0/0} !!! Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) [D]: finished in 401.80 milliseconds, allocated 374.441 megabytes *** Simplifier [D]: Result size of Simplifier iteration=1 = {terms: 8,115, types: 47,688, coercions: 2,642, joins: 0/0} Result size of Simplifier = {terms: 8,115, types: 47,688, coercions: 2,642, joins: 0/0} !!! Simplifier [D]: finished in 754.47 milliseconds, allocated 393.159 megabytes *** Simplifier [D]: Result size of Simplifier iteration=1 = {terms: 8,115, types: 46,482, coercions: 2,642, joins: 0/0} Result size of Simplifier = {terms: 8,115, types: 46,482, coercions: 2,642, joins: 0/0} !!! Simplifier [D]: finished in 579.29 milliseconds, allocated 411.892 megabytes *** Simplifier [D]: Result size of Simplifier iteration=1 = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0} Result size of Simplifier = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0} !!! Simplifier [D]: finished in 630.25 milliseconds, allocated 382.851 megabytes *** Float inwards [D]: Result size of Float inwards = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0} !!! Float inwards [D]: finished in 8.17 milliseconds, allocated 8.300 megabytes *** Called arity analysis [D]: Result size of Called arity analysis = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0} !!! Called arity analysis [D]: finished in 21.61 milliseconds, allocated 11.860 megabytes *** Simplifier [D]: Result size of Simplifier = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0} !!! Simplifier [D]: finished in 362.68 milliseconds, allocated 190.967 megabytes *** Demand analysis [D]: Result size of Demand analysis = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0} !!! Demand analysis [D]: finished in 551.36 milliseconds, allocated 339.529 megabytes *** Worker Wrapper binds [D]: Result size of Worker Wrapper binds = {terms: 8,941, types: 48,325, coercions: 2,642, joins: 0/3} !!! Worker Wrapper binds [D]: finished in 7.70 milliseconds, allocated 11.542 megabytes *** Simplifier [D]: Result size of Simplifier iteration=1 = {terms: 8,930, types: 48,315, coercions: 2,642, joins: 0/0} Result size of Simplifier = {terms: 8,930, types: 48,315, coercions: 2,642, joins: 0/0} !!! Simplifier [D]: finished in 1307.95 milliseconds, allocated 727.784 megabytes *** Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) [D]: Result size of Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) = {terms: 8,930, types: 48,315, coercions: 2,642, joins: 0/0} !!! Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) [D]: finished in 1393.21 milliseconds, allocated 1078.391 megabytes *** Common sub-expression [D]: Result size of Common sub-expression = {terms: 7,340, types: 46,922, coercions: 2,642, joins: 0/0} !!! Common sub-expression [D]: finished in 227.12 milliseconds, allocated 194.184 megabytes *** Float inwards [D]: Result size of Float inwards = {terms: 7,340, types: 46,922, coercions: 2,642, joins: 0/0} !!! Float inwards [D]: finished in 11.63 milliseconds, allocated 6.545 megabytes *** Simplifier [D]: Result size of Simplifier = {terms: 5,752, types: 44,537, coercions: 2,642, joins: 0/0} !!! Simplifier [D]: finished in 590.60 milliseconds, allocated 349.746 megabytes *** Demand analysis [D]: Result size of Demand analysis = {terms: 5,752, types: 44,537, coercions: 2,642, joins: 0/0} !!! Demand analysis [D]: finished in 485.19 milliseconds, allocated 330.909 megabytes *** CoreTidy [D]: Result size of Tidy Core = {terms: 5,752, types: 44,537, coercions: 2,642, joins: 0/0} !!! CoreTidy [D]: finished in 293.13 milliseconds, allocated 258.898 megabytes Created temporary directory: /tmp/ghc18725_0 *** CorePrep [D]: Result size of CorePrep = {terms: 6,969, types: 46,155, coercions: 2,642, joins: 0/404} !!! CorePrep [D]: finished in 163.49 milliseconds, allocated 113.842 megabytes *** Stg2Stg: *** CodeGen [D]: !!! CodeGen [D]: finished in 23041.19 milliseconds, allocated 12665.144 megabytes *** Assembler: *** Deleting temp files: Warning: deleting non-existent /tmp/ghc18725_0/ghc_2.c *** Deleting temp dirs: }}} Not sure whether this qualifies as "unusually large". `-ddump-cmm` shows that the C-- for a 100-field read instance is more than twice as large as that for the 100-field getline example (39234 lines vs 14895), but size alone doesn't explain the huge difference. However, at 300 fields, the read example produces 237541 lines of C--, while getline grows roughly linearly, to 47312 lines, meaning that there is definitely some sort of nonlinearity going on wrt C-- code size.
So `sink` in C-- seems very slow. Nearly 4 bytes in 5 of all allocation is in tis pass alone. Is this a non-linear algorithmic effect. e.g if you double the size of the input, does the `sink` time go up non-linearly while `simplTopBinds` goes up linearly? That's my guess.
Yes, in fact it does, quite clearly so: 10-field example: {{{ Wed Oct 25 14:57 2017 Time and Allocation Profiling Report (Final) ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well- typed/devel/ghc/inplace/lib -B/home/tobias/well- typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-10-read.hs total time = 0.19 secs (188 ticks @ 1000 us, 1 processor) total alloc = 134,109,112 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc SimplTopBinds SimplCore compiler/simplCore/SimplCore.hs:761:39-74 35.6 28.7 tc_rn_src_decls TcRnDriver compiler/typecheck/TcRnDriver.hs:(493,4)-(555,7) 9.0 9.0 OccAnal SimplCore compiler/simplCore/SimplCore.hs:(739,22)-(740,67) 4.8 5.5 FloatOutwards SimplCore compiler/simplCore/SimplCore.hs:471:40-66 4.3 5.1 pprNativeCode AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65) 3.7 3.5 tcRnImports TcRnDriver compiler/typecheck/TcRnDriver.hs:240:20-50 3.7 3.6 StgCmm HscMain compiler/main/HscMain.hs:(1426,13)-(1427,62) 3.7 2.1 sink CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59) 3.2 1.5 setSessionDynFlags GHC compiler/main/GHC.hs:(578,1)-(584,16) 3.2 4.7 NativeCodeGen CodeOutput compiler/main/CodeOutput.hs:171:18-78 2.1 0.9 NewStranal SimplCore compiler/simplCore/SimplCore.hs:480:40-63 2.1 2.9 bin_tycldecls HscTypes compiler/main/HscTypes.hs:1085:52-57 1.6 1.0 initGhcMonad GHC compiler/main/GHC.hs:(493,1)-(503,25) 1.6 4.7 MAIN MAIN <built-in> 1.1 0.3 fixStgRegisters AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:566:17-42 1.1 0.2 genMachCode AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62) 1.1 1.3 shortcutBranches AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:694:17-54 1.1 0.2 elimCommonBlocks CmmPipeline compiler/cmm/CmmPipeline.hs:(71,13)-(72,76) 1.1 1.0 CAF PrimOp <entire-module> 1.1 0.5 CAF PrelInfo <entire-module> 1.1 1.0 deSugar HscMain compiler/main/HscMain.hs:511:7-44 1.1 0.9 CorePrep HscMain compiler/main/HscMain.hs:(1313,24)-(1314,57) 1.1 0.6 withCleanupSession GHC compiler/main/GHC.hs:(466,1)-(475,37) 1.1 0.8 regLiveness AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52) 0.5 2.0 RegAlloc-linear AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55) 0.5 2.4 layoutStack CmmPipeline compiler/cmm/CmmPipeline.hs:(97,13)-(99,40) 0.5 1.0 CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 0.5 1.0 }}} 100 fields: {{{ Wed Oct 25 15:00 2017 Time and Allocation Profiling Report (Final) ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well- typed/devel/ghc/inplace/lib -B/home/tobias/well- typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-100-read.hs total time = 3.10 secs (3103 ticks @ 1000 us, 1 processor) total alloc = 2,397,379,664 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc SimplTopBinds SimplCore compiler/simplCore/SimplCore.hs:761:39-74 38.9 36.6 sink CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59) 16.0 7.3 FloatOutwards SimplCore compiler/simplCore/SimplCore.hs:471:40-66 7.9 10.8 OccAnal SimplCore compiler/simplCore/SimplCore.hs:(739,22)-(740,67) 4.3 5.8 pprNativeCode AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65) 4.1 5.3 RegAlloc-linear AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55) 3.8 4.6 StgCmm HscMain compiler/main/HscMain.hs:(1426,13)-(1427,62) 2.8 2.5 NewStranal SimplCore compiler/simplCore/SimplCore.hs:480:40-63 2.8 4.8 regLiveness AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52) 1.8 1.9 CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 1.8 2.9 genMachCode AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62) 1.8 1.6 tc_rn_src_decls TcRnDriver compiler/typecheck/TcRnDriver.hs:(493,4)-(555,7) 1.3 1.3 NativeCodeGen CodeOutput compiler/main/CodeOutput.hs:171:18-78 1.0 1.7 CommonSubExpr SimplCore compiler/simplCore/SimplCore.hs:462:40-56 0.8 1.2 deSugar HscMain compiler/main/HscMain.hs:511:7-44 0.8 1.3 }}} 400 fields: {{{ Thu Oct 26 01:25 2017 Time and Allocation Profiling Report (Final) ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well- typed/devel/ghc/inplace/lib -B/home/tobias/well- typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-400-read.hs total time = 128.62 secs (128624 ticks @ 1000 us, 1 processor) total alloc = 37,659,351,320 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc sink CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59) 77.8 23.6 SimplTopBinds SimplCore compiler/simplCore/SimplCore.hs:761:39-74 9.6 30.0 FloatOutwards SimplCore compiler/simplCore/SimplCore.hs:471:40-66 3.0 10.4 RegAlloc-linear AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55) 1.6 5.0 OccAnal SimplCore compiler/simplCore/SimplCore.hs:(739,22)-(740,67) 1.1 4.2 pprNativeCode AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65) 1.0 4.7 NewStranal SimplCore compiler/simplCore/SimplCore.hs:480:40-63 0.9 4.2 CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 0.7 2.9 StgCmm HscMain compiler/main/HscMain.hs:(1426,13)-(1427,62) 0.6 1.9 regLiveness AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52) 0.5 1.4 genMachCode AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62) 0.4 1.3 NativeCodeGen CodeOutput compiler/main/CodeOutput.hs:171:18-78 0.4 1.6 CommonSubExpr SimplCore compiler/simplCore/SimplCore.hs:462:40-56 0.2 1.1 deSugar HscMain compiler/main/HscMain.hs:511:7-44 0.2 1.1 }}}
(Related question: does `sink` show up a a big item when compiling "normal" programs?)
It does, but nowhere near this dominantly. E.g. getline, 400 fields: {{{ Thu Oct 26 02:54 2017 Time and Allocation Profiling Report (Final) ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well- typed/devel/ghc/inplace/lib -B/home/tobias/well- typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-400-getline.hs total time = 14.13 secs (14126 ticks @ 1000 us, 1 processor) total alloc = 12,479,030,504 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc sink CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59) 31.7 3.3 SimplTopBinds SimplCore compiler/simplCore/SimplCore.hs:761:39-74 24.7 39.0 OccAnal SimplCore compiler/simplCore/SimplCore.hs:(739,22)-(740,67) 7.2 9.7 NewStranal SimplCore compiler/simplCore/SimplCore.hs:480:40-63 6.5 10.4 CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 5.2 7.6 FloatOutwards SimplCore compiler/simplCore/SimplCore.hs:471:40-66 4.1 6.2 StgCmm HscMain compiler/main/HscMain.hs:(1426,13)-(1427,62) 2.0 1.6 deSugar HscMain compiler/main/HscMain.hs:511:7-44 2.0 3.3 CommonSubExpr SimplCore compiler/simplCore/SimplCore.hs:462:40-56 1.9 3.2 CorePrep HscMain compiler/main/HscMain.hs:(1313,24)-(1314,57) 1.4 2.5 Specialise SimplCore compiler/simplCore/SimplCore.hs:486:40-50 0.8 1.3 StgCse SimplStg compiler/simplStg/SimplStg.hs:(88,14)-(91,44) 0.5 1.3 }}}
Guessing: the "shape" of the C-- code makes `sink` behave very badly. Can you characterise what that shape is, and where the algorithmic badness comes from?
Not yet, but I have a hunch that it's simply a matter of producing too much of it. Maybe there is quadratic behavior there, where the number of lines of C-- grows proportionally with the square of the number of fields? That seems like a plausible explanation so far. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:58 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):
Maybe there is quadratic behavior there, where the number of lines of C-- grows proportionally with the square of the number of fields?
But then all the C-- passes would get slower, wouldn't they? Still it'd be easy to measure code size of * Core * C-- as the number of fields increases. My guess is that there's an algorithmic blow-up (quadratic or worse) in `sink`, so that it starts to dominate when there is a very large `CmmProc`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:59 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by alexbiehl): Tobias, I had a quick glimpse over CmmSink and [https://github.com/ghc/ghc/blob/master/compiler/cmm/CmmSink.hs#L409 `skipped`] looks suspicious: For every assignment `l = e` which is not sunk `l` is [https://github.com/ghc/ghc/blob/master/compiler/cmm/CmmSink.hs#L409 consed] to `skipped`. And for each assignment `l = e` is checked a) is it element of sink and b) is there any register in skipped mentioned in `e`: {{{ cannot_inline = skipped `regsUsedIn` rhs -- Note [dependent assignments] || l `elem` skipped || not (okToInline dflags rhs node) }}} Considering the amount of code generated this might be a thing. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:60 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Indeed. Actually any use of `elem` on a long list is suspicious. c.f. `ListSetOpts.unionLists` which warns if it sees long lists. Maybe we could have `ListSetOps.checkedElem` which checks for over-long lists and warns. And change all uses of `elem` into `checkedElem`. There may be more of these land-mines! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:61 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "ghc-large-record-types-cmm-size.png" added. Size of C-- for large record examples -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): More data! https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc- large-record-types-cmm-size.png shows C-- code size growth as the number of fields grows for the same generated examples as in the other graphs. getline and getline-appl generate the exact same C--, so there's no point including both in the graph; they are the only ones that grow linearly among the "interesting" examples. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:62 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Interesting. I can't tell whether that's as non-linear as the runtime, but the bad cases are clearly big! What happens to the size of Core? (Neglect types and coercions which are erased.) And finally does the Core-to-Cmm size ratio go up? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:63 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Indeed. Actually any use of `elem` on a long list is suspicious. c.f. `ListSetOpts.unionLists` which warns if it sees long lists.
Maybe we could have `ListSetOps.checkedElem` which checks for over-long
#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:61 simonpj]: lists and warns. And change all uses of `elem` into `checkedElem`. There may be more of these land-mines! What about using a different kind of data structure like a map or similar, that has logarithmic lookup performance? That would work even if we cannot reduce the size of the list, but if the list is short-lived, constructing the map might introduce prohibitive overhead? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:64 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:63 simonpj]:
Interesting. I can't tell whether that's as non-linear as the runtime, but the bad cases are clearly big!
What happens to the size of Core? (Neglect types and coercions which are erased.)
And finally does the Core-to-Cmm size ratio go up?
I'll check that next, unfortunately this means I have to run all those examples again... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:65 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):
What about using a different kind of data structure like a map or similar
Of course that the first thing we should do, once we are convinced that's where the time is going! (We only ave suggestive evidence so far.) Small maps should be cheap; that's the business of the containers library. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:66 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

What about using a different kind of data structure like a map or similar
Of course that the first thing we should do, once we are convinced
#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by alexbiehl): Replying to [comment:66 simonpj]: that's where the time is going! (We only ave suggestive evidence so far.) Small maps should be cheap; that's the business of the containers library. `LocalReg` is `Uniquable` so something like this should work: {{{ type LocalRegSet = UniqSet LocalReg emptyLocalRegSet :: LocalRegSet emptyLocalRegSet = emptyUniqSet nullLocalRegSet :: LocalRegSet -> Bool nullLocalRegSet = isEmptyUniqSet elemLocalRegSet :: LocalReg -> LocalRegSet -> Bool elemLocalRegSet = elementOfUniqSet insertLocalRegSet :: LocalRegSet -> LocalReg -> LocalRegSet insertLocalRegSet = addOneToUniqSet cannot_inline = skipped `regsUsedIn` rhs -- Note [dependent assignments] || l `elemLocalRegSet` skipped || not (okToInline dflags rhs node) regsUsedIn :: LocalRegSet -> CmmExpr -> Bool regsUsedIn ls _ | nullLocalRegSet ls = False regsUsedIn ls e = wrapRecExpf f e False where f (CmmReg (CmmLocal l)) _ | l `elemLocalRegSet` ls = True f (CmmRegOff (CmmLocal l) _) _ | l `elemLocalRegSet` ls = True f _ z = z }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:67 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "ghc-large-record-types-cmm-per-core.png" added. C-- / Core size ratio for large record modules -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Interesting: https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc- large-record-types-cmm-per-core.png For the three offenders (read, show, read-appl), the C--/Core ratio goes *up* as the total code size grows, for the well-behaved examples it goes down. The latter is expected (constant overhead + linear code size), the former hints at some sort of nonlinearity. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:68 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "ghc-large-record-types-cmm-per-core.png" added. C-- / Core size ratio for large record modules -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): That is indeed interesting. * The bad cases are `read`, `read-appl` and `show` (comment:56) * They all generate very big Cmm (png just before (comment:62) * They all generate unusually large Core-to-Cmm ratio (comment:68) * This very large Cmm makes `sink` go bananas. Is the large Cmm accounted for by the big Core-to-Cmm ratio, or do these examples also have Core sizes that scale badly (look at term size only)? I'm guessing that the Core size does not scale badly. If that was so our questions would be * Why do we generate so much Cmm per unit of Core * Can we make `sink` scale better Both are worth working on. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:69 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by alexbiehl): I tried the suggestion in comment [comment:67] and got these results for compiling `W2.hs` Vanilla: {{{ COST CENTRE MODULE SRC %time %alloc sink CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59) 55.7 15.9 SimplTopBinds SimplCore compiler/simplCore/SimplCore.hs:761:39-74 19.5 30.6 FloatOutwards SimplCore compiler/simplCore/SimplCore.hs:471:40-66 4.2 9.0 RegAlloc-linear AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55) 4.0 11.1 pprNativeCode AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65) 2.8 6.3 NewStranal SimplCore compiler/simplCore/SimplCore.hs:480:40-63 1.6 3.7 OccAnal SimplCore compiler/simplCore/SimplCore.hs:(739,22)-(740,67) 1.5 3.5 StgCmm HscMain compiler/main/HscMain.hs:(1426,13)-(1427,62) 1.2 2.4 regLiveness AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52) 1.2 1.9 genMachCode AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62) 0.9 1.8 NativeCodeGen CodeOutput compiler/main/CodeOutput.hs:171:18-78 0.9 2.1 CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 0.8 1.9 }}} `skpped` as UniqSet: {{{ COST CENTRE MODULE SRC %time %alloc SimplTopBinds SimplCore compiler/simplCore/SimplCore.hs:761:39-74 38.3 29.0 sink CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59) 13.2 20.3 RegAlloc-linear AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55) 8.3 10.5 FloatOutwards SimplCore compiler/simplCore/SimplCore.hs:471:40-66 8.1 8.5 pprNativeCode AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65) 5.4 5.9 NewStranal SimplCore compiler/simplCore/SimplCore.hs:480:40-63 3.1 3.5 OccAnal SimplCore compiler/simplCore/SimplCore.hs:(739,22)-(740,67) 2.9 3.3 StgCmm HscMain compiler/main/HscMain.hs:(1426,13)-(1427,62) 2.3 2.3 regLiveness AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52) 2.1 1.8 NativeCodeGen CodeOutput compiler/main/CodeOutput.hs:171:18-78 1.7 2.0 genMachCode AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62) 1.6 1.7 CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 1.4 1.8 foldNodesBwdOO Hoopl.Dataflow compiler/cmm/Hoopl/Dataflow.hs:(397,1)-(403,17) 1.1 0.8 }}} Good results I think. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:70 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "ghc-large-record-types-core-size.png" added. Core size for large record modules -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:69 simonpj]:
Is the large Cmm accounted for by the big Core-to-Cmm ratio, or do these examples also have Core sizes that scale badly (look at term size only)?
I'm guessing that the Core size does not scale badly. If that was so our questions would be
* Why do we generate so much Cmm per unit of Core * Can we make `sink` scale better
Both are worth working on.
https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc-large-record- types-core-size.png suggests Core size has a slight nonlinearity to it, the behavior is roughly the same for all examples, so I believe your guess that Core size doesn't scale as badly as C-- size does is justified. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:71 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "examples100.zip" added. 100-field record examples -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:70 alexbiehl]:
I tried the suggestion in comment [comment:67] and got these results for
compiling `W2.hs`
[...]
Good results I think. Though I need to check if allocation goes up
significantly in common cases. You could try running the examples in examples100.zip (https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/examples100.zip) to see whether any of them performs worse, that would be a good first indication. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:72 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "ghc-stage2-7258-deeper-SCC.prof" added. Profiling output with more SCC's added -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Profiling a bit deeper (see https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc-stage2-7258 -deeper-SCC.prof) hints at a possible nonlinearity in `tryToInline`, particularly when recursing through the `keep` function. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:73 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "ghc-large-record-types-optimized-2.png" added. Performance graph for various examples, compiling with -O, using UniqSet (as in comment #67) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Modifications as proposed in comments [comment:67] / [comment:70] appear to improve performance for the "bad" cases, and not harm the "good" example cases significantly. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:74 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): During the call Simon hypothesized that we should keep the free variables lists in mind when comparing STG. I stated that I didn't think that we currently had a way to print closures' free variable lists. I've checked and this isn't quite true: the pretty-printer will print free variable lists when `-dppr-debug` is passed. I've broken this out into its own flag in Phab:D4140. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:75 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "numTerms.png" added. Generated Core, number of terms -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "simplSize.png" added. Generated Core, total size (in LOC) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "stgSize.png" added. Generated STG, total size (in LOC) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "cmmSize.png" added. Generated C--, total size (in LOC) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "totalTime.png" added. Total execution time (CPU, seconds) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Some better data extraction code for profiles generated earlier yields the above graphs. All of these were generated using the optimization as implemented in D4145 (using a UniqSet instead of a list). Observations: - The number of terms in generated Core scales in a perfectly linear fashion for all examples. - The size of the generated Core in lines-of-code, however, scales non- linearly, which tells us that the size of individual terms grows with input size. This behavior seems to be identical between all 4 examples though, so it does not explain the extreme performance differences. - Somewhere between Core and STG, the `show` and `read` examples misbehave more badly than the other two - Between STG and C--, things get utterly dramatic, `show` and `read` explode, while `getline-appl` and `nothing` are reduced back to near- linear behavior. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:76 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): It seems that for the misbehaving examples, the offensive cost centre is Stg2Stg. As code size increases, CPU time for this cost centre grows disproportionally in the `read` and `show` examples, while it remains below 40% for the well-behaved ones. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:77 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):
the offensive cost centre is Stg2Stg
Very interesting! Keep digging :-) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:78 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow
-------------------------------------+-------------------------------------
Reporter: simonpj | Owner: simonpj
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.6.1
Resolution: | Keywords: deriving-perf
Operating System: Unknown/Multiple | Architecture:
Type of failure: Compile-time | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by alexbiehl): Replying to [comment:77 tdammers]:
It seems that for the misbehaving examples, the offensive cost centre is Stg2Stg. As code size increases, CPU time for this cost centre grows disproportionally in the `read` and `show` examples, while it remains below 40% for the well-behaved ones.
Since you don't compile the examples with profiling on the only "heavy" transformation in Stg2Stg is StgCse. You can try `-fno-stg-cse` to see if it gets better. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:80 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Very curious. I see in comment:79 {{{ Before total time = 26.00 secs (25996 ticks @ 1000 us, 1 processor) total alloc = 14,921,627,912 bytes (excludes profiling overheads) After total time = 13.31 secs (13307 ticks @ 1000 us, 1 processor) total alloc = 15,772,184,488 bytes (excludes profiling overheads) }}} I don't think I have ever before seen a program that actually allocates more (15G instead of 14G) and yet runs in half the time. Usually allocation and runtime are more or less correlated. But apparently not in this case. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:81 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Stg2Stg turns out to b a dead end. The performance loss I observed there is entirely due to formatting and dumping about 1.3 GiB worth of STG when compiling with -ddump-stg; when compiling without -ddump-stg, `Stg2Stg` doesn't show up in the profile at all (0.0 on everything). Re-running the profiling script without dumps shows different culprits: `tryToInline`, and `simpl_bind`. E.g. 400-field `read`: {{{ Fri Nov 3 08:43 2017 Time and Allocation Profiling Report (Final) ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well- typed/devel/ghc/inplace/lib -B/home/tobias/well- typed/devel/ghc/inplace/lib -O -fforce-recomp -c generated/t-400-read.hs total time = 59.05 secs (59049 ticks @ 1000 us, 1 processor) total alloc = 75,366,446,544 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc simpl_bind Simplify compiler/simplCore/Simplify.hs:141:83-101 29.4 22.0 tryToInline CmmSink compiler/cmm/CmmSink.hs:410:3-41 24.4 32.4 RegAlloc-linear AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55) 10.6 14.1 FloatOutwards SimplCore compiler/simplCore/SimplCore.hs:471:40-66 8.8 6.9 pprNativeCode AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65) 4.0 4.7 OccAnal SimplCore compiler/simplCore/SimplCore.hs:(739,22)-(740,67) 2.3 2.2 StgCmm HscMain compiler/main/HscMain.hs:(1426,13)-(1427,62) 2.1 1.8 regLiveness AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52) 1.9 1.3 NativeCodeGen CodeOutput compiler/main/CodeOutput.hs:171:18-78 1.5 1.6 genMachCode AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62) 1.4 1.3 CoreTidy HscMain compiler/main/HscMain.hs:1253:27-67 1.3 1.4 dmdAnal'-Case-single_prod_ctor DmdAnal compiler/stranal/DmdAnal.hs:(251,5)-(276,68) 1.3 1.5 }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:82 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "t-read-scc-time-abs.png" added. Largest profiling cost centres (in absolute terms) for the `read` example -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "t-getline-appl-scc-time-abs.png" added. Largest profiling cost centres (in absolute terms) for the `getline-appl` example -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "t-show-scc-time-abs.png" added. Largest profiling cost centres (in absolute terms) for the `show` example -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Comparing the latest 3 graphs, we can see that `tryToInline` more badly than the other cost centres, at least in the misbehaving examples, while the other top cost centres seem to scale somewhat proportionally. `simpl_binds` is the largest one in most cases, but `tryToInline` is "more nonlinear" for the `show` and `read` examples, surpassing it between 150 and 200 fields in the `show` example, and (educated guess) around 500 fields in the `read` example. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:83 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): `tryToInline` is in in `CmmSink` which we already know is working too hard. There may well be algorithmic improvements that could be done `CmmSink`. But, in addition, I would still like to understand why the STG-to-Cmm ratio climbs so sharply. If we have giant Cmm then any Cmm pass is going to take a long time. Fixing the non-linearity would be good, I agree; but generating less Cmm would also be big. Why is it so big? My guess: we are generating very deeply-nested closures that have a lot of free variables. E.g. {{{ let x = let a5 = blah q = ...K a1 a2 a3 a4 a5 ... in ... }}} Here the closure for `q` has 5 free variables (a1-a5); while that for `x` has four (a1-a4). So the code for `x` will slurp lots of variables out of x's closure only to store them again in q's. And if that is nested we'll get `O(n**2)` code size. But I'm speculating. Facts would be good. Let's measure STG size with and without counting the free vars of each closure (which are stored in the STG tree) and see. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:84 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "stgVarsSizeRatio.png" added. Free variables / other STG ratio -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Here's some tentative evidence: [[Image(https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/stgVarsSizeRatio.png)]] The 3 misbehaving examples (read, show, and to a lesser degree read-appl) show linear growth of the free-variables/other-stg ratio (which suggests roughly quadratic growth of the size of the overall STG size, assuming linear growth of the base STG size), while the well-behaved examples maintain a more-or-less constant ratio. The full profiling job (up to 400 fields) is still running, but this small sample (50, 100 and 150 fields) looks pretty convincing to me already. A quick glance at the generated STG suggests that there is indeed some deep nesting of closures going on. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:85 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "stgVarsSizeRatio.png" added. Free variables / other STG ratio -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): I've narrowed it down a bit more. This behaves nicely: {{{ module D where import Text.ParserCombinators.ReadP as ReadP import Control.Monad.State as State import Data.Char (ord) data DT = DT { field0 :: Int , field2 :: Int , field3 :: Int , field4 :: Int , field5 :: Int , field6 :: Int , field7 :: Int , field8 :: Int , field9 :: Int , field10 :: Int } getlD :: IO DT getlD = DT <$> (read <$> getLine) <*> (read <$> getLine) <*> (read <$> getLine) <*> (read <$> getLine) <*> (read <$> getLine) <*> (read <$> getLine) <*> (read <$> getLine) <*> (read <$> getLine) <*> (read <$> getLine) <*> (read <$> getLine) }}} But this doesn't: {{{ module D where import Text.ParserCombinators.ReadP as ReadP import Control.Monad.State as State import Data.Char (ord) data DT = DT { field0 :: Int , field2 :: Int , field3 :: Int , field4 :: Int , field5 :: Int , field6 :: Int , field7 :: Int , field8 :: Int , field9 :: Int , field10 :: Int } readD :: ReadP DT readD = DT <$> (ord <$> ReadP.get) <*> (ord <$> ReadP.get) <*> (ord <$> ReadP.get) <*> (ord <$> ReadP.get) <*> (ord <$> ReadP.get) <*> (ord <$> ReadP.get) <*> (ord <$> ReadP.get) <*> (ord <$> ReadP.get) <*> (ord <$> ReadP.get) <*> (ord <$> ReadP.get) }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:86 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tdammers): * Attachment "t-10-dump.zip" added. Various GHC dumps for 10-field getline-appl and read-appl examples. Read misbehaves, getLine doesn't. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Further evidence, based on the observation that the problem is caused by certain kinds of CPS-style code: {{{ module Nested where -- A data type capturing the relevant structure of the P type from ReadP data A a = A (Int -> A a) | N a -- This implementation produces deeply nested Core: f10 :: A (Int, Int, Int, Int, Int, Int, Int, Int, Int, Int) f10 = A (\i0 -> A (\i1 -> A (\i2 -> A (\i3 -> A (\i4 -> A (\i5 -> A (\i6 -> A (\i7 -> A (\i8 -> A (\i9 -> N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9) )))))))))) -- This implementation produces flat Core. g10 :: A (Int, Int, Int, Int, Int, Int, Int, Int, Int, Int) g10 = a0 where {-#NOINLINE a10#-} a10 = \(i0, i1, i2, i3, i4, i5, i6, i7, i8) i9 -> N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9) {-#NOINLINE a9#-} a9 = \(i0, i1, i2, i3, i4, i5, i6, i7) i8 -> A (a10 (i0, i1, i2, i3, i4, i5, i6, i7, i8)) {-#NOINLINE a8#-} a8 = \(i0, i1, i2, i3, i4, i5, i6) i7 -> A (a9 (i0, i1, i2, i3, i4, i5, i6, i7)) {-#NOINLINE a7#-} a7 = \(i0, i1, i2, i3, i4, i5) i6 -> A (a8 (i0, i1, i2, i3, i4, i5, i6)) {-#NOINLINE a6#-} a6 = \(i0, i1, i2, i3, i4) i5 -> A (a7 (i0, i1, i2, i3, i4, i5)) {-#NOINLINE a5#-} a5 = \(i0, i1, i2, i3) i4 -> A (a6 (i0, i1, i2, i3, i4)) {-#NOINLINE a4#-} a4 = \(i0, i1, i2) i3 -> A (a5 (i0, i1, i2, i3)) {-#NOINLINE a3#-} a3 = \(i0, i1) i2 -> A (a4 (i0, i1, i2)) {-#NOINLINE a2#-} a2 = \i0 i1 -> A (a3 (i0, i1)) {-#NOINLINE a1#-} a1 = \i0 -> A (a2 i0) {-#NOINLINE a0#-} a0 = A a1 }}} This module tries to provoke "tupling up the free variables" in `g10`, and compare it to the naive implementation `f10`. I do not know yet whether this would actually produce any performance benefits though. The NOINLINE pragmas are needed by the way, otherwise GHC decides to inline the `a` functions, and we end up with a mess similar to `f10`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:87 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Hmm. The `Cmm` for `g10` is actually a lot bigger than that for `f10`, and looked to me as if it too would scale quadratically. Are `f10` and `g10` accurate reflections of the "scale well" and "scale badly" versions of Read? I can see why things might scale badly, if we have a lot of {{{ p1 >>= \x -> p2 >>= \y -> p3 >>= \z -> return (K x y z) }}} And `f10` has that structure, albeit with data constructors rather than functions. BUT: if that's it, why doesn't `(>>=)` inline, so we get a case expression instead?? So if `f10` is reasonably accurate, I can see two ways forward: 1 Do something in the back end to generate less code. 2 Generate different code. I'm still intrigued (and a bit disbelieving) why a different exposition of Read generates code that scales better. Concerning (1) the kind of thing I had in mind was more like this {{{ f10 = A (\i0 -> A (\i1 -> A (\i2 -> A (\i3 -> A (\i4 -> let t = (i0,i1,i2,i3,i4) in A (\i5 -> A (\i6 -> A (\i7 -> A (\i8 -> A (\i9 -> case t of (i0,i1,i2,i3,i4) -> N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9) }}} Now no function closure gets more than 5 free variables. I imagine one could do this as a STG-to-STG transformation if we decided that this was really the core issue. A tantalising point is this. We have something like this: {{{ let f4 = [i0,i1,i2,i3] \i4 -> let f5 = [i0,i1,i2,i3, i4] \i5 -> ...blah... }}} The list before the `\` are the free vars of the closure. Inside the code for `f4` we have `f4`'s function closure in hand, with `i0...` in it. Rather than capturing `i0, i1...` as the free vars of f5, we could just store a pointer to `f4`'s function closure (it's a kind of tuple, after all) and fetch its free vars from there. It's as if we wanted to say {{{ let f4 = [i0,i1,i2,i3] \i4 -> let f5 = [f4, i4] \i5 -> ...blah... }}} with `f4` and `i4` being the free vars of `f5`'s closure. This would be a deeper change to the code generator, but it could make a lot of sense in cases where ''all'' of `f4`'s free variables are also free in `f5`. (If not all where, you might get a space leak by closing over `f4`.) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:88 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Replying to [comment:88 simonpj]:
Hmm. The `Cmm` for `g10` is actually a lot bigger than that for `f10`, and looked to me as if it too would scale quadratically.
It would, but for different reasons, and I believe I can rewrite it to not scale quadratically (but haven't gotten around to it yet).
Are `f10` and `g10` accurate reflections of the "scale well" and "scale badly" versions of Read?
No; we don't actually have a "scale well" version of `Read` yet. All the implementations of `Read` that I have produced so far are in the "scale badly" category, although the degree to which they scale badly differs somewhat. We have two kinds of "scale well" examples: 1. Trivial ones, where rather than nest code such that chains of closures appear, we just call `read` or something else N times, but in benign ways; these examples do not have much of an explanatory value, and I merely included them to establish a performance baseline 2. Well-behaved nested / chained examples, such as the `getline-appl` one; however, none of these do anything equivalent to `read`.
I can see why things might scale badly, if we have a lot of {{{ p1 >>= \x -> p2 >>= \y -> p3 >>= \z -> return (K x y z) }}} And `f10` has that structure, albeit with data constructors rather than
functions. BUT: if that's it, why doesn't `(>>=)` inline, so we get a case expression instead?? I believe the `(>>=)` does in fact inline; `f10` is the kind of shape you get when you inline the `(>>=)` from `ReadP`. `(>>=)` from other monadic types does not introduce this kind of shape, which would explain why `Read` misbehaves but `getLine` over `IO` does not.
So if `f10` is reasonably accurate, I can see two ways forward:
1 Do something in the back end to generate less code.
2 Generate different code. I'm still intrigued (and a bit disbelieving)
why a different exposition of Read generates code that scales better. See above; `Read` always misbehaves here, so your disbelief is entirely justified. The bad behavior seems to be related to the actual implementation of `(>>=)`, not so much whether the `(>>=)` gets inlined. Or, more accurately, inlining `(>>=)` actually *is* the problem; if we didn't inline it, then `read-appl` would compile to something equivalent to the generic applicative example, which behaves nicely.
Concerning (1) the kind of thing I had in mind was more like this {{{ f10 = A (\i0 -> A (\i1 -> A (\i2 -> A (\i3 -> A (\i4 -> let t = (i0,i1,i2,i3,i4) in A (\i5 -> A (\i6 -> A (\i7 -> A (\i8 -> A (\i9 -> case t of (i0,i1,i2,i3,i4) -> N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9) }}} Now no function closure gets more than 5 free variables. I imagine one could do this as a STG-to-STG transformation if we decided that this was really the core issue.
Further experimentation shows that this doesn't actually achieve anything; the resulting Core is *exactly the same*, and if you put both functions into the same compilation unit, they get optimized into expressing one in terms of the other (I'm guessing here that CSE is smart enough to figure out that they are equivalent). I have written 3 versions of `f10`, one as- is, one (`g10`) using the suggested 5-tuple optimization, and one (`h10`) using a nested-tuple optimization, and I end up with Core that has literally these lines: {{{ f10 = A f1 g10 = f10 h10 = f10 }}}
A tantalising point is this. We have something like this: {{{ let f4 = [i0,i1,i2,i3] \i4 -> let f5 = [i0,i1,i2,i3, i4] \i5 -> ...blah... }}} The list before the `\` are the free vars of the closure.
Inside the code for `f4` we have `f4`'s function closure in hand, with `i0...` in it. Rather than capturing `i0, i1...` as the free vars of f5, we could just store a pointer to `f4`'s function closure (it's a kind of tuple, after all) and fetch its free vars from there. It's as if we wanted to say {{{ let f4 = [i0,i1,i2,i3] \i4 -> let f5 = [f4, i4] \i5 -> ...blah... }}} with `f4` and `i4` being the free vars of `f5`'s closure. This would be a deeper change to the code generator, but it could make a lot of sense in cases where ''all'' of `f4`'s free variables are also free in `f5`. (If not all where, you might get a space leak by closing over `f4`.)
-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:89 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Possibly related: https://github.com/alexbiehl/ghc/commit/f84137a36602a3e5997403283976efdca82f... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:90 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Indeed, alexbiehl is on the way to implementing the "tantalising point" of comment:89. In a phone call today we agreed * That sharing the parent closure is a good thing to try; and not too hard. * That we should start * A ticket to track progress on it * A wiki page to explain how it all works -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:91 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Progress made so far: - Reduced algorithmic complexity in register spilling ([comment:33]) - Minor improvements in the pretty-based code writer ([comment:34]) - Refactored generated `Read` code (comment [comment:37] / ticket [ticket:14364]) - Observed that `Read` instances generate long chains of monadic binds over a monad that introduces nested closures when binding (`ReadP`, specifically); and that deeply nested closures in general cause the register allocator to repeatedly unpack and re-pack free variables, which leads to quadratic behavior. The fix for this has been split off into [ticket:14461]. Things left to do: - Replace `pretty` with something that doesn't share its nonlinearities ([ticket:14005]) - Once [ticket:14461] is done, it may be worth digging further; right now, the nested-closures nonlinearity makes it difficult to isolate other bottlenecks though. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:92 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Great summary, thanks. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:93 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC