
#7258: Compiling DynFlags is jolly slow -------------------------------------+------------------------------------- Reporter: simonpj | Owner: simonpj Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.6.1 Resolution: | Keywords: deriving-perf Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by tdammers): Further evidence, based on the observation that the problem is caused by certain kinds of CPS-style code: {{{ module Nested where -- A data type capturing the relevant structure of the P type from ReadP data A a = A (Int -> A a) | N a -- This implementation produces deeply nested Core: f10 :: A (Int, Int, Int, Int, Int, Int, Int, Int, Int, Int) f10 = A (\i0 -> A (\i1 -> A (\i2 -> A (\i3 -> A (\i4 -> A (\i5 -> A (\i6 -> A (\i7 -> A (\i8 -> A (\i9 -> N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9) )))))))))) -- This implementation produces flat Core. g10 :: A (Int, Int, Int, Int, Int, Int, Int, Int, Int, Int) g10 = a0 where {-#NOINLINE a10#-} a10 = \(i0, i1, i2, i3, i4, i5, i6, i7, i8) i9 -> N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9) {-#NOINLINE a9#-} a9 = \(i0, i1, i2, i3, i4, i5, i6, i7) i8 -> A (a10 (i0, i1, i2, i3, i4, i5, i6, i7, i8)) {-#NOINLINE a8#-} a8 = \(i0, i1, i2, i3, i4, i5, i6) i7 -> A (a9 (i0, i1, i2, i3, i4, i5, i6, i7)) {-#NOINLINE a7#-} a7 = \(i0, i1, i2, i3, i4, i5) i6 -> A (a8 (i0, i1, i2, i3, i4, i5, i6)) {-#NOINLINE a6#-} a6 = \(i0, i1, i2, i3, i4) i5 -> A (a7 (i0, i1, i2, i3, i4, i5)) {-#NOINLINE a5#-} a5 = \(i0, i1, i2, i3) i4 -> A (a6 (i0, i1, i2, i3, i4)) {-#NOINLINE a4#-} a4 = \(i0, i1, i2) i3 -> A (a5 (i0, i1, i2, i3)) {-#NOINLINE a3#-} a3 = \(i0, i1) i2 -> A (a4 (i0, i1, i2)) {-#NOINLINE a2#-} a2 = \i0 i1 -> A (a3 (i0, i1)) {-#NOINLINE a1#-} a1 = \i0 -> A (a2 i0) {-#NOINLINE a0#-} a0 = A a1 }}} This module tries to provoke "tupling up the free variables" in `g10`, and compare it to the naive implementation `f10`. I do not know yet whether this would actually produce any performance benefits though. The NOINLINE pragmas are needed by the way, otherwise GHC decides to inline the `a` functions, and we end up with a mess similar to `f10`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/7258#comment:87 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler