[GHC] #8199: Get rid of HEAP_ALLOCED

#8199: Get rid of HEAP_ALLOCED ----------------------------------------------+---------------------------- Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.7 Keywords: | Operating System: Architecture: Unknown/Multiple | Unknown/Multiple Difficulty: Project (more than a week) | Type of failure: Blocked By: | None/Unknown Related Tickets: | Test Case: | Blocking: ----------------------------------------------+---------------------------- This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC, as long as we can keep the cost of indirections down. The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. The attached patchset is *almost* able to support GHC itself; a ghc-stage2 built this way can compile most files, but is much slower in some cases, and probably has some correctness bugs. Improvements possible: * Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working. * Closure tables should have their indirections short-circuited by the initialization code. * We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun. * Properly handle the Windows DLL case (e.g. SRTs); this should be simpler now that everything is an indirection. * Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead) * Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs. * Need to implement a new megablock tracking structure so we can free/check for lost blocks -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): Preliminary performance results (not all of nofib was compiling when I did these): {{{ -------------------------------------------------------------------------------- Program Size Allocs Runtime Elapsed TotalMem -------------------------------------------------------------------------------- ansi +5.1% +0.0% 0.00 0.00 +0.0% atom +5.1% +0.0% -3.5% -3.1% +0.0% bernouilli +5.1% +0.0% +1.7% +1.7% +0.0% exp3_8 +5.1% +0.0% +0.0% +0.0% +0.0% gen_regexps +5.0% +0.0% 0.00 0.00 +0.0% integrate +5.1% +0.0% 0.19 0.20 +0.0% kahan +4.9% +0.0% +0.0% -0.7% +0.0% paraffins +5.0% +0.0% 0.11 0.11 +0.0% primes +5.1% +0.0% 0.07 0.07 +0.0% queens +5.1% +0.0% 0.02 0.02 +0.0% rfib +5.1% +0.0% 0.02 0.02 +0.0% tak +5.1% +0.0% 0.02 0.02 +0.0% wheel-sieve1 +5.1% +0.0% +0.5% +0.0% +0.0% wheel-sieve2 +5.0% +0.0% -4.8% -4.8% +0.0% x2n1 +5.1% +0.0% 0.01 0.01 +0.0% -------------------------------------------------------------------------------- Min +4.9% +0.0% -4.8% -4.8% +0.0% Max +5.1% +0.0% +1.7% +1.7% +0.0% Geometric Mean +5.0% +0.0% -1.0% -1.2% +0.0% }}} Binary size bloat comes from all of the extra static indirection tables nonsense; I'm sure we can squeeze it down a bit. Some pretty good improvements in some cases. What I didn't see was any massive slowdowns like I am seeing in ghc-stage2. So I ran nofib again -dynamic: {{{ -------------------------------------------------------------------------------- Program Size Allocs Runtime Elapsed TotalMem -------------------------------------------------------------------------------- ansi +5.9% +0.0% 0.00 0.00 +0.0% atom +4.5% +0.0% -3.4% -3.8% +0.0% bernouilli +5.3% +0.0% +0.0% +0.0% +0.0% exp3_8 +5.6% +0.0% +0.0% +0.0% +0.0% gen_regexps +3.9% +0.0% 0.00 0.00 +0.0% integrate +4.6% +0.0% -2.7% -2.7% +0.0% kahan +5.1% +0.0% +8.4% +8.4% +0.0% paraffins +2.9% +0.0% 0.11 0.11 +0.0% primes +5.5% +0.0% 0.07 0.07 +0.0% queens +5.0% +0.0% 0.02 0.02 +0.0% rfib +6.0% +0.0% 0.02 0.02 +0.0% tak +5.5% +0.0% 0.01 0.01 +0.0% wheel-sieve1 +4.4% +0.0% -2.4% -2.4% +0.0% wheel-sieve2 +3.7% +0.0% -7.1% -6.3% +2.1% x2n1 +4.1% +0.0% 0.01 0.01 +0.0% -------------------------------------------------------------------------------- Min +2.9% +0.0% -7.1% -6.3% +0.0% Max +6.0% +0.0% +8.4% +8.4% +2.1% Geometric Mean +4.8% +0.0% -1.1% -1.1% +0.1% }}} kahan slows down, but not by much; certainly not the order of magnitude I was seeing in ghc-stage2. I still don't really know what's up, so perhaps the massive ghc-stage2 slowdown is a correctness bug somewhere, or perhaps pessimal behavior when a lot of modules are loaded. Ugh! The worst kind of bug. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): ghc-stage2 "slow down" was actually a spinlock deadlock due to me excising too much code from the old MBlock implementation. As a bonus, the last bullet point is fixed. Version 2: now with less deadlock! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Description changed by ezyang: Old description:
This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC, as long as we can keep the cost of indirections down.
The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. The attached patchset is *almost* able to support GHC itself; a ghc-stage2 built this way can compile most files, but is much slower in some cases, and probably has some correctness bugs.
Improvements possible:
* Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working.
* Closure tables should have their indirections short-circuited by the initialization code.
* We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun.
* Properly handle the Windows DLL case (e.g. SRTs); this should be simpler now that everything is an indirection.
* Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead)
* Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.
* Need to implement a new megablock tracking structure so we can free/check for lost blocks
New description: This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC, as long as we can keep the cost of indirections down. The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. The attached patchset is *almost* able to support GHC itself; a ghc-stage2 built this way can compile most files, but is much slower in some cases, and probably has some correctness bugs. Improvements possible: * Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working. * Closure tables should have their indirections short-circuited by the initialization code. * We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun. * Properly handle the Windows DLL case (e.g. SRTs); this should be simpler now that everything is an indirection. * Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead) * Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs. * ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Description changed by ezyang: Old description:
This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC, as long as we can keep the cost of indirections down.
The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. The attached patchset is *almost* able to support GHC itself; a ghc-stage2 built this way can compile most files, but is much slower in some cases, and probably has some correctness bugs.
Improvements possible:
* Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working.
* Closure tables should have their indirections short-circuited by the initialization code.
* We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun.
* Properly handle the Windows DLL case (e.g. SRTs); this should be simpler now that everything is an indirection.
* Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead)
* Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.
* ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures.
New description: This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC, as long as we can keep the cost of indirections down. The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct. Improvements possible: * Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working. * Closure tables should have their indirections short-circuited by the initialization code. * We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun. * Properly handle the Windows DLL case (e.g. SRTs); this should be simpler now that everything is an indirection. * Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead) * Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs. * ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Description changed by ezyang: Old description:
This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC, as long as we can keep the cost of indirections down.
The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct.
Improvements possible:
* Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working.
* Closure tables should have their indirections short-circuited by the initialization code.
* We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun.
* Properly handle the Windows DLL case (e.g. SRTs); this should be simpler now that everything is an indirection.
* Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead)
* Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.
* ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures.
New description: This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC, as long as we can keep the cost of indirections down. The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct. Improvements possible: * Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working. * Closure tables should have their indirections short-circuited by the initialization code. * We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun. * Properly handle the Windows DLL case (e.g. SRTs); this should be simpler now that everything is an indirection. * Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead) * ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves. * ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 2841 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Changes (by ezyang): * blockedby: => 2841 Old description:
This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC, as long as we can keep the cost of indirections down.
The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct.
Improvements possible:
* Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working.
* Closure tables should have their indirections short-circuited by the initialization code.
* We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun.
* Properly handle the Windows DLL case (e.g. SRTs); this should be simpler now that everything is an indirection.
* Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead)
* ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves.
* ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures.
New description: This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC, as long as we can keep the cost of indirections down. The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct. Improvements possible: * Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working. * Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable). * We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun. * Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out. * Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead) * ~~Make it work for GHCi linking of static objects.~~ Blocked on #2841, I have it working for ELF, and can make it work for other platforms as soon as I get relevant machines. * ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves. * ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 2841 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): Full perf numbers on latest head, including the outliers (in both directions). {{{ binary-trees +5.2% +0.0% +6.4% +6.8% +0.0% k-nucleotide +5.2% +0.0% +3.7% +3.7% +0.0% integer +5.1% +0.0% -8.2% -8.1% +0.0% ----------------------------------------------------------------------- Min +4.9% +0.0% -8.2% -8.1% -12.5% Max +6.0% +0.0% +6.4% +6.8% +100.0% Geometric Mean +5.1% +0.0% -1.2% -1.1% +6.2% }}} Overall, most programs saw a slight increase in runtime (in the 0-0.2 range), while a number of programs saw a 1-4% reduction in runtime. I wonder if the slight increase in runtime is primarily due to the cost of initializing the indirections. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 2841 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Description changed by ezyang: Old description:
This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC, as long as we can keep the cost of indirections down.
The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct.
Improvements possible:
* Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working.
* Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable).
* We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun.
* Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out.
* Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead)
* ~~Make it work for GHCi linking of static objects.~~ Blocked on #2841, I have it working for ELF, and can make it work for other platforms as soon as I get relevant machines.
* ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves.
* ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures.
New description: This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC, as long as we can keep the cost of indirections down. The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct. Improvements possible: * Save tags inside the indirection tables, so that we don't spend instructions retagging after the following the indirection. I have a patch in progress for this. * Make sure the extra indirection is not causing any C-- optimizations to stop firing (it might be, because I put it in as a literal CmmLoad) * Once an static thunk is updated, we should tag the indirection to let other code segments to know about the good news. One way to do this is have the update frame for a static indirection should have a reference to the *indirection*, not the closure itself. However, this scheme will not affect other static closures which have references to the thunk. * Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working. * Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable). * We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun. * Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out. * Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead) * This patch introduces a lot of new symbols; ensure we are not unduly polluting the symbol tables. (In particular, I think _static_closure symbols can be made hidden). * ~~Make it work for GHCi linking of static objects.~~ Blocked on #2841, I have it working for ELF, and can make it work for other platforms as soon as I get relevant machines. * ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves. * ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures. Speculative improvements: * Now that static lives in a block, can we GC them like we GC normal data? This would be beneficial for static thunks, which now can have their indirections completely removed; reverting CAFs may be somewhat tricky, however. * Now that HEAP_ALLOCED is greatly simplified, can we further simply some aspects of the GC? At the very least, we ought to be able to make megablock allocation cheaper, by figuring out how to remove the spinlocks, etc. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 2841 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Description changed by ezyang: Old description:
This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC, as long as we can keep the cost of indirections down.
The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct.
Improvements possible:
* Save tags inside the indirection tables, so that we don't spend instructions retagging after the following the indirection. I have a patch in progress for this.
* Make sure the extra indirection is not causing any C-- optimizations to stop firing (it might be, because I put it in as a literal CmmLoad)
* Once an static thunk is updated, we should tag the indirection to let other code segments to know about the good news. One way to do this is have the update frame for a static indirection should have a reference to the *indirection*, not the closure itself. However, this scheme will not affect other static closures which have references to the thunk.
* Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working.
* Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable).
* We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun.
* Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out.
* Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead)
* This patch introduces a lot of new symbols; ensure we are not unduly polluting the symbol tables. (In particular, I think _static_closure symbols can be made hidden).
* ~~Make it work for GHCi linking of static objects.~~ Blocked on #2841, I have it working for ELF, and can make it work for other platforms as soon as I get relevant machines.
* ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves.
* ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures.
Speculative improvements:
* Now that static lives in a block, can we GC them like we GC normal data? This would be beneficial for static thunks, which now can have their indirections completely removed; reverting CAFs may be somewhat tricky, however.
* Now that HEAP_ALLOCED is greatly simplified, can we further simply some aspects of the GC? At the very least, we ought to be able to make megablock allocation cheaper, by figuring out how to remove the spinlocks, etc.
New description: This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC (especially for large/scattered heaps), as long as we can keep the cost of indirections down. The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct. Blocking problems: * Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out. * ~~Make it work for GHCi linking of static objects.~~ Blocked on #2841, I have it working for ELF, and can make it work for other platforms as soon as I get relevant machines. * Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead) Performance improvements possible: * This patch introduces a lot of new symbols; ensure we are not unduly polluting the symbol tables. (In particular, I think _static_closure symbols can be made hidden). * Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working. * The old HEAP_ALLOCED is supposed to be pessimal on very large heaps. Do some performance tests under those workloads. * Make sure the extra indirection is not causing any C-- optimizations to stop firing (it might be, because I put it in as a literal CmmLoad) * Once an static thunk is updated, we can tag the indirection to let other code segments to know about the good news. One way to do this is have the update frame for a static indirection should have a reference to the *indirection*, not the closure itself. However, this scheme will not affect other static closures which have references to the thunk. * Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable). * We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun. * ~~Save tags inside the indirection tables, so that we don't spend instructions retagging after the following the indirection.~~ Done. * ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves. * ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures. Speculative improvements: * Now that static lives in a block, can we GC them like we GC normal data? This would be beneficial for static thunks, which now can have their indirections completely removed; reverting CAFs may be somewhat tricky, however. * Now that HEAP_ALLOCED is greatly simplified, can we further simply some aspects of the GC? At the very least, we ought to be able to make megablock allocation cheaper, by figuring out how to remove the spinlocks, etc. * Another possibility is to adopt a hybrid approach, where we manually lay out closures when compiling statically, and indirect when compiling dynamically. In some sense, this gets the best of both worlds, since we expect to not pay any extra cost for indirection due to PIC. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 2841 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): Caching tags improves binary sizes and performance improvement a little bit. Once again I've included some outliers. {{{ binary-trees +4.9% +0.0% +2.4% +3.1% +0.0% cryptarithm1 +4.8% +0.0% -7.0% -7.0% +100.0% integer +4.9% +0.0% -8.2% -8.4% +0.0% k-nucleotide +4.9% +0.0% -9.5% -9.4% +0.0% typecheck +4.9% +0.0% +8.0% +8.0% +100.0% ----------------------------------------------------------------------- Min +4.6% +0.0% -9.5% -9.4% -1.1% Max +5.8% +0.0% +8.0% +8.0% +100.0% Geometric Mean +4.9% -0.0% -2.0% -2.1% +6.4% }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 2841 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Description changed by ezyang: Old description:
This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC (especially for large/scattered heaps), as long as we can keep the cost of indirections down.
The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct.
Blocking problems:
* Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out.
* ~~Make it work for GHCi linking of static objects.~~ Blocked on #2841, I have it working for ELF, and can make it work for other platforms as soon as I get relevant machines.
* Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead)
Performance improvements possible:
* This patch introduces a lot of new symbols; ensure we are not unduly polluting the symbol tables. (In particular, I think _static_closure symbols can be made hidden).
* Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working.
* The old HEAP_ALLOCED is supposed to be pessimal on very large heaps. Do some performance tests under those workloads.
* Make sure the extra indirection is not causing any C-- optimizations to stop firing (it might be, because I put it in as a literal CmmLoad)
* Once an static thunk is updated, we can tag the indirection to let other code segments to know about the good news. One way to do this is have the update frame for a static indirection should have a reference to the *indirection*, not the closure itself. However, this scheme will not affect other static closures which have references to the thunk.
* Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable).
* We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun.
* ~~Save tags inside the indirection tables, so that we don't spend instructions retagging after the following the indirection.~~ Done.
* ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves.
* ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures.
Speculative improvements:
* Now that static lives in a block, can we GC them like we GC normal data? This would be beneficial for static thunks, which now can have their indirections completely removed; reverting CAFs may be somewhat tricky, however.
* Now that HEAP_ALLOCED is greatly simplified, can we further simply some aspects of the GC? At the very least, we ought to be able to make megablock allocation cheaper, by figuring out how to remove the spinlocks, etc.
* Another possibility is to adopt a hybrid approach, where we manually lay out closures when compiling statically, and indirect when compiling dynamically. In some sense, this gets the best of both worlds, since we expect to not pay any extra cost for indirection due to PIC.
New description: This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC (especially for large/scattered heaps), as long as we can keep the cost of indirections down. The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct. Blocking problems: * Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out. * ~~Make it work for GHCi linking of static objects.~~ Blocked on #2841, I have it working for ELF, and can make it work for other platforms as soon as I get relevant machines. * Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead) Performance improvements possible: * ~~This patch introduces a lot of new symbols; ensure we are not unduly polluting the symbol tables. (In particular, I think _static_closure symbols can be made hidden).~~ I've eliminated all of these except for the init symbols, which cross the stub object and assembly file boundary, and so would need to be made invisible by the linker. * Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working. * The old HEAP_ALLOCED is supposed to be pessimal on very large heaps. Do some performance tests under those workloads. * Make sure the extra indirection is not causing any C-- optimizations to stop firing (it might be, because I put it in as a literal CmmLoad) * Once an static thunk is updated, we can tag the indirection to let other code segments to know about the good news. One way to do this is have the update frame for a static indirection should have a reference to the *indirection*, not the closure itself. However, this scheme will not affect other static closures which have references to the thunk. * Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable). * We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun. * ~~Save tags inside the indirection tables, so that we don't spend instructions retagging after the following the indirection.~~ Done. * ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves. * ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures. Speculative improvements: * Now that static lives in a block, can we GC them like we GC normal data? This would be beneficial for static thunks, which now can have their indirections completely removed; reverting CAFs may be somewhat tricky, however. * Now that HEAP_ALLOCED is greatly simplified, can we further simply some aspects of the GC? At the very least, we ought to be able to make megablock allocation cheaper, by figuring out how to remove the spinlocks, etc. * Another possibility is to adopt a hybrid approach, where we manually lay out closures when compiling statically, and indirect when compiling dynamically. In some sense, this gets the best of both worlds, since we expect to not pay any extra cost for indirection due to PIC. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 2841 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by simonpj): Why is this blocked by #2841? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 2841 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): Just for support for init_array sections in the GHCi linker. This might not actually be related to the particular issue reported in #2841, but I'm fairly confident foreign export won't work properly without it. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 2841 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Changes (by simonmar): * cc: simonmar (added) Comment: No time to review the patch yet, but this looks like great work. Have you measured how long the initialisation takes on a large binary, e.g. GHC itself? And how much does a long compilation speed up? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Changes (by ezyang): * blockedby: => 5435 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): Haven't done the GHC tests yet, but over ICFP I realized I hadn't run the GC nofib tests. They look really good: {{{ -------------------------------------------------------------------------------- Program Size Allocs Runtime Elapsed TotalMem -------------------------------------------------------------------------------- circsim +5.5% +0.0% -1.3% -1.4% +0.0% constraints +5.5% +0.0% -3.6% -3.6% +0.0% fibheaps +5.5% +0.0% -1.0% -1.0% +0.0% fulsom +5.7% +0.0% -1.1% -1.0% +3.0% gc_bench +5.5% +0.0% -1.0% -1.0% +0.0% happy +5.9% +0.0% -0.3% -0.3% -0.3% hash +5.5% +0.0% -4.7% -4.6% +0.1% lcss +5.5% +0.0% -7.2% -7.2% +0.0% mutstore1 +5.5% +0.0% -1.5% -1.5% +0.0% mutstore2 +5.5% +0.0% -3.3% -3.4% +0.0% power +5.5% +0.0% -2.4% -2.9% -1.0% spellcheck +7.4% +0.0% -6.5% -6.5% +0.0% -------------------------------------------------------------------------------- Min +5.5% +0.0% -7.2% -7.2% -1.0% Max +7.4% +0.0% -0.3% -0.3% +3.0% Geometric Mean +5.7% +0.0% -2.9% -2.9% +0.1% }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): Simon and I chatted about how to move this patchset forward last night on IRC, I'm recording the main points here for posterity: * Static closures originally did not work on Windows with shared libs. Since these are now all copied to dynamic heap, it should be possible to reinstate the Int/Char static closure optimization, among other things. * This is a good step towards supporting multiple RTSes. However, we would need to store the indirection table at an offset from BaseReg. It's not obvious to me how to go about doing that, since we still need some cooperation from the linker to manage all the offsets. I ran some quick tests on the impact of this patch on the startup time of GHC. It seems to add something like 0.02s to initialization time, which is quite reasonable. Simon's primary other concern is binary size. The primary things that are inflating binary size in this patch are as follows: * We are using movq (memory-to-register) instead of movl (constant-to- register). * If we were moving a static constructor into memory, we now need to do it in two instructions, since there are no memory-to-memory movs: {{{ - movq $s_closure+1,-24(%rbp) + movq s_static_closure_ind,%rax + movq %rax,-24(%rbp) }}} * The closures stored in the indirection tables are tagged, which means that if the code expects an untagged closure, we need an extra instruction to deal with it: {{{ - movl $Main_and2_closure,%ebx + movq Main_and2_static_closure_ind,%rbx + andq $-8,%rbx }}} * The indirections table takes up one word per static closure * The initialization code imposes a per-module cost I don't know which of these are the biggest offenders yet. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by simonmar): FYI, I think you uploaded the wrong patch for the latest patch (version 4). Points 1 and 2 are unavoidable. Point 3 is interesting - I think we could probably do better there, but we should look at the context a bit more. In many cases where we have a tagged pointer, we take the tag into account when operating on it (e.g. field index), but here it looks like we're just unconditionally zeroing the tag bits. Point 4 might be improved by overlapping the indirection word with the closure itself. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): I discovered an interesting flaw in the current design while debugging a segfault today. The problem goes like this: currently, on a per-module basis we generate an extra C file responsible for registering staticclosureinds with the initialization loop which will run at the beginning of the file. The implicit assumption being made here, of course, is that a statically linked module will be loaded atomically together, all of the object files being brought in for the ride. However, this is NOT TRUE when split objects are being used! In this case, it is quite possible for the object file containing the initialization function to *not* be loaded, while another split object which uses an indirection does get loaded. In this case, the static closure never gets initialized, and we'll get a segfault if we try to use it. How can we fix this? One possibility is, when splitting objects, to ensure if the staticclosureinds section is nonzero, then we have a reference to the initialization function somewhere else in the resulting object file, so it gets loaded in. However, it's not even clear that the starts/ends trick is going to even keep working, unless the splitter is very careful. Another possibility is to axe the initialization function, and try harder to detect the staticclosureinds section at runtime. Doing this properly in a cross-platform way is tricky (I originally tried to do it with linker scripts, but gave up when it (1) didn't work, and (2) wasn't going to work on other platforms.) I'll upload an up-to-date patchset for GHC 7.8 at some point. There's been a recent change to integer-gmp which needs to be updated for HEAP_ALLOCED. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by simonpj): Good catch Edward! You are such a star. I am not qualified to comment on solutions (at least not without doing quite a bit of homework), but I am very appreciative. Simon -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:21 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by simonmar): Simon - just to allay any confusion, this is a patchset that Edward is working on that isn't currently in GHC, so the bug that he found is not a GHC issue. Edward: we must have to solve this same problem for the initialisation sections that create !StablePtrs for foreign exports, and register cost centres for profiling. I don't remember exactly what we do there, but we probably ought to do it the same way. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:22 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): I think in both cases, the usage-pattern of the symbols makes things "just work". Maybe this should be documented. In the case of foreign exports, you will usually end up linking the object file containing the init_array, because the exported symbols are in the same object file. This means we don't setup a stable pointer if the exported symbol is never referred to in the linking process, but that's desired behavior. In the case of profiling, functions which SCC emit code using 'emitSetCCC'. This will create a reference to a cost-centre object, which is contained in the object file that also has the init_array. If there are no SCCs in a module, the registration doesn't get run, but that's also desired behavior. So, it would seem that the appropriate thing to do here is ensure that there is an appropriate dependence between object files which have static closures and the initialization code. Frankly, I'm surprised that this isn't the case already; perhaps I need to link together some object files. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:23 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): Today, I took a closer look, and actually it looks like profiling does have a hack to ensure that the constructor is loaded: {{{ -- Note [pipeline-split-init] -- If we have a stub file, it may contain constructor -- functions for initialisation of this module. We can't -- simply leave the stub as a separate object file, because it -- will never be linked in: nothing refers to it. We need to -- ensure that if we ever refer to the data in this module -- that needs initialisation, then we also pull in the -- initialisation routine. -- -- To that end, we make a DANGEROUS ASSUMPTION here: the data -- that needs to be initialised is all in the FIRST split -- object. See Note [codegen-split-init]. }}} (We probably annotate the link-step in the verbose output to this effect, as a reminder what the link step is doing). So the way to fix this problem here is to arrange for the first split object to contain the "data that needs to be initialized." In particular, all static closures have to be collected in the first split section. But this might actually be quite bad for split objects: with profiling, it is not a big deal if profiling data for a module is always loaded, since it doesn't contain any pointers to further data. But static closures contain pointers to their relevant code segments, and thus unconditionally loading the static closures could greatly increase the size of split-obj'd executables. I should probably implement this version and see how bad the bloat is, before we try to do something more complicated. Collecting this data through the compiler will be annoying, though, since it involves threading another list of SCCs through the various phases. If the static indirections table is placed in a single object file, which is a good idea, since otherwise the ordering of the various split object files which are linked together is not well-defined, then it may be very difficult to overcome this problem: the indirections table must point to the closures, and so if the indirections table is linked in, so are the rest of the objects. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:24 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): OK, here is the proposed approach. Since we won't have collected all the static closures in time to place them in the first `CmmGroup` (due to streaming), we merely have to change the convention so the *last* block contains the data that needs to be initialized. This might not be ideal for linker efficiency but it prevents a space leak. {{{ diff --git a/compiler/codeGen/StgCmm.hs b/compiler/codeGen/StgCmm.hs index e2a5a07..e9f5668 100644 --- a/compiler/codeGen/StgCmm.hs +++ b/compiler/codeGen/StgCmm.hs @@ -79,12 +79,6 @@ codeGen dflags this_mod data_tycons return a yield cmm - -- Note [codegen-split-init] the cmm_init block must come - -- FIRST. This is because when -split-objs is on we need to - -- combine this block with its initialisation routines; see - -- Note [pipeline-split-init]. - ; cg (mkModuleInit cost_centre_info this_mod hpc_info) - ; mapM_ (cg . cgTopBinding dflags) stg_binds -- Put datatype_stuff after code_stuff, because the @@ -99,6 +93,12 @@ codeGen dflags this_mod data_tycons mapM_ (cg . cgDataCon) (tyConDataCons tycon) ; mapM_ do_tycon data_tycons + + -- Note [codegen-split-init] the cmm_init block must come + -- FIRST. This is because when -split-objs is on we need to + -- combine this block with its initialisation routines; see + -- Note [pipeline-split-init]. + ; cg (mkModuleInit cost_centre_info this_mod hpc_info) } --------------------------------------------------------------- diff --git a/compiler/main/DriverPipeline.hs b/compiler/main/DriverPipeline.hs index 517ba6c..2f7b847 100644 --- a/compiler/main/DriverPipeline.hs +++ b/compiler/main/DriverPipeline.hs @@ -1313,18 +1313,18 @@ runPhase (RealPhase SplitAs) _input_fn dflags -- initialisation routine. -- -- To that end, we make a DANGEROUS ASSUMPTION here: the data - -- that needs to be initialised is all in the FIRST split + -- that needs to be initialised is all in the LAST split -- object. See Note [codegen-split-init]. PipeState{maybe_stub_o} <- getPipeState case maybe_stub_o of Nothing -> return () Just stub_o -> liftIO $ do - tmp_split_1 <- newTempName dflags osuf - let split_1 = split_obj 1 - copyFile split_1 tmp_split_1 - removeFile split_1 - joinObjectFiles dflags [tmp_split_1, stub_o] split_1 + tmp_split_n <- newTempName dflags osuf + let split_n = split_obj n + copyFile split_n tmp_split_n + removeFile split_n + joinObjectFiles dflags [tmp_split_n, stub_o] split_n -- join them into a single .o file liftIO $ joinObjectFiles dflags (map split_obj [1..n]) output_fn diff --git a/compiler/main/HscMain.hs b/compiler/main/HscMain.hs index 26aca2a..97802b4 100644 --- a/compiler/main/HscMain.hs +++ b/compiler/main/HscMain.hs @@ -1670,13 +1670,36 @@ showModuleIndex (i,n) = "[" ++ padded ++ " of " ++ n_str ++ "] " i_str = show i padded = replicate (length n_str - length i_str) ' ' ++ i_str +-- Slightly inefficient, in that it needs to lookahead in order to +-- determine if it needs to cat the closures on immediatel +deferStaticClosures :: Monad m => + Either CLabel Module + -> Stream m [GenCmmDecl CmmStatics info stmt] b + -> [GenCmmDecl CmmStatics info stmt] + -> [GenCmmDecl CmmStatics info stmt] + -> Stream m [GenCmmDecl CmmStatics info stmt] b +deferStaticClosures lbl_or_mod str prev !closures = Stream.Stream $ do + r <- Stream.runStream str + case r of + Left x -> do + let addLabel starts = mkDataLits StaticClosureInds (mkStaticClosureIndsLabel lbl_or_mod starts) [] + return (Right (addLabel True : closures ++ [addLabel False] ++ prev, Stream.Stream (return (Left x)))) + Right (next, str') -> do + let isStaticClosure (CmmData StaticClosures _) = True + isStaticClosure (CmmData StaticClosureInds _) = True + isStaticClosure _ = False + newClosures = filter isStaticClosure next + next' = filter (not . isStaticClosure) next + closures' = newClosures `seq` (closures ++ newClosures) + if null prev + then Stream.runStream (deferStaticClosures lbl_or_mod str' next' closures') + else return (Right (prev, deferStaticClosures lbl_or_mod str' next' closures')) + prepareStaticClosures :: Monad m => Either CLabel Module -> Stream m [GenCmmDecl CmmStatics info stmt] b -> ForeignStubs -> (Stream m [GenCmmDecl CmmStatics info stmt] b, ForeignStubs) prepareStaticClosures lbl_or_mod cmms0 foreign_stubs0 = - let cmms = addLabel True >> cmms0 >>= (\r -> addLabel False >> return r) - addLabel starts = - Stream.yield [mkDataLits StaticClosureInds (mkStaticClosureIndsLabel lbl_or_mod starts) []] + let cmms = deferStaticClosures lbl_or_mod cmms0 [] [] foreign_stubs = foreign_stubs0 `appendStubC` static_closure_inds_init tag = case lbl_or_mod of Left lbl -> text "cmm_" <> ppr lbl }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:25 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): As it turns out, removing HEAP_ALLOCED has a bad interaction with object unloading. The immediate failure is that the unload-object code thinks that all objects are unreferenced when checking static closures, because the static closure doesn't live in the object image but in the dynamic heap. One way to fix this is to add some extra data to all static closures saying where their indirection lives and use that to determine if a static closure is associated an unloaded object. Then when we dynamically load an object, we place all of its static closures in a block allocated specifically for that object (this would only work for GHC's linker, but as it so happens, at the moment we only support unloading static objects and not dynamic libraries); once we conclude the object is dead we free the block back to the memory manager. Here's an alternative scheme. Since the static closures no longer live in the object image, we can treat them as normal closures and just check if their info tables are keeping an object live. However, we still need to make sure that space in the static blocks ends up getting freed; the most reliable way to do that is to actually GC static closures the same we GC normal objects. (But of course, if we do that, we still need pointers to the static indirections to update them when the static object moves somewhere else.) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:26 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

T5837.comp.stderr 2>&1 =====> T6048(optasm) 1392 of 3913 [0, 6, 0] cd ./perf/compiler && '/home/hs01/ezyang/ghc-heap-alloced/inplace/bin/ghc- stage2' -fforce-recom
#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): I've been keeping an eye on the performance numbers, and this patch to the compiler results in a modest bump in some of the stats that we are testing on: {{{ =====> T3294(normal) 1380 of 3913 [0, 3, 0] cd ./perf/compiler && '/home/hs01/ezyang/ghc-heap-alloced/inplace/bin/ghc- stage2' -fforce-recom p -dno-debug-output -no-user-package-db -rtsopts -fno-ghci-history -c T3294.hs +RTS -V0 -tT32 94.comp.stats --machine-readable -RTS >T3294.comp.stderr 2>&1 max_bytes_used value is too high: Expected max_bytes_used: 43224080 +/-15% Lower bound max_bytes_used: 36740468 Upper bound max_bytes_used: 49707692 Actual max_bytes_used: 54141136 *** unexpected failure for T3294(normal) =====> T4801(normal) 1381 of 3913 [0, 4, 0] cd ./perf/compiler && '/home/hs01/ezyang/ghc-heap-alloced/inplace/bin/ghc- stage2' -fforce-recom p -dno-debug-output -no-user-package-db -rtsopts -fno-ghci-history -c T4801.hs +RTS -V0 -tT48 01.comp.stats --machine-readable -RTS -static >T4801.comp.stderr 2>&1 bytes allocated value is too high: Expected bytes allocated: 392409984 +/-10% Lower bound bytes allocated: 353168985 Upper bound bytes allocated: 431650983 Actual bytes allocated: 446748968 *** unexpected failure for T4801(normal) =====> T3064(normal) 1382 of 3913 [0, 5, 0] cd ./perf/compiler && '/home/hs01/ezyang/ghc-heap-alloced/inplace/bin/ghc- stage2' -fforce-recom p -dno-debug-output -no-user-package-db -rtsopts -fno-ghci-history -c T3064.hs +RTS -V0 -tT30 64.comp.stats --machine-readable -RTS >T3064.comp.stderr 2>&1 bytes allocated value is too high: Expected bytes allocated: 329795912 +/-5% Lower bound bytes allocated: 313306116 Upper bound bytes allocated: 346285708 Actual bytes allocated: 350480296 *** unexpected failure for T3064(normal) =====> T5837(normal) 1391 of 3913 [0, 6, 0] cd ./perf/compiler && '/home/hs01/ezyang/ghc-heap-alloced/inplace/bin/ghc- stage2' -fforce-recom p -dno-debug-output -no-user-package-db -rtsopts -fno-ghci-history -c T5837.hs -ftype-function -depth=50 +RTS -V0 -tT5837.comp.stats --machine-readable -RTS p -dno-debug-output -no-user-package-db -rtsopts -fno-ghci-history -c T6048.hs -O -fasm +RTS - V0 -tT6048.comp.stats --machine-readable -RTS >T6048.comp.stderr 2>&1 bytes allocated value is too high: Expected bytes allocated: 108578664 +/-10% Lower bound bytes allocated: 97720797 Upper bound bytes allocated: 119436531 Actual bytes allocated: 121958160 }}} The bump is understandable, because I need to do some post-processing on the C-- stream which induces some extra allocation. I think I might be able to optimize it some more but it would be nice to know if it's OK to bump these figures up. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:27 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Description changed by ezyang: Old description:
This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC (especially for large/scattered heaps), as long as we can keep the cost of indirections down.
The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct.
Blocking problems:
* Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out.
* ~~Make it work for GHCi linking of static objects.~~ Blocked on #2841, I have it working for ELF, and can make it work for other platforms as soon as I get relevant machines.
* Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead)
Performance improvements possible:
* ~~This patch introduces a lot of new symbols; ensure we are not unduly polluting the symbol tables. (In particular, I think _static_closure symbols can be made hidden).~~ I've eliminated all of these except for the init symbols, which cross the stub object and assembly file boundary, and so would need to be made invisible by the linker.
* Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working.
* The old HEAP_ALLOCED is supposed to be pessimal on very large heaps. Do some performance tests under those workloads.
* Make sure the extra indirection is not causing any C-- optimizations to stop firing (it might be, because I put it in as a literal CmmLoad)
* Once an static thunk is updated, we can tag the indirection to let other code segments to know about the good news. One way to do this is have the update frame for a static indirection should have a reference to the *indirection*, not the closure itself. However, this scheme will not affect other static closures which have references to the thunk.
* Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable).
* We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun.
* ~~Save tags inside the indirection tables, so that we don't spend instructions retagging after the following the indirection.~~ Done.
* ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves.
* ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures.
Speculative improvements:
* Now that static lives in a block, can we GC them like we GC normal data? This would be beneficial for static thunks, which now can have their indirections completely removed; reverting CAFs may be somewhat tricky, however.
* Now that HEAP_ALLOCED is greatly simplified, can we further simply some aspects of the GC? At the very least, we ought to be able to make megablock allocation cheaper, by figuring out how to remove the spinlocks, etc.
* Another possibility is to adopt a hybrid approach, where we manually lay out closures when compiling statically, and indirect when compiling dynamically. In some sense, this gets the best of both worlds, since we expect to not pay any extra cost for indirection due to PIC.
New description: This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC (especially for large/scattered heaps), as long as we can keep the cost of indirections down. The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct. Blocking problems: * Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out. * ~~Make it work for GHCi linking of static objects.~~ Blocked on #2841, I have it working for ELF, and can make it work for other platforms as soon as I get relevant machines. * Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead) * Does not work with unloadObj (see comments below) Performance improvements possible: * ~~This patch introduces a lot of new symbols; ensure we are not unduly polluting the symbol tables. (In particular, I think _static_closure symbols can be made hidden).~~ I've eliminated all of these except for the init symbols, which cross the stub object and assembly file boundary, and so would need to be made invisible by the linker. * Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working. * The old HEAP_ALLOCED is supposed to be pessimal on very large heaps. Do some performance tests under those workloads. * Make sure the extra indirection is not causing any C-- optimizations to stop firing (it might be, because I put it in as a literal CmmLoad) * Once an static thunk is updated, we can tag the indirection to let other code segments to know about the good news. One way to do this is have the update frame for a static indirection should have a reference to the *indirection*, not the closure itself. However, this scheme will not affect other static closures which have references to the thunk. * Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable). * We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun. * ~~Save tags inside the indirection tables, so that we don't spend instructions retagging after the following the indirection.~~ Done. * ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves. * ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures. Speculative improvements: * Now that static lives in a block, can we GC them like we GC normal data? This would be beneficial for static thunks, which now can have their indirections completely removed; reverting CAFs may be somewhat tricky, however. * Now that HEAP_ALLOCED is greatly simplified, can we further simply some aspects of the GC? At the very least, we ought to be able to make megablock allocation cheaper, by figuring out how to remove the spinlocks, etc. * Another possibility is to adopt a hybrid approach, where we manually lay out closures when compiling statically, and indirect when compiling dynamically. In some sense, this gets the best of both worlds, since we expect to not pay any extra cost for indirection due to PIC. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:28 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by simonmar): So we have to solve the unloading problem first, right? Also did you find any way to reduce that 5-6% binary size overhead? Before bumping the performance thresholds you should check whether they're reasonable. Compare against the unpatched compiler (sometimes it is right up against the upper limit anyway). Increasing allocation by 10% sounds like a *lot* for this change. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:29 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): Yeah, I'm going to try to fix both. As for the stats, on a perf build on GHC 7.8, I see: {{{ T3294 max_bytes_used 37702264 T4801 bytes_allocated 429277160 T3064 bytes_allocated 346679176 T6048 bytes_allocated 120594240 }}} In fact, some of these tests are presently failing validate on a clean branch for me. So it seems the only real problem here is T3294 (where we have almost doubled the max bytes used). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:30 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): I looked at the context of "The closures stored in the indirection tables are tagged, which means that if the code expects an untagged closure, we need an extra instruction to deal with it", and noticed that the only reason we were untagging them was because the GC functions stg_gc_enter_1 and stg_gc_fun require 'node' to be untagged. So clearly, we should just have specialized versions of these functions which untag their argument. Actually, we can do even better: if we're making specialized versions, we can sidestep performing the load in code, so we have the invariant that for static closures, the node is a pointer to the static indirection. (Of course, this also means that we only get the savings from the heap checks of static closure entry code.) I'm implement this and see how much we save. As for overlapping the indirections with the data, there's something we can do even before that: since the closure that lives in memory doesn't have to be runnable, we can drop any fields that could be derived later. The counterbalance is that the initialization code gets a little more complex, but we're already case-splitting on the closure type, so it can't be too much worse. I'll also implement this and see how much better we do. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:31 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Old description:
This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC (especially for large/scattered heaps), as long as we can keep the cost of indirections down.
The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct.
Blocking problems:
* Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out.
* ~~Make it work for GHCi linking of static objects.~~ Blocked on #2841, I have it working for ELF, and can make it work for other platforms as soon as I get relevant machines.
* Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead)
* Does not work with unloadObj (see comments below)
Performance improvements possible:
* ~~This patch introduces a lot of new symbols; ensure we are not unduly polluting the symbol tables. (In particular, I think _static_closure symbols can be made hidden).~~ I've eliminated all of these except for the init symbols, which cross the stub object and assembly file boundary, and so would need to be made invisible by the linker.
* Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working.
* The old HEAP_ALLOCED is supposed to be pessimal on very large heaps. Do some performance tests under those workloads.
* Make sure the extra indirection is not causing any C-- optimizations to stop firing (it might be, because I put it in as a literal CmmLoad)
* Once an static thunk is updated, we can tag the indirection to let other code segments to know about the good news. One way to do this is have the update frame for a static indirection should have a reference to the *indirection*, not the closure itself. However, this scheme will not affect other static closures which have references to the thunk.
* Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable).
* We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun.
* ~~Save tags inside the indirection tables, so that we don't spend instructions retagging after the following the indirection.~~ Done.
* ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves.
* ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures.
Speculative improvements:
* Now that static lives in a block, can we GC them like we GC normal data? This would be beneficial for static thunks, which now can have their indirections completely removed; reverting CAFs may be somewhat tricky, however.
* Now that HEAP_ALLOCED is greatly simplified, can we further simply some aspects of the GC? At the very least, we ought to be able to make megablock allocation cheaper, by figuring out how to remove the spinlocks, etc.
* Another possibility is to adopt a hybrid approach, where we manually lay out closures when compiling statically, and indirect when compiling dynamically. In some sense, this gets the best of both worlds, since we expect to not pay any extra cost for indirection due to PIC.
New description: This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC (especially for large/scattered heaps), as long as we can keep the cost of indirections down. The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct. Blocking problems: * Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out. * ~~Make it work for GHCi linking of static objects.~~ Blocked on #2841, I have it working for ELF, and can make it work for other platforms as soon as I get relevant machines. * Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead) * Does not work with unloadObj (see comments below) Performance improvements possible: * ~~This patch introduces a lot of new symbols; ensure we are not unduly polluting the symbol tables. (In particular, I think _static_closure symbols can be made hidden).~~ ~~I've eliminated all of these except for the init symbols, which cross the stub object and assembly file boundary, and so would need to be made invisible by the linker.~~ I needed to make local info tables public. * Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working. * The old HEAP_ALLOCED is supposed to be pessimal on very large heaps. Do some performance tests under those workloads. * Make sure the extra indirection is not causing any C-- optimizations to stop firing (it might be, because I put it in as a literal CmmLoad) * Once an static thunk is updated, we can tag the indirection to let other code segments to know about the good news. One way to do this is have the update frame for a static indirection should have a reference to the *indirection*, not the closure itself. However, this scheme will not affect other static closures which have references to the thunk. * Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable). * We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun. * ~~Save tags inside the indirection tables, so that we don't spend instructions retagging after the following the indirection.~~ Done. * ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves. * ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures. Speculative improvements: * Now that static lives in a block, can we GC them like we GC normal data? This would be beneficial for static thunks, which now can have their indirections completely removed; reverting CAFs may be somewhat tricky, however. * Now that HEAP_ALLOCED is greatly simplified, can we further simply some aspects of the GC? At the very least, we ought to be able to make megablock allocation cheaper, by figuring out how to remove the spinlocks, etc. * Another possibility is to adopt a hybrid approach, where we manually lay out closures when compiling statically, and indirect when compiling dynamically. In some sense, this gets the best of both worlds, since we expect to not pay any extra cost for indirection due to PIC. -- Comment (by ezyang): Great news: with overlapping/compressed static closure representation, binary size overall is now less than it used to be! The downside is that I still need to collect static closures together away from their definition, which means that I now need to export symbols local info tables (which I know we'd previously been keeping local to reduce the number of symbols). Remaining tasks: unloadObj support and T3294 performance debugging.. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:32 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): One more progress report (as before, selection of outliers as well as overall figures), after having rebased onto HEAD: {{{ bernouilli -0.5% -0.0% -9.1% -9.1% 0.0% circsim -0.5% -0.0% +4.5% +4.5% +21.7% cryptarithm1 -0.5% +0.0% +7.6% +7.6% +100.0% lcss -0.5% -0.0% -8.3% -8.3% +1.9% wheel-sieve2 -0.5% -0.0% -9.1% -9.1% +2.1% -------------------------------------------------------------------------------- Min -0.9% -12.0% -9.1% -9.1% -5.2% Max -0.4% +0.4% +7.6% +7.6% +100.0% Geometric Mean -0.5% -0.5% -1.8% -1.8% +9.0% }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:33 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by ezyang): And the GC results: {{{ -------------------------------------------------------------------------------- Program Size Allocs Runtime Elapsed TotalMem -------------------------------------------------------------------------------- circsim -0.5% -0.0% -0.9% -0.9% +4.4% constraints -0.5% -0.0% -1.9% -2.0% +0.2% fibheaps -0.5% -0.0% -2.1% -2.1% 0.0% fulsom -0.5% -0.0% -2.6% -2.6% -6.4% gc_bench -0.5% -0.0% -6.8% -6.8% -11.6% happy -0.4% -0.0% -4.0% -4.0% 0.0% hash -0.5% -0.0% -5.4% -5.4% 0.0% lcss -0.5% -0.0% -7.2% -7.2% 0.0% mutstore1 -0.5% -0.0% -3.1% -3.1% +4.5% mutstore2 -0.5% -0.0% -1.9% -1.9% +0.8% power -0.5% -0.0% -4.7% -4.7% 0.0% spellcheck -0.4% -0.0% -1.0% -0.9% +1.5% -------------------------------------------------------------------------------- Min -0.5% -0.0% -7.2% -7.2% -11.6% Max -0.4% -0.0% -0.9% -0.9% +4.5% Geometric Mean -0.5% -0.0% -3.5% -3.5% -0.6% }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:34 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED ----------------------------+---------------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: | Keywords: Compiler | Architecture: Unknown/Multiple Resolution: | Difficulty: Project (more than a week) Operating System: | Blocked By: 5435 Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | ----------------------------+---------------------------------------------- Comment (by simonmar): Wow, I'm amazed you managed to get binary sizes to be smaller than before. So the only concerns now are those few strange looking results: cryptarithm, circsim, and I notice that spellcheck is not as good as before (runtime -6.4% before, -1.0% now). Also do you have any idea what is driving the increase in total memory (+100% for cryptarithm, +21% for circsim)? If startup time has increased, it would be a good idea to check on the startup time of GHC itself again, to give us an idea of the impact for a large program. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:35 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED -------------------------------------+------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: Compiler | Keywords: Resolution: | Architecture: Unknown/Multiple Operating System: | Difficulty: Project (more Unknown/Multiple | than a week) Type of failure: | Blocked By: 5435 None/Unknown | Related Tickets: Test Case: | Blocking: | Differential Revisions: D207 | -------------------------------------+------------------------------------- Changes (by ezyang): * differential: => D207 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:36 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED -------------------------------------+------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: Compiler | Keywords: Resolution: | Architecture: Unknown/Multiple Operating System: | Difficulty: Project (more Unknown/Multiple | than a week) Type of failure: | Blocked By: 5435 None/Unknown | Related Tickets: Test Case: | Blocking: | Differential Revisions: D207 | -------------------------------------+------------------------------------- Comment (by ezyang): So, here's the status: I rebased the patches onto HEAD, and took a close look at the oscillating results. As it turns out, at least for cryptarithm, the nofib test had bimodal output behavior, so depending on your luck, you'd end up with even changes, or an improvement or a regression. (I think this may be what's happening for the inconsistent GC results; more investigation necessary.) The increase in total memory for cryptarithm1 and circsim stems from the fact that the binaries don't use much memory to begin with, so the allocation of static closures is causing us to allocate a new megablock and, e.g., doubling it in the case cryptarithm. The startup time for GHC is still 0.020, so same as before. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:37 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED
-------------------------------------+-------------------------------------
Reporter: ezyang | Owner: ezyang
Type: feature | Status: new
request | Milestone: 7.10.1
Priority: normal | Version: 7.7
Component: Compiler | Keywords:
Resolution: | Architecture: Unknown/Multiple
Operating System: | Difficulty: Project (more
Unknown/Multiple | than a week)
Type of failure: | Blocked By: 5435
None/Unknown | Related Tickets:
Test Case: |
Blocking: |
Differential Revisions: D207 |
-------------------------------------+-------------------------------------
Comment (by Edward Z. Yang

#8199: Get rid of HEAP_ALLOCED
-------------------------------------+-------------------------------------
Reporter: ezyang | Owner: ezyang
Type: feature | Status: new
request | Milestone: 7.10.1
Priority: normal | Version: 7.7
Component: Compiler | Keywords:
Resolution: | Architecture: Unknown/Multiple
Operating System: | Difficulty: Project (more
Unknown/Multiple | than a week)
Type of failure: | Blocked By: 5435
None/Unknown | Related Tickets:
Test Case: |
Blocking: |
Differential Revisions: D207 |
-------------------------------------+-------------------------------------
Comment (by Edward Z. Yang

#8199: Get rid of HEAP_ALLOCED
-------------------------------------+-------------------------------------
Reporter: ezyang | Owner: ezyang
Type: feature | Status: new
request | Milestone: 7.10.1
Priority: normal | Version: 7.7
Component: Compiler | Keywords:
Resolution: | Architecture: Unknown/Multiple
Operating System: | Difficulty: Project (more
Unknown/Multiple | than a week)
Type of failure: | Blocked By: 5435
None/Unknown | Related Tickets:
Test Case: |
Blocking: |
Differential Revisions: D207 |
-------------------------------------+-------------------------------------
Comment (by Edward Z. Yang

#8199: Get rid of HEAP_ALLOCED
-------------------------------------+-------------------------------------
Reporter: ezyang | Owner: ezyang
Type: feature | Status: new
request | Milestone: 7.10.1
Priority: normal | Version: 7.7
Component: Compiler | Keywords:
Resolution: | Architecture: Unknown/Multiple
Operating System: | Difficulty: Project (more
Unknown/Multiple | than a week)
Type of failure: | Blocked By: 5435
None/Unknown | Related Tickets:
Test Case: |
Blocking: |
Differential Revisions: D207 |
-------------------------------------+-------------------------------------
Comment (by Edward Z. Yang

#8199: Get rid of HEAP_ALLOCED
-------------------------------------+-------------------------------------
Reporter: ezyang | Owner: ezyang
Type: feature | Status: new
request | Milestone: 7.10.1
Priority: normal | Version: 7.7
Component: Compiler | Keywords:
Resolution: | Architecture: Unknown/Multiple
Operating System: | Difficulty: Project (more
Unknown/Multiple | than a week)
Type of failure: | Blocked By: 5435
None/Unknown | Related Tickets:
Test Case: |
Blocking: |
Differential Revisions: D207 |
-------------------------------------+-------------------------------------
Comment (by Edward Z. Yang

#8199: Get rid of HEAP_ALLOCED -------------------------------------+------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: Compiler | Keywords: Resolution: | Architecture: Unknown/Multiple Operating System: | Difficulty: Project (more Unknown/Multiple | than a week) Type of failure: | Blocked By: 5435 None/Unknown | Related Tickets: Test Case: | Blocking: | Differential Revisions: D207 | -------------------------------------+------------------------------------- Comment (by ezyang): We backed out these patches, see #9706 for the new plan. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:43 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED -------------------------------------+------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: Compiler | Keywords: Resolution: | Architecture: Unknown/Multiple Operating System: | Difficulty: Project (more Unknown/Multiple | than a week) Type of failure: | Blocked By: 5435 None/Unknown | Related Tickets: Test Case: | Blocking: | Differential Revisions: D207 | -------------------------------------+------------------------------------- Comment (by ezyang): We backed out these patches, see #9706 for the new plan. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:44 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED
-------------------------------------+-------------------------------------
Reporter: ezyang | Owner: ezyang
Type: feature | Status: new
request | Milestone: 7.10.1
Priority: normal | Version: 7.7
Component: Compiler | Keywords:
Resolution: | Architecture: Unknown/Multiple
Operating System: | Difficulty: Project (more
Unknown/Multiple | than a week)
Type of failure: | Blocked By: 5435
None/Unknown | Related Tickets:
Test Case: |
Blocking: |
Differential Revisions: D207 |
-------------------------------------+-------------------------------------
Comment (by ajp):
Method 1 from the attached task may actually be made to work (at least on
Linux). The key point is that we don't have to use the memory address that
we happened to be given at startup. We can shift it to an aligned location
using MREMAP, and simply include a mb of padding at the end to avoid
trashing anything. Here's a modification of the code in the attached task
that demostrates shifting the memory address with zero copies.
{{{
┌─[anders gurney] [~/devel/alignment]
└─[139] cat test.S
.section foo,"aw"
.p2align 20
.global foo_start
.global foo_end
foo_start:
.ascii "A"
.fill 1048576
foo_end:
.ascii "B"
┌─[anders gurney] [~/devel/alignment]
└─[0] cat main.c
#define _GNU_SOURCE
#include

#8199: Get rid of HEAP_ALLOCED -------------------------------------+------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.7 Component: Compiler | Keywords: Resolution: | Architecture: Unknown/Multiple Operating System: | Difficulty: Project (more Unknown/Multiple | than a week) Type of failure: | Blocked By: 5435 None/Unknown | Related Tickets: Test Case: | Blocking: | Differential Revisions: D207 | -------------------------------------+------------------------------------- Comment (by simonmar): @ajp the problem is that this data is statically linked, not mmapped. If we move it somewhere else, all the references within it will be wrong. That's what method 2 was doing: relocating all the static closures elsewhere and fixing up all the pointers. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:46 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED -------------------------------------+------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: 5435 | Blocking: Related Tickets: | Differential Revisions: D207 -------------------------------------+------------------------------------- Comment (by rwbarton): Is this ticket now obsolete since #9706 is done? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:48 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED -------------------------------------+------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: 5435 | Blocking: Related Tickets: | Differential Revisions: D207 -------------------------------------+------------------------------------- Comment (by ezyang): Not quite, because the #9706 technique doesn't work for Windows. But it seems unlikely we'll want to deploy this big patch just to make Windows work. So... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:49 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED -------------------------------------+------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: 5435 | Blocking: Related Tickets: | Differential Revisions: D207 -------------------------------------+------------------------------------- Comment (by simonmar): I wonder whether we might try an alternative strategy on Windows: a bitmap covering the whole 48-bit address space (units of 1MB) is only 32MB, and most of it won't be touched in the common case. The remaining question is whether checking this bitmap can be made as cheap as the current cache. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:50 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED -------------------------------------+------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: 5435 | Blocking: Related Tickets: | Differential Revisions: D207 -------------------------------------+------------------------------------- Comment (by ezyang): Reproducing some comments from the Phabricator diff for posterity. #9706 didn't work on Windows because Windows counted page table memory as part of committed memory. Go has the same problem: {{{ // Number of bits in page to span calculations (4k pages). // On Windows 64-bit we limit the arena to 32GB or 35 bits. // Windows counts memory used by page table into committed memory // of the process, so we can't reserve too much memory. // See https://golang.org/issue/5402 and https://golang.org/issue/5236. // On other 64-bit platforms, we limit the arena to 512GB, or 39 bits. // On 32-bit, we don't bother limiting anything, so we use the full 32-bit address. // On Darwin/arm64, we cannot reserve more than ~5GB of virtual memory, // but as most devices have less than 4GB of physical memory anyway, we // try to be conservative here, and only ask for a 2GB heap. }}} The links are good reading: https://golang.org/issue/5402 and https://golang.org/issue/5236 (scroll down). Note that there is a little bit about Go relying on overcommitting causing problems for people on systems that don't overcommit; this is not relevant to us since we're not actually committing memory and relying on the OS to lazily initialize it if they don't actually use it. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:51 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED -------------------------------------+------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: 5435 | Blocking: Related Tickets: | Differential Revisions: D207 -------------------------------------+------------------------------------- Comment (by ezyang): Simon, are you thinking doing some sort of lazy committing for the 32MB bitmap, so that most of it doesn't need to backed with real memory? I guess I don't see any reason this shouldn't work. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:52 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED -------------------------------------+------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: 5435 | Blocking: Related Tickets: | Differential Revisions: D207 -------------------------------------+------------------------------------- Comment (by simonmar): I believe lazy committing should be the default if you just allocate a chunk of memory with `VirtualAlloc`. The pages will be lazily faulted in as necessary, so the pages we never touch won't consume any memory. We just need to use the convention that a zero bit means "not heap". -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:53 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8199: Get rid of HEAP_ALLOCED on Windows (and other non-Linux platforms) -------------------------------------+------------------------------------- Reporter: ezyang | Owner: ezyang Type: feature request | Status: new Priority: low | Milestone: 8.0.1 Component: Compiler | Version: 7.7 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: 5435 | Blocking: Related Tickets: | Differential Rev(s): D207 Wiki Page: | -------------------------------------+------------------------------------- Changes (by ezyang): * priority: normal => low Old description:
This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC (especially for large/scattered heaps), as long as we can keep the cost of indirections down.
The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct.
Blocking problems:
* Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out.
* ~~Make it work for GHCi linking of static objects.~~ Blocked on #2841, I have it working for ELF, and can make it work for other platforms as soon as I get relevant machines.
* Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead)
* Does not work with unloadObj (see comments below)
Performance improvements possible:
* ~~This patch introduces a lot of new symbols; ensure we are not unduly polluting the symbol tables. (In particular, I think _static_closure symbols can be made hidden).~~ ~~I've eliminated all of these except for the init symbols, which cross the stub object and assembly file boundary, and so would need to be made invisible by the linker.~~ I needed to make local info tables public.
* Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working.
* The old HEAP_ALLOCED is supposed to be pessimal on very large heaps. Do some performance tests under those workloads.
* Make sure the extra indirection is not causing any C-- optimizations to stop firing (it might be, because I put it in as a literal CmmLoad)
* Once an static thunk is updated, we can tag the indirection to let other code segments to know about the good news. One way to do this is have the update frame for a static indirection should have a reference to the *indirection*, not the closure itself. However, this scheme will not affect other static closures which have references to the thunk.
* Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable).
* We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun.
* ~~Save tags inside the indirection tables, so that we don't spend instructions retagging after the following the indirection.~~ Done.
* ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves.
* ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures.
Speculative improvements:
* Now that static lives in a block, can we GC them like we GC normal data? This would be beneficial for static thunks, which now can have their indirections completely removed; reverting CAFs may be somewhat tricky, however.
* Now that HEAP_ALLOCED is greatly simplified, can we further simply some aspects of the GC? At the very least, we ought to be able to make megablock allocation cheaper, by figuring out how to remove the spinlocks, etc.
* Another possibility is to adopt a hybrid approach, where we manually lay out closures when compiling statically, and indirect when compiling dynamically. In some sense, this gets the best of both worlds, since we expect to not pay any extra cost for indirection due to PIC.
New description: Update for readers: we fixed this for Linux with #9706. But may systems (including Windows) can't handle reserving this much memory. Some other technique is necessary in these cases. This is one possibility. ---- This bug is to track progress of removing HEAP_ALLOCED from GHC, promising faster GC (especially for large/scattered heaps), as long as we can keep the cost of indirections down. The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct. Blocking problems: * Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out. * ~~Make it work for GHCi linking of static objects.~~ Blocked on #2841, I have it working for ELF, and can make it work for other platforms as soon as I get relevant machines. * Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead) * Does not work with unloadObj (see comments below) Performance improvements possible: * ~~This patch introduces a lot of new symbols; ensure we are not unduly polluting the symbol tables. (In particular, I think _static_closure symbols can be made hidden).~~ ~~I've eliminated all of these except for the init symbols, which cross the stub object and assembly file boundary, and so would need to be made invisible by the linker.~~ I needed to make local info tables public. * Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working. * The old HEAP_ALLOCED is supposed to be pessimal on very large heaps. Do some performance tests under those workloads. * Make sure the extra indirection is not causing any C-- optimizations to stop firing (it might be, because I put it in as a literal CmmLoad) * Once an static thunk is updated, we can tag the indirection to let other code segments to know about the good news. One way to do this is have the update frame for a static indirection should have a reference to the *indirection*, not the closure itself. However, this scheme will not affect other static closures which have references to the thunk. * Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable). * We are paying an indirection when a GC occurs when the closure is not in R1. According to the wiki page, technically this is not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun. * ~~Save tags inside the indirection tables, so that we don't spend instructions retagging after the following the indirection.~~ Done. * ~~Improve static indirection and stable pointer registration, avoiding binary bloat from `__attribute(constructor)__` stubs.~~ After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves. * ~~Need to implement a new megablock tracking structure so we can free/check for lost blocks~~. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures. Speculative improvements: * Now that static lives in a block, can we GC them like we GC normal data? This would be beneficial for static thunks, which now can have their indirections completely removed; reverting CAFs may be somewhat tricky, however. * Now that HEAP_ALLOCED is greatly simplified, can we further simply some aspects of the GC? At the very least, we ought to be able to make megablock allocation cheaper, by figuring out how to remove the spinlocks, etc. * Another possibility is to adopt a hybrid approach, where we manually lay out closures when compiling statically, and indirect when compiling dynamically. In some sense, this gets the best of both worlds, since we expect to not pay any extra cost for indirection due to PIC. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8199#comment:55 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC