[GHC] #15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks

#15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks -------------------------------------+------------------------------------- Reporter: osa1 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Runtime | Version: 8.5 System | Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: None/Unknown Unknown/Multiple | Test Case: | Blocked By: Blocking: | Related Tickets: #15508 Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- While debugging #15508 I found a case where eager blackholing in AP_STACK causes `closure_sizeW()` to return incorrect size, which in turn causes incorrect slop zeroing by `OVERWRITING_CLOSURE()`, which breaks sanity checks. To reproduce, cd into `testsuite/tests/concurrent/prog001`, then: {{{ $ ghc-stage2 Mult.hs -fforce-recomp -debug -rtsopts $ ./Mult +RTS -DS Mult: internal error: checkClosure: stack frame (GHC version 8.7.20180825 for x86_64_unknown_linux) Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug zsh: abort (core dumped) ./Mult +RTS -DS }}} Here's how the problem occurs: 1. Allocate an AP_STACK in a generation during a GC. 2. Evaluate the AP_STACK. The entry code first WHITEHOLEs and then eagerly BLACKHOLEs it. At this point size of the STACK becomes 2 because that's the size of (eager or not) BLACKHOLE. 3. To start a GC the thread does `threadPaused`, which in line 342 actually BLACKHOLEs the eager blackhole (is this part really correct?) and zeros the slop, but because the eager blackhole has the same size as BLACKHOLE it doesn't actually zero the stack frames in the original AP_STACK's payload. 4. In the next GC, in pre-GC sanity check we check the whole heap. When checking the generation that the BLACKHOLE (the AP_STACK that became a BLACKHOLE in step (2)) resides in we check the closure, and then check `closure + 2` (2 is the size of BLACKHOLE) instead of `closure + <size of the stack>`, and end up checking a stack frame of the original AP_STACK. This causes the sanity check to fail because we don't expect to see a stack frame outside of a stack. In summary, normally when blackhole an object we zero the space after the blackhole (i.e. some part of the original object's payload) so that in sanity checks we can skip over that space, but we can't do this when eagerly blackholing (because the payload of the original object will be used) which causes sanity check failures. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15571 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks -------------------------------------+------------------------------------- Reporter: osa1 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Runtime System | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #15508 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonmar): Isn't this a problem in general for eager blackholing? I can't think of a good fix off the top of my head. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15571#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks -------------------------------------+------------------------------------- Reporter: osa1 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Runtime System | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #15508 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by osa1): We discussed this. Simon is right in comment:1 that this is problem with eager blackholing in general. However, eager blackholing is optional, so if you're working on the RTS and you need sanity checks can always not use it (which is the default). The problem with AP_STACK eager blackholing is more serious as it can't be disabled (expect perhaps in non-threaded runtime?) Anyways, we came up with this plan: - Implement a new stack frame AP_STACK_EAGER_BLACKHOLE which is exactly like the EAGER_BLACKHOLE but indicates that the object that became an eager blackhole was an AP_STACK. - (Only in debug runtime) Allocate one more word when allocating an AP_STACK and record its size. Note that this is only possible because we only allocate AP_STACKs in runtime (and not in generated code). - When eagerly blackholing an AP_STACK use AP_STACK_EAGER_BLACKHOLE instead of EAGER_BLACKHOLE and record the AP_STACK's size. - When we actually BLACKHOLE the AP_STACK_EAGER_BLACKHOLE in `threadPaused` we look at the size of the object and we can correctly return the `AP_STACK` size in `closure_sizeW()` because we recorded it in step (2). Simon, did I get this right? One thing I'm not sure (and forgot to ask at the meeting) is (3) in the bug report. I don't know if we're supposed to zero the slop when blackholing an AP_STACK eager blackhole in `threadPaused`. We enter the AP_STACK right after eagerly blackholing it, but I'm not sure if we can call `threadPaused` before being done with the stack. Can you comment on this? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15571#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks -------------------------------------+------------------------------------- Reporter: osa1 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Runtime System | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #15508 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonmar): Hmm, I wonder if there's a much simpler way. The AP_STACK entry code simply copies the payload of the object onto the stack, after doing a stack check. So why don't we zero the slop immediately after copying the payload? That way we don't overwrite any data, and we don't need any new objects or stack frames. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15571#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks -------------------------------------+------------------------------------- Reporter: osa1 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Runtime System | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #15508 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonmar): Zero the slop right here: https://phabricator.haskell.org/diffusion/GHC/browse/master/rts%2FApply.cmm$... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15571#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks -------------------------------------+------------------------------------- Reporter: osa1 | Owner: (none) Type: bug | Status: patch Priority: normal | Milestone: 8.6.1 Component: Runtime System | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #15508 | Differential Rev(s): Phab:D5165 Wiki Page: | -------------------------------------+------------------------------------- Changes (by osa1): * status: new => patch * differential: => Phab:D5165 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15571#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks -------------------------------------+------------------------------------- Reporter: osa1 | Owner: (none) Type: bug | Status: patch Priority: normal | Milestone: 8.6.1 Component: Runtime System | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #15508 | Differential Rev(s): Phab:D5165 Wiki Page: | -------------------------------------+------------------------------------- Comment (by osa1): I submitted a patch. Simon, looking at this note {{{ Note [zeroing slop] In some scenarios we write zero words into "slop"; memory that is left unoccupied after we overwrite a closure in the heap with a smaller closure. Zeroing slop is required for: - full-heap sanity checks (DEBUG, and +RTS -DS) - LDV profiling (PROFILING, and +RTS -hb) Zeroing slop must be disabled for: - THREADED_RTS with +RTS -N2 and greater, because we cannot overwrite slop when another thread might be reading it. Hence, slop is zeroed when either: - PROFILING && era <= 0 (LDV is on) - !THREADED_RTS && DEBUG And additionally: - LDV profiling and +RTS -N2 are incompatible - full-heap sanity checks are disabled for THREADED_RTS }}} This says that we can't zero slops in threaded runtime and heap sanity checks are disabled in threaded runtime, but that's not true. I can see in gdb that we do full heap sanity check in threaded programs too. Do you know what changed since this comment? I'm surprised that sanity checks work at all in threaded runtime because we don't zero the slops... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15571#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks -------------------------------------+------------------------------------- Reporter: osa1 | Owner: (none) Type: bug | Status: patch Priority: normal | Milestone: 8.6.1 Component: Runtime System | Version: 8.5 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #15508 | Differential Rev(s): Phab:D5165 Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonmar): If I recall correctly we disable the part of the sanity check that requires zeroing the slop when THREADED_RTS is on. See here: https://phabricator.haskell.org/diffusion/GHC/browse/master/rts%2Fsm%2FSanit... So full heap sanity check only happens after a major GC in the threaded runtime, and we don't do slop zeroing. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15571#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks
-------------------------------------+-------------------------------------
Reporter: osa1 | Owner: (none)
Type: bug | Status: patch
Priority: normal | Milestone: 8.6.1
Component: Runtime System | Version: 8.5
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: #15508 | Differential Rev(s): Phab:D5165
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ömer Sinan Ağacan

#15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks -------------------------------------+------------------------------------- Reporter: osa1 | Owner: (none) Type: bug | Status: closed Priority: normal | Milestone: Component: Runtime System | Version: 8.5 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #15508 | Differential Rev(s): Phab:D5165 Wiki Page: | -------------------------------------+------------------------------------- Changes (by osa1): * status: patch => closed * resolution: => fixed * milestone: 8.6.1 => -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15571#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks -------------------------------------+------------------------------------- Reporter: osa1 | Owner: (none) Type: bug | Status: merge Priority: normal | Milestone: 8.6.2 Component: Runtime System | Version: 8.5 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #15508 | Differential Rev(s): Phab:D5165 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * status: closed => merge * milestone: => 8.6.2 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15571#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks -------------------------------------+------------------------------------- Reporter: osa1 | Owner: (none) Type: bug | Status: closed Priority: normal | Milestone: 8.6.2 Component: Runtime System | Version: 8.5 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: #15508 | Differential Rev(s): Phab:D5165 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * status: merge => closed Comment: Merged to `ghc-8.6` with 10e3125d4f6ea0684cbe1315b35a2a213d2765cd. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15571#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC