[GHC] #15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks

27 Aug 2018

      #15571: Eager AP_STACK blackholing causes incorrect size info for sanity checks
-------------------------------------+-------------------------------------
           Reporter:  osa1           |             Owner:  (none)
               Type:  bug            |            Status:  new
           Priority:  normal         |         Milestone:  8.6.1
          Component:  Runtime System |           Version:  8.5
           Keywords:                 |  Operating System:  Unknown/Multiple
       Architecture:                 |   Type of failure:  None/Unknown
  Unknown/Multiple                   |
          Test Case:                 |        Blocked By:
           Blocking:                 |   Related Tickets:  #15508
Differential Rev(s):                 |         Wiki Page:
-------------------------------------+-------------------------------------
 While debugging #15508 I found a case where eager blackholing of an
 AP_STACK causes `closure_sizeW()` to return an incorrect size, which in
 turn causes incorrect slop zeroing by `OVERWRITING_CLOSURE()`, which breaks
 sanity checks.

 To reproduce, cd into `testsuite/tests/concurrent/prog001`, then:

 {{{
 $ ghc-stage2 Mult.hs -fforce-recomp -debug -rtsopts
 $ ./Mult +RTS -DS
 Mult: internal error: checkClosure: stack frame
     (GHC version 8.7.20180825 for x86_64_unknown_linux)
     Please report this as a GHC bug:
 http://www.haskell.org/ghc/reportabug
 zsh: abort (core dumped)  ./Mult +RTS -DS
 }}}

 Here's how the problem occurs:

 1. Allocate an AP_STACK in a generation during a GC.

 2. Evaluate the AP_STACK. The entry code first WHITEHOLEs and then eagerly
    BLACKHOLEs it. At this point the size reported for the AP_STACK closure
    becomes 2 words, because that is the size of a BLACKHOLE (eager or not).

 3. To start a GC the thread calls `threadPaused`, which in line 342
    actually BLACKHOLEs the eager blackhole (is this part really correct?)
    and zeros the slop. But because the eager blackhole has the same size
    as a BLACKHOLE, this doesn't zero the stack frames in the original
    AP_STACK's payload.

 4. In the next GC, the pre-GC sanity check walks the whole heap. When
    checking the generation that the BLACKHOLE (the AP_STACK that became a
    BLACKHOLE in step (2)) resides in, we check the closure and then move on
    to `closure + 2` (2 is the size of a BLACKHOLE) instead of
    `closure + <size of the original AP_STACK>`, so we end up checking a
    stack frame of the original AP_STACK. This makes the sanity check fail
    because we don't expect to see a stack frame outside of a stack.
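
 The four steps above can be sketched with a toy model in C. This is not
 RTS code: the tags, the layouts, and every `toy_*` name are hypothetical
 and only mimic the relevant behaviour, namely that the size used for slop
 zeroing is derived from the closure's current header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: these tags and layouts are hypothetical, not GHC's. */
enum { TOY_SLOP = 0, TOY_AP_STACK = 1, TOY_BLACKHOLE = 2, TOY_STACK_FRAME = 3 };

/* Size in words implied by the header at p, or 0 if p is not a valid
 * heap closure (for example, a stack frame). */
static size_t toy_closure_size(const uintptr_t *p, size_t remaining)
{
    if (remaining == 0) return 0;
    switch (p[0]) {
    case TOY_SLOP:      return 1;  /* zeroed slop is skipped word by word */
    case TOY_AP_STACK:  return remaining >= 2 ? 2 + (size_t)p[1] : 0;
                                   /* header, frame count, then the frames */
    case TOY_BLACKHOLE: return 2;  /* header, indirectee */
    default:            return 0;
    }
}

/* Linear heap walk, as a sanity check would do. Returns 1 on success. */
static int toy_check_heap(const uintptr_t *heap, size_t n_words)
{
    size_t i = 0;
    while (i < n_words) {
        size_t sz = toy_closure_size(&heap[i], n_words - i);
        if (sz == 0 || sz > n_words - i) return 0;  /* e.g. a stray frame */
        i += sz;
    }
    return 1;
}

/* Eager blackholing: only the header changes, the payload is still live. */
static void toy_eager_blackhole(uintptr_t *closure)
{
    closure[0] = TOY_BLACKHOLE;
}

/* Later overwrite: zero the slop, sized from the CURRENT header. */
static void toy_overwrite_with_blackhole(uintptr_t *closure, size_t remaining)
{
    size_t size = toy_closure_size(closure, remaining);
    assert(size >= 2 && size <= remaining);
    for (size_t i = 2; i < size; i++) closure[i] = TOY_SLOP;
    closure[0] = TOY_BLACKHOLE;
    closure[1] = 0;  /* placeholder indirectee */
}

/* Returns the result of the heap check after blackholing the AP_STACK. */
int toy_scenario(int eager)
{
    /* An AP_STACK holding two frames, followed by an unrelated closure. */
    uintptr_t heap[] = { TOY_AP_STACK, 2, TOY_STACK_FRAME, TOY_STACK_FRAME,
                         TOY_BLACKHOLE, 0 };
    size_t n = sizeof heap / sizeof heap[0];
    if (eager) toy_eager_blackhole(&heap[0]);
    toy_overwrite_with_blackhole(&heap[0], n);
    return toy_check_heap(heap, n);
}
```

 In this model `toy_scenario(0)` passes the walk because the two frames are
 zeroed before the header changes, while `toy_scenario(1)` fails it: the
 overwrite sees a 2-word closure, zeroes nothing, and the walk lands on a
 stack frame.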

 In summary, normally when we blackhole an object we zero the space after
 the blackhole (i.e. some part of the original object's payload) so that
 sanity checks can skip over that space. We can't do this when eagerly
 blackholing (because the payload of the original object will still be
 used), which causes sanity check failures.

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15571
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler