[GHC] #15427: Calling hs_try_putmvar from an unsafe foreign call can cause the RTS to hang

21 Jul 2018

---------------------------------------+-------------------------------------
           Reporter:  syntheorem       |             Owner:  (none)
               Type:  bug              |            Status:  new
           Priority:  normal           |         Milestone:  8.6.1
          Component:  Runtime System   |           Version:  8.4.3
           Keywords:                   |  Operating System:  Unknown/Multiple
       Architecture:  Unknown/Multiple |   Type of failure:  Runtime crash
          Test Case:                   |        Blocked By:
           Blocking:                   |   Related Tickets:
Differential Rev(s):                   |         Wiki Page:
---------------------------------------+-------------------------------------
 An unsafe foreign call that calls `hs_try_putmvar` can cause the RTS to
 hang, preventing any Haskell threads from making progress. When the
 program is compiled with `-debug`, it instead fails an assertion in the
 scheduler:

 {{{
 internal error: ASSERTION FAILED: file rts/Schedule.c, line 510

     (GHC version 8.4.3 for x86_64_apple_darwin)
 }}}

 Here is a minimal test case which reproduces the assertion. It needs to be
 built with `-debug -threaded` and run with `+RTS -N2` or higher.

 {{{#!hs
 import Control.Concurrent (forkIO, threadDelay)
 import Control.Concurrent.MVar (MVar, newEmptyMVar, takeMVar)
 import Control.Monad (forever)
 import Foreign.C.Types (CInt(..))
 import Foreign.StablePtr (StablePtr)
 import GHC.Conc (PrimMVar, newStablePtrPrimMVar)

 foreign import ccall unsafe hs_try_putmvar :: CInt -> StablePtr PrimMVar -> IO ()

 main = do
   mvar <- newEmptyMVar

   forkIO $ forever $ do
     takeMVar mvar

   forkIO $ forever $ do
     sp <- newStablePtrPrimMVar mvar
     hs_try_putmvar (-1) sp
     threadDelay 1

   -- Let it spin a few times to trigger the bug
   threadDelay 500
 }}}

 To debug this, I checked out GHC and added the program above as a test
 case. The assertion that fails is `ASSERT(task->cap == cap)`. It appears
 to be triggered by this code in `hs_try_putmvar`:

 {{{#!c
 Task *task = getTask();
 // ...
 ACQUIRE_LOCK(&cap->lock);
 // If the capability is free, we can perform the tryPutMVar immediately
 if (cap->running_task == NULL) {
     cap->running_task = task;
     task->cap = cap;
     RELEASE_LOCK(&cap->lock);
     // ...
     releaseCapability(cap);
 } else {
     // ...
 }
 }}}

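 In other words, `hs_try_putmvar` assumes that the calling thread's task
 does not currently hold a capability. During an unsafe foreign call the
 task still holds the capability it was running on, so the function takes
 a second capability, overwrites `task->cap`, and then releases the new
 capability without restoring the previous value of `task->cap`.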

 Modifying the code to restore the value of `task->cap` after releasing the
 capability makes the assertion pass. But I don't know enough about the RTS
 to be sure I'm not missing something here. In particular, is it a problem
 for the task to hold two capabilities for a short time?

 My other thought is that maybe it should check if its task is currently
 running a capability, and in that case do something else. But I'm not sure
 what.

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15427
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler