[GHC] #9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess

8 Jul 2014

      #9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess
------------------------------------+-------------------------------------
       Reporter:  edsko             |             Owner:
           Type:  bug               |            Status:  new
       Priority:  normal            |         Milestone:
      Component:  Compiler          |           Version:  7.8.2
       Keywords:                    |  Operating System:  Unknown/Multiple
   Architecture:  Unknown/Multiple  |   Type of failure:  None/Unknown
     Difficulty:  Unknown           |         Test Case:
     Blocked By:                    |          Blocking:
Related Tickets:                    |
------------------------------------+-------------------------------------
 The attached Haskell program is a stress test for `forkProcess`. It starts
 100 child processes, each of which does a single, safe, FFI call, after
 which the main process waits for all child processes to terminate.

 I compile the test with

 {{{
 # gcc -c -o TestForkProcessC.o -g TestForkProcessC.c
 # ghc -debug -threaded -fforce-recomp -Wall TestForkProcess.hs TestForkProcessC.o
 }}}

 and then start running it until it fails (that is, until one or more of
 the child processes fail to terminate):

 {{{
 # while ./TestForkProcess +RTS -N1 ; do echo "OK"; done
 }}}

 In practice this usually happens quickly, often on the very first run of
 `TestForkProcess`.

 Those child processes that do fail to terminate get stuck in an infinite
 loop in `shutdownCapability`, which looks something like:

 {{{
 void shutdownCapability (Capability *cap, Task *task, rtsBool safe)
 {
     nat i;
     task->cap = cap;

     for (i = 0; /* i < 50 */; i++) {
         // ... other conditionals omitted

         if (cap->suspended_ccalls && safe) {
             cap->running_task = NULL;
             RELEASE_LOCK(&cap->lock);
             // The IO manager thread might have been slow to start up,
             // so the first attempt to kill it might not have
             // succeeded.  Just in case, try again - the kill message
             // will only be sent once.
             ioManagerDie();
             yieldThread();
             continue;
         }

         traceSparkCounters(cap);
         RELEASE_LOCK(&cap->lock);
         break;
     }
 }
 }}}

 (Note that I'm only considering the threaded RTS.) In the child processes
 that loop indefinitely, this `cap->suspended_ccalls && safe` condition
 holds on every iteration, so the loop never reaches the `break`.

 When it does, it gets stuck waiting for a single `InCall`. This `InCall`
 is created by a call to `newInCall` in `workerStart` -- i.e., it is
 created on pthread startup. That raises the question of where this worker
 task was created; this I don't know for sure but I am fairly sure that it
 happens during the initialization of the IO manager. (The initialization
 sequence of the IO manager involves the creation of 4 tasks before we even
 get to `main`, so it's a bit hard to navigate.)

 I have some further evidence that the I/O manager is involved, although
 not necessarily the cause of the problem. On normal termination, the I/O
 manager is asked to shut down by the call to `ioManagerDie` in
 `shutdownCapability`, shown above. This will send `IO_MANAGER_DIE`
 (`0xFE`) on the I/O manager's "control pipe" (created in
 `GHC.Event.Thread.startTimerManagerThread`). When the timer manager thread
 receives this (in `GHC.Event.TimerManager.handleControlEvent`) it calls
 `shutdownManagers`, which shuts down the IO manager threads by sending
 them `io_MANAGER_DIE` on their respective pipes. This gets received by
 `GHC.Event.Manager.handleControlEvent` and the IO manager threads exit.
 (Note on capitalization: `IO_MANAGER_DIE` is the C symbol;
 `io_MANAGER_DIE` is the Haskell symbol.)

 When the child process fails to terminate, the first part of this process
 still happens. The timer manager thread receives `IO_MANAGER_DIE` and
 calls `shutdownManagers`. However, now things go wrong, and it seems they
 go wrong in one of two ways. The very first thing that `shutdownManagers`
 does is acquire the `ioManagerLock`. Sometimes it gets stuck right there.
 However, this is not ''always'' the case. Sometimes it does manage to
 acquire the lock, and I can see it going through the loop and sending the
 shutdown signal to the IO manager thread (I'm saying "the" because I've
 exclusively been testing with `-N1`). Either way, when the child process
 gets stuck, this signal never arrives at the IO manager thread: I have a
 print statement in `readControlMessage` that prints a message when it
 receives `IO_MANAGER_DIE`, along with a bit of information about where it
 was called from, and that print statement never triggers.

 I am not sure where to go from here. Note that I have only been able to
 reproduce this on OSX/ghc 7.8. I have not been able to reproduce this
 problem on Linux/7.8 (although there ''are'' other problems with
 `forkProcess` on Linux, which unfortunately are proving even more
 elusive). The attached stress test ''does'' very often get stuck on
 Linux/7.4 but of course that's a different I/O manager altogether and is
 probably an unrelated bug.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler