
#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess
------------------------------------+-------------------------------------
       Reporter:  edsko             |             Owner:
           Type:  bug               |            Status:  new
       Priority:  normal            |         Milestone:
      Component:  Compiler          |           Version:  7.8.2
       Keywords:                    |  Operating System:  Unknown/Multiple
   Architecture:  Unknown/Multiple  |   Type of failure:  None/Unknown
     Difficulty:  Unknown           |         Test Case:
     Blocked By:                    |          Blocking:
Related Tickets:                    |
------------------------------------+-------------------------------------

The attached Haskell program is a stress test for `forkProcess`. It starts 100 child processes, each of which makes a single safe FFI call, after which the main process waits for all child processes to terminate. I compile the test with

{{{
# gcc -c -o TestForkProcessC.o -g TestForkProcessC.c
# ghc -debug -threaded -fforce-recomp -Wall TestForkProcess.hs TestForkProcessC.o
}}}

and then run it repeatedly until it fails (that is, until one or more of the child processes fail to terminate):

{{{
# while ./TestForkProcess +RTS -N1 ; do echo "OK"; done
}}}

Most of the time this happens pretty quickly (often even on the first call to `TestForkProcess`).

Those child processes that fail to terminate get stuck in an infinite loop in `shutdownCapability`, which looks something like:

{{{
void
shutdownCapability (Capability *cap, Task *task, rtsBool safe)
{
    nat i;

    task->cap = cap;

    for (i = 0; /* i < 50 */; i++) {
        // ... other conditionals omitted

        if (cap->suspended_ccalls && safe) {
            cap->running_task = NULL;
            RELEASE_LOCK(&cap->lock);
            // The IO manager thread might have been slow to start up,
            // so the first attempt to kill it might not have
            // succeeded.  Just in case, try again - the kill message
            // will only be sent once.
            ioManagerDie();
            yieldThread();
            continue;
        }

        traceSparkCounters(cap);
        RELEASE_LOCK(&cap->lock);
        break;
    }
}
}}}

(Note that I'm only considering the threaded RTS.)

In the child processes that loop indefinitely, this `cap->suspended_ccalls && safe` condition is triggered time and again. When it is, the loop is stuck waiting for a single `InCall`. This `InCall` is created by a call to `newInCall` in `workerStart`; that is, it is created on pthread startup. That raises the question of where this worker task was created. I don't know for sure, but I am fairly sure that it happens during the initialization of the IO manager. (The initialization sequence of the IO manager involves the creation of 4 tasks before we even get to `main`, so it's a bit hard to navigate.)

I have some further evidence that the I/O manager is involved, although it is not necessarily the cause of the problem. On normal termination, the I/O manager is asked to shut down by the call to `ioManagerDie` in `shutdownCapability`, shown above. This sends `IO_MANAGER_DIE` (`0xFE`) on the I/O manager's "control pipe" (created in `GHC.Event.Thread.startTimerManagerThread`). When the timer manager thread receives this (in `GHC.Event.TimerManager.handleControlEvent`) it calls `shutdownManagers`, which shuts down the IO manager threads by sending them `io_MANAGER_DIE` on their respective pipes. This is received by `GHC.Event.Manager.handleControlEvent`, and the IO manager threads exit. (Note on capitalization: `IO_MANAGER_DIE` is the C symbol; `io_MANAGER_DIE` is the Haskell symbol.)

When the child process fails to terminate, the first part of this process still happens: the timer manager thread receives `IO_MANAGER_DIE` and calls `shutdownManagers`. After that, however, things go wrong, and they seem to go wrong in one of two ways. The very first thing that `shutdownManagers` does is acquire the `ioManagerLock`. Sometimes it gets stuck right there. However, this is not ''always'' the case.
Sometimes it does manage to acquire the lock, and I can see it going through the loop and sending the shutdown signal to the IO manager thread (I'm saying "the" because I've exclusively been testing with `-N1`). Either way, when the child process gets stuck, this signal somehow never arrives at the IO manager thread. (I have a print statement in `readControlMessage` that prints a message when it receives `IO_MANAGER_DIE`, along with a bit of information about where it was called from, and that print statement never triggers.)

I am not sure where to go from here. Note that I have only been able to reproduce this on OSX/ghc 7.8. I have not been able to reproduce this problem on Linux/7.8 (although there ''are'' other problems with `forkProcess` on Linux, which unfortunately are proving even more elusive). The attached stress test ''does'' very often get stuck on Linux/7.4, but of course that's a different I/O manager altogether, so that is probably an unrelated bug.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler