[GHC] #9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess ------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Keywords: | Operating System: Unknown/Multiple Architecture: Unknown/Multiple | Type of failure: None/Unknown Difficulty: Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | ------------------------------------+------------------------------------- The attached Haskell program is a stress test for `forkProcess`. It starts 100 child processes, each of which do a single, safe, FFI call, after which the main process waits for all child processes to terminate. I compile the test with {{{ # gcc -c -o TestForkProcessC.o -g TestForkProcessC.c # ghc -debug -threaded -fforce-recomp -Wall TestForkProcess.hs TestForkProcessC.o }}} and then start running it until it fails (that is, until one or more of the child processes fail to terminate): {{{ # while ./TestForkProcess +RTS -N1 ; do echo "OK"; done }}} Actually, most of the time this happens pretty quickly (often even on the first call to `TestForkProcess`). Those child processes that do fail to terminate get stuck in an infinite loop in `shutdownCapability`, which looks something like: {{{ void shutdownCapability (Capability *cap, Task *task, rtsBool safe) { nat i; task->cap = cap; for (i = 0; /* i < 50 */; i++) { // ... other conditionals omitted if (cap->suspended_ccalls && safe) { cap->running_task = NULL; RELEASE_LOCK(&cap->lock); // The IO manager thread might have been slow to start up, // so the first attempt to kill it might not have // succeeded. Just in case, try again - the kill message // will only be sent once. ioManagerDie(); yieldThread(); continue; } traceSparkCounters(cap); RELEASE_LOCK(&cap->lock); break; } } }}} (note that I'm only considering the threaded RTS). In the child processes that loop indefinitely this `cap->suspended_ccalls && safe` condition gets triggered time and again. When it does, it gets stuck waiting for a single `InCall`. This `InCall` is created by a call to `newInCall` in `workerStart` -- i.e., it is created on pthread startup. That begs the question where this worker task was created; this I don't know for sure but I am fairly sure that it happens during the initialization of the IO manager. (The initialization sequence of the IO manager involves the creation of 4 tasks before we even get to `main`, so it's bit a hard to navigate.) I have some further evidence that the I/O manager is involved, although not necessarily the cause of the problem. On normal termination, the I/O manager is asked to shutdown by the call to `ioManagerDie` in `shutdownCapability`, shown above. This will send `IO_MANAGER_DIE` (`0xFE`) on the I/O managers "control pipe" (created in `GHC.Event.Thread.startTimerManagerThread`). When the timer manager thread receives this (in `GHC.Event.TimerManager.handleControlEvent`) it calls `shutdownManagers`, which shuts down the IO manager threads by sending them `io_MANAGER_DIE` on their respective pipes. This gets received by `GHC.Event.Manager.handleControlEvent` and the IO manager threads exit. (Note on capitalization: `IO_MANAGER_DIE` is the C symbol; `io_MANAGER_DIE` is the Haskell symbol.) When the child process fails to terminate, the first part of this process still happens. The timer manager thread receives `IO_MANAGER_DIE` and calls `shutdownManagers`. However, now things go wrong, and it seems they go wrong in one of two ways. The very first thing that `shutdownManagers` does is acquire the `ioManagerLock`. Sometimes it gets stuck right there. However, this is not ''always'' the case. Sometimes it does manage to acquire the lock, and I can see it going through the loop and sending the shutdown signal to the IO manager thread (I'm saying "the" because I've exclusively been testing with `-N1`). Either way, in the case that the child process gets stuck, this signal somehow never arrives at the IO manager thread (that is, I have a print statement in `readControlMessage` that prints a message when it receives `IO_MANAGER_DIE`, along with a bit of information where it was called from, and that print statement never triggers). I am not sure where to go from here. Note that I have only been able to reproduce this on OSX/ghc 7.8. I have not been able to reproduce this problem on Linux/7.8 (although there _are_ other problems with `forkProcess` on Linux, which unfortunately are proving even more elusive). The attached stress test ''does'' very often get stuck on Linux/7.4 but of course that's a different I/O manager altogether and is probably an unrelated bug. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------ Reporter: edsko | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Changes (by AndreasVoellmy): * cc: AndreasVoellmy (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------ Reporter: edsko | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by AndreasVoellmy): I've verified that the test case triggers the bug in my environment (OS X 10.9.4 with GHC HEAD) as well. FWIW, it seems that just sampling the process (i.e. using "/usr/bin/sample" in OS X) causes the stuck process to finish. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------ Reporter: edsko | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by edsko): Ah, yes, perhaps I should have mentioned -- I tried connecting `lldb` to the stuck process and that too was enough to nudge it to completion. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------ Reporter: edsko | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by AndreasVoellmy): I just verified that the bug occurs even if you remove the call to `safeFfiFunction 2` in main, and replace it with `return ()`. It takes longer to trigger the problem, but it does still occur. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------ Reporter: edsko | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by AndreasVoellmy): I haven't yet nailed it down, but here is my current guess as to what is happening when it gets stuck... The child process's IO manager thread is blocked in the foreign call on kevent indefinitely (as you describe). The intention is that `ioManagerDie()` writes to the control pipe that the kevent is subscribed to, and this write will cause the kevent call to return from its foreign call. It seems that the RTS has a file descriptor for the write end of the pipe (set by the call to `c_setIOManagerControlFd` in `GHC.Event.Control`) and so does the event manager on the HS side. I suspect that these get out of sync after the fork and exec; i.e. `ioManagerDie()` writes to some file descriptor that no longer corresponds to the write end of the pipe that kevent is waiting on. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------ Reporter: edsko | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------ Comment (by edsko): Yeah, I thought about that too, but I couldn't see anything different in the file descriptors in deadlocking runs and non-deadlocking runs -- but that was purely based on the file descriptor _numbers_ (9, 4, 8), so you may of course still be right. I didn't manage to verify or refute this :/ -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: | Blocked By: None/Unknown | Related Tickets: 9377 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Changes (by edsko): * related: => 9377 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: | Blocked By: None/Unknown | Related Tickets: #9377 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Changes (by edsko): * related: 9377 => #9377 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: | Blocked By: None/Unknown | Related Tickets: #9377 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Changes (by snoyberg): * cc: snoyberg (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: | Blocked By: None/Unknown | Related Tickets: #9377 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by simonmar): Isn't this fixed now? https://phabricator.haskell.org/D99 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: | Blocked By: None/Unknown | Related Tickets: #9377 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by edsko): Termination of the I/O manager is still broken in the general case; Andreas has been working on that and he sent me a patch a few days ago but I haven't had a chance to look at it it -- but the problem is that we will wait indefinitely for the I/O manager to terminate whenever we do a "safe" exit from the RTS (call to `hs_exit` rather than `hs_exit_`). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: | Blocked By: None/Unknown | Related Tickets: #9377 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by AndreasVoellmy): The issue was that `forkProcess` calls `hs_exit_(rtsTrue)`, which waits for foreign calls to return. Then, the foreign calls made by IO manager threads were not woken up (the cause of this is a bit complicated so I'll explain that elsewhere). The solution in the case of forkProcess is to call `hs_exit(rtsFalse)` which is what normal exit of the Haskell RTS does anyway. However, some clients (e.g. when writing a C program that start and exit GHC RTS) may call `hs_exit()`, which calls `hs_exit_(rtsTrue)`, so the problem of waking up the foreign calls made by IO manager threads remains an issue. I now have a fix for this issue as well. So now the question is: should we close this issue, since it is specific to `forkProcess` and create a separate issue for programs calling `hs_exit()`? Or should we solve both of these issues in this issue? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: | Blocked By: None/Unknown | Related Tickets: #9377 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by AndreasVoellmy): I created a new issue #9423 for the problem when caused by a call to `hs_exit`. Now we can treat the issue here as being specific to `forkProcess`. Since we have a fix for this (I believe) we can close this ticket independently of #9423. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: closed Priority: normal | Milestone: Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: | Blocked By: None/Unknown | Related Tickets: #9377 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Changes (by simonmar): * status: new => closed * resolution: => fixed Comment: Fixed by Phab:D99 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess
-------------------------------------+-------------------------------------
Reporter: edsko | Owner:
Type: bug | Status: closed
Priority: normal | Milestone:
Component: Compiler | Version: 7.8.2
Resolution: fixed | Keywords:
Operating System: | Architecture: Unknown/Multiple
Unknown/Multiple | Difficulty: Unknown
Type of failure: | Blocked By:
None/Unknown | Related Tickets: #9377
Test Case: |
Blocking: |
Differential Revisions: |
-------------------------------------+-------------------------------------
Comment (by Austin Seipp

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: closed Priority: normal | Milestone: 7.8.4 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: | Blocked By: None/Unknown | Related Tickets: #9377 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Changes (by carter): * milestone: => 7.8.4 Comment: Adding a 7.8.4 milestone so this can be considered for the 7.8.4 release, i think this was only applied to the 7.9 (HEAD) branch -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: merge Priority: normal | Milestone: 7.8.4 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: | Blocked By: None/Unknown | Related Tickets: #9377 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Changes (by thoughtpolice): * status: closed => merge -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: closed Priority: normal | Milestone: 7.8.4 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: | Blocked By: None/Unknown | Related Tickets: #9377 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Changes (by thoughtpolice): * status: merge => closed Comment: Merged to `ghc-7.8` (after a bit of pain). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: closed Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: | Blocked By: None/Unknown | Related Tickets: #9377 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Changes (by thoughtpolice): * milestone: 7.8.4 => 7.10.1 Comment: I'm actually backing this out of 7.8 right now due to some worries about #9722; it also introduces API additions in 7.8.4 we should be careful about. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: closed Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: Operating System: | Architecture: Unknown/Multiple Unknown/Multiple | Difficulty: Unknown Type of failure: | Blocked By: None/Unknown | Related Tickets: #9377 Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by carter): :( ok -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: closed Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: None/Unknown | Unknown/Multiple Blocked By: | Test Case: Related Tickets: #9377 | Blocking: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by edsko): Guys, I'm not sure that we deal with this entirely correctly still. Sadly, I have no test case other than to tell you that I was just compiling a project with 7.10 and I got: {{{ ghc-stage2: ioManagerWakeup: write: Bad file descriptor }}} :-/ -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:21 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9284: shutdownCapability sometimes loops indefinitely on OSX after forkProcess -------------------------------------+------------------------------------- Reporter: edsko | Owner: Type: bug | Status: closed Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.8.2 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: None/Unknown | Unknown/Multiple Blocked By: | Test Case: Related Tickets: #9377 | Blocking: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by simonmar): @edsko: head on over to #9722 with that. There is possibly a nondeterministic bug in the shutdown code that was introduced with the latest fix, but we haven't been able to get a reproducible enough test case. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9284#comment:22 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC