[Git][ghc/ghc][wip/io-manager-deadlock-detection] 25 commits: Make the IOManager API use CapIOManager rather than Capability
Duncan Coutts pushed to branch wip/io-manager-deadlock-detection at Glasgow Haskell Compiler / GHC Commits: 6755737b by Duncan Coutts at 2025-11-15T21:18:33+00:00 Make the IOManager API use CapIOManager rather than Capability This makes the API somewhat more self-contained and more consistent. Now the IOManager API and each of the backends takes just the I/O manager structure. Previously we had a bit of a mixture, depending on whether the function needed access to the Capability or just the CapIOManager. We still need access to the cap, so we introduce a back reference to reach the capability, via iomgr->cap. Convert all uses in select and poll backends, but not win32 ones. Convert callers in the scheduler and elsewhere. Also convert the three CMM primops that call IOManager APIs. They just need to use Capability_iomgr(MyCapability()). - - - - - c62f7d73 by Duncan Coutts at 2025-11-15T21:18:34+00:00 Split posix/MIO.c out of posix/Signals.c The MIO I/O manager was secretly living inside the Signals file. Now it gets its own file, like any other self-respecting I/O manager. - - - - - 6a7e3be8 by Duncan Coutts at 2025-11-15T21:18:34+00:00 Rationalise some scheduler run queue utilities Move them all to the same place in the file. Make some static that were used only internally. Also remove a redundant assignment after calling truncateRunQueue that is already done within truncateRunQueue. - - - - - 3bb8999e by Duncan Coutts at 2025-11-15T21:18:34+00:00 Rename initIOManager{AfterFork} to {re}startIOManager These are more accurate names, since these actions happen after initialisation and are really about starting (or restarting) background threads. - - - - - 64397059 by Duncan Coutts at 2025-11-15T21:18:34+00:00 Add a TODO to the MIO I/O manager The direction of travel is to make I/O managers per-capability and have all their state live in the struct CapIOManager. The MIO I/O manager however still has a number of global variables. It's not obvious how handle these globals however. - - - - - bc8a0ddc by Duncan Coutts at 2025-11-15T21:18:34+00:00 Free per-cap I/O managers during shutdown and forkProcess Historically this was not strictly necessary. The select and win32 legacy I/O managers did not maintain any dynamically allocated resources. The new poll one does (an auxillary table), and so this should be freed. After forkProcess, all threads get deleted. This includes threads waiting on I/O or timers. So as of this patch, resetting the I/O manager is just about tidying things up. For example, for the poll I/O manager this will reset the size of the AIOP table (which otherwise grows but never shrinks). In future however the re-initialising will become neeecessary for functionality, since some I/O managers will need to re-initialise wakeup fds that are set CLOEXEC. - - - - - f23f321f by Duncan Coutts at 2025-11-15T21:18:34+00:00 Add an FdWakup module for posix I/O managers This will be used to implement wakeupIOManager for in-RTS I/O managers. It provides a notification/wakeup mechanism using FDs, suitable for situations when I/O managers are blocked on a set of fds anyway. - - - - - 35a9653d by Duncan Coutts at 2025-11-15T21:18:34+00:00 Add wakeupIOManager support for select I/O manager Uses the FdWakup mechanism. - - - - - adf29463 by Duncan Coutts at 2025-11-15T21:18:34+00:00 Add wakeupIOManager support for poll I/O manager Uses the FdWakup mechanism. A quirk we have to cope with is that we now need to poll one more fd -- the wakeup_fd_r -- but this fd has no corresponding entry in the aiop_table. This is awkward since we have set up our aiop_poll_table to be an auxilliary table with matching indicies. The solution this patch uses (and described in the comments) is to have two tables: struct pollfd *aiop_poll_table, *full_poll_table; and to have the aiop_poll_table alias the tail of the full_poll_table. The head entry in the full_poll_table is the extra fd. So we poll the full_poll_table, while the aiop_poll_table still has matching indicies with the aiop_table. Hurrah for C aliasing rules. - - - - - 30b7fe6b by Duncan Coutts at 2025-11-15T21:18:34+00:00 Add wakeupIOManager support for win32 legacy I/O manager - - - - - 14f56d0e by Duncan Coutts at 2025-11-15T21:18:34+00:00 wakeupIOManager is now required for all I/O managers We are going to rely on it. Previously it could be a no-op. Update the docs in the header file. Also, temporarily disable awaitCompletedTimeoutsOrIO post-condition assertion. It will become more complicated due to wakeupIOManager, and it's not yet clear how to express it. We will re-introduce a post condition after a few more changes. - - - - - 2b8c1604 by Duncan Coutts at 2025-11-15T21:18:34+00:00 Make signal handling be a respondibility of the I/O manager(s) Previously it was scattered between I/O managers and the scheduler, and especially the scheduler's deadlock detection. Previously the scheduler would poll for pending signals each iteration of the scheduler loop. The scheduler also had some hairy signal functionality in the deadlock detection: in the non-threaded RTS (only) if there were still no threads running after deadlock detection then it would block waiting for signals. But signals can and (in my opinion) should be thought of as just a funny kind of I/O, and thus should be a responsibility of the I/O manager. So now we have the I/O managers poll for signals when they are polling for I/O completion (and removing the separate poll in the scheduler). And when I/O managers block waiting for I/O then they now also start signal handlers if they get interrupted by a signal. Crucially, if there is no pending I/O or timers, the awaitCompletedTimeoutsOrIO will still block waiting for signals. This patch puts us into an intermediate state: it temporarily breaks deadlock detection in the non-threaded RTS. The waiting on I/O currently happens before deadlock detection. This means we'll now wait forever on signals before doing deadlock detection. We need to move waiting after deadlock detection. We'll do that in a later patch. - - - - - c8f52a52 by Duncan Coutts at 2025-11-15T21:18:34+00:00 Clean up signal handling internal API Now that the I/O manager is responsible for signals, we can simplify the API we present for signal handling. We now just need startPendingSignalHandlers, which is called from the I/O managers. We can get rid of awaitUserSignals. We also don't need RtsSignals.h to re-export the platform-specific posix/Signals.h or win32/ConsoleHandler.h We can also hide more of the implementation of signals. Less has to be exposed in posix/Signals.h or win32/ConsoleHandler.h. Partly this is because we don't need inline functions (or macros) in the interface. Also remove signal_handlers from RTS ABI exported symbols list. It does not appear to have any users in the core libs, and its really an internal implementation detail. It should not be exposed unless its really necessary. - - - - - 626d112c by Duncan Coutts at 2025-11-15T21:18:34+00:00 In the scheduler, move I/O blocking after deadlock detection To make deadlock detection effective in the non-threaded RTS when there are deadlocked threads and other unrelated threads waiting on I/O, we need to arrange to do deadlock detection before we block in scheduler to wait on I/O. The solution is to: 1. adjust scheduleFindWork, which runs before deadlock detection, to only poll for I/O and not block; and 2. add a step after deadlock detection to wait on I/O if there are still no threads to run (and there's any I/O or timeouts outstanding) The scheduleCheckBlockedThreads is now so simple that it made more sense to inline it into scheduleFindWork. - - - - - ea1b84b2 by Duncan Coutts at 2025-11-15T21:18:34+00:00 Remove bogus anyPendingTimeoutsOrIO guard from scheduleDetectDeadlock The deadlock detection was only invoked if both of these conditions hold: 1. the run queue is empty 2. there is no pending I/O or timeouts The second condition is unnecessary. The deadlock detection mechanism can find deadlocks even if there are other threads waiting on I/O or timers. Having this extra condition means that we fail to detect blocked threads if there are any threads waiting on I/O or timers. Part of fixing issue #26408 - - - - - eefcdfea by Duncan Coutts at 2025-11-15T21:18:34+00:00 Don't consider pending I/O for early context switch optimisation Context switches are normally initiated by the timer signal. If however the user specifies "context switch as often as possible", with +RTS -C0 then the scheduler arranges for an early context switch (when it's just about to run a Haskell thread). Context switching very often is expensive, so as an optimisation there cases where we do not arrange an early context switch: 1. if there's no other threads to run 2. if there is no pending I/O or timers This patch eliminates case 2, leaving only case 1. The rationale is as follows. The use of this was inconsistent across platforms and threaded/non-threaded RTS ways. It only worked on the non-threaded RTS and on Windows only worked for the win32-legacy I/O manager. On all other combinations anyPendingTimeoutsOrIO would always return false. The fact that nobody noticed and complained about this inconsistency suggests that the feature is not relied upon. If however it turns out that applications do rely on this, then the proper thing to do is not to restore this check, but to add a new I/O manager hint function that returns if there is any pending events that are likely to happen *soon*: for example timeouts expiring within one timeslice, or I/O waits on things likely to complete soon like disk I/O, but not for example socket/pipe I/O. The motivation to avoid this use of anyPendingTimeoutsOrIO is to allow us to eliminate anyPendingTimeoutsOrIO entirely. All other uses of this are just guards on {await,poll}CompletedTimeoutsOrIO and the guards can safely be folded into those functions. This will better cope with some I/O managers having no proper implementation of anyPendingTimeoutsOrIO. Ultimately this will let us simplify the scheduler which currently has to have special #ifdef mingw32_HOST_OS cases to cope with the lack of a working anyPendingTimeoutsOrIO for some Windows I/O managers - - - - - 2fee1ad9 by Duncan Coutts at 2025-11-15T21:18:34+00:00 Remove anyPendingTimeoutsOrIO guarding {poll,await}CompletedTimeoutsOrIO Previously the API of the I/O manager used a two step process: check anyPendingTimeoutsOrIO and then call {poll,await}CompletedTimeoutsOrIO. This was primarily there as a performance thing, to cheaply check if we need to do anything. And then because anyPendingTimeoutsOrIO existed, it was used for other things too. We have now eliminated the other uses, and are just left with the performance pattern. But this was problematic because not all I/O managers correctly implement anyPendingTimeoutsOrIO (specifically the win32 ones), and now that we also make I/O managers responsible for signals then we need to poll/await even if there is no pending I/O or timeouts. If there is no pending I/O or timeouts then poll/await needs to degenerate to just waiting forever for any signals. - - - - - 306c7616 by Duncan Coutts at 2025-11-15T21:18:34+00:00 Remove anyPendingTimeoutsOrIO, it is no longer used And this avoids the problems arising from the win32 I/O managers having had a bogus implementation. - - - - - edd9dbf9 by Duncan Coutts at 2025-11-15T21:18:34+00:00 Remove second scheduler call to awaitCompletedTimeoutsOrIO Previously awaitCompletedTimeoutsOrIO was called both before and after deadlock detection in the scheduler. The reason for that was that the win32 I/O managers had a bogus implementation of anyPendingTimeoutsOrIO and this was used to guard the call of awaitCompletedTimeoutsOrIO prior to deadlock detection. This meant the first call site was never actually called when using the win32 I/O managers. This was the reason for the second call: the first one was never used. What a mess. So now we have a simple design in the scheduler: 1. poll for completed I/O, timers or signals 2. if no runnable threads: do deadlock detection 3. if still no runnable threads: block waiting for I/O, timers or signals. - - - - - 27470f83 by Duncan Coutts at 2025-11-15T21:18:34+00:00 Lift emptyRunQueue guard out of scheduleDetectDeadlock this improved the clarity of the logic when reading the scheduler code. - - - - - b927bf9c by Duncan Coutts at 2025-11-15T21:18:34+00:00 Make non-threaded deadlock detection also rely on idle GC Only do deadlock detection GC when idle GC kicks in. This also relies on using wakeUpRts, so now do this unconditionally. Previously wakeUpRts was for the threaded rts only. - - - - - 6f9d9d20 by Duncan Coutts at 2025-11-15T21:18:34+00:00 Enable idle GC by default on non-threaded RTS. The behaviour is now uniform between threaded and non-threaded. The deadlock detection now relies on idle GC for both threaded and non-threaded ways. Previously deadlock detection did not rely on idle GC for the non-threaded way. - - - - - 3d4bc7d3 by Duncan Coutts at 2025-11-15T21:18:34+00:00 Add a long Note [Deadlock detection] It describes the historical and modern designs and their trade-offs. The point is we've now unified the code for deadlock detection between the threaded and non-threaded ways, by changing the non-threaded to follow the same design as the threaded. - - - - - 4bc527ef by Duncan Coutts at 2025-11-15T21:18:34+00:00 Add a test for deadlock detection, issue #26408 - - - - - eb13b6d4 by Duncan Coutts at 2025-11-15T21:18:34+00:00 Update the user guide with the revised idle GC behaviour i.e. it's now not just for the threaded RTS, but general. Also document the fact that disabling idle GC also disables deadlock detection. - - - - - 33 changed files: - docs/users_guide/runtime_control.rst - rts/Capability.c - rts/IOManager.c - rts/IOManager.h - rts/IOManagerInternals.h - rts/PrimOps.cmm - rts/RaiseAsync.c - rts/RtsFlags.c - rts/RtsSignals.h - rts/RtsStartup.c - rts/RtsSymbols.c - rts/Schedule.c - rts/Schedule.h - rts/Timer.c - + rts/posix/FdWakeup.c - + rts/posix/FdWakeup.h - + rts/posix/MIO.c - + rts/posix/MIO.h - rts/posix/Poll.c - rts/posix/Poll.h - rts/posix/Select.c - rts/posix/Select.h - rts/posix/Signals.c - rts/posix/Signals.h - rts/posix/Timeout.c - rts/posix/Timeout.h - rts/rts.cabal - rts/win32/AwaitEvent.c - rts/win32/ConsoleHandler.c - rts/win32/ConsoleHandler.h - + testsuite/tests/rts/T26408.hs - + testsuite/tests/rts/T26408.stderr - testsuite/tests/rts/all.T The diff was not included because it is too large. View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/compare/7d601300898f314a7b9fe7164872cee... -- View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/compare/7d601300898f314a7b9fe7164872cee... You're receiving this email because of your account on gitlab.haskell.org.
participants (1)
-
Duncan Coutts (@dcoutts)