[GHC] #7653: incorrect handling of StackOverflow exception in the event manager

#7653: incorrect handling of StackOverflow exception in the event manager --------------------------+------------------------------------------------- Reporter: nus | Owner: Type: bug | Status: new Priority: normal | Component: libraries/base Version: 7.7 | Keywords: Os: Linux | Architecture: Unknown/Multiple Failure: Runtime crash | Blockedby: Blocking: | Related: --------------------------+------------------------------------------------- Under high pressure of `registerTimeout`s the event manager thread's stack overflows. Testcases: Shachaf initially reported this testcase on #ghc, which is irreproducible locally: {{{ import Control.Monad; import Control.Concurrent main = replicateM_ 1000000 (forkIO (threadDelay 1)) }}} to be compiled and run as follows: {{{ $ ghc -O2 -threaded Main.hs && time ./Main +RTS -N }}} Limiting the stack to the minimum helped to reproduce this locally, both on x86 and x64: {{{ import Control.Monad import Control.Concurrent main = replicateM_ 502 (forkIO (threadDelay 1)) -- 504 on x64 }}} {{{ $ ghc -O2 -threaded -with-rtsopts="-N8 -K4" rplfrk.hs && ./rplfrk # -K8 on x64 }}} and this, though less deterministically: {{{ import Control.Monad import Control.Concurrent main = replicateM_ 340 ( forkIO (threadDelay 1)) }}} {{{ $ ghc -O2 -threaded -rtsopts repl2-x86.hs && ./repl2-x86 +RTS -N1 -K4 }}} Error messages look like: {{{ Stack space overflow: current size 4 bytes. Use `+RTS -Ksize -RTS' to increase it. repl2-x86: sendWakeup: invalid argument (Bad file descriptor) [...repeated...] repl2-x86: ioManagerDie: write: Bad file descriptor }}} -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7653 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7653: incorrect handling of StackOverflow exception in the event manager --------------------------+------------------------------------------------- Reporter: nus | Owner: Type: bug | Status: new Priority: normal | Component: libraries/base Version: 7.7 | Keywords: Os: Linux | Architecture: Unknown/Multiple Failure: Runtime crash | Blockedby: Blocking: | Related: --------------------------+------------------------------------------------- Comment(by nus): `libraries\base\GHC\Event\Manager.hs`, the thunk accummulated in `f'`: {{{ registerTimeout :: EventManager -> Int -> TimeoutCallback -> IO TimeoutKey registerTimeout mgr us cb = do !key <- newUnique (emUniqueSource mgr) if us <= 0 then cb else do now <- getMonotonicTime let expTime = fromIntegral us / 1000000.0 + now -- We intentionally do not evaluate the modified map to WHNF here. -- Instead, we leave a thunk inside the IORef and defer its -- evaluation until mkTimeout in the event loop. This is a -- workaround for a nasty IORef contention problem that causes the -- thread-delay benchmark to take 20 seconds instead of 0.2. atomicModifyIORef (emTimeouts mgr) $ \f -> let f' = (Q.insert key expTime cb) . f in (f', ()) wakeManager mgr return $ TK key }}} is evaluated in `applyEdits`: {{{ step :: EventManager -> TimeoutQueue -> IO (Bool, TimeoutQueue) step mgr@EventManager{..} tq = do (timeout, q') <- mkTimeout tq I.poll emBackend timeout (onFdEvent mgr) state <- readIORef emState state `seq` return (state == Running, q') where -- | Call all expired timer callbacks and return the time to the -- next timeout. mkTimeout :: TimeoutQueue -> IO (Timeout, TimeoutQueue) mkTimeout q = do now <- getMonotonicTime applyEdits <- atomicModifyIORef emTimeouts $ \f -> (id, f) let (expired, q'') = let q' = applyEdits q in q' `seq` Q.atMost now q' sequence_ $ map Q.value expired let timeout = case Q.minView q'' of Nothing -> Forever Just (Q.E _ t _, _) -> -- This value will always be positive since the call -- to 'atMost' above removed any timeouts <= 'now' let t' = t - now in t' `seq` Timeout t' return (timeout, q'') }}} The main `loop :: EventManager -> IO ()` in `cleanup :: EventManager -> IO ()` uses: {{{ closeControl :: Control -> IO () closeControl w = do _ <- c_close . fromIntegral . controlReadFd $ w _ <- c_close . fromIntegral . controlWriteFd $ w #if defined(HAVE_EVENTFD) _ <- c_close . fromIntegral . controlEventFd $ w #else _ <- c_close . fromIntegral . wakeupReadFd $ w _ <- c_close . fromIntegral . wakeupWriteFd $ w #endif return () }}} which doesn't clean up the C side of `newControl :: IO Control`. -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7653#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7653: incorrect handling of StackOverflow exception in the event manager --------------------------+------------------------------------------------- Reporter: nus | Owner: Type: bug | Status: new Priority: normal | Component: libraries/base Version: 7.7 | Keywords: Os: Linux | Architecture: Unknown/Multiple Failure: Runtime crash | Blockedby: Blocking: | Related: --------------------------+------------------------------------------------- Comment(by nus): Another part of the problem is that we really shouldn't be trying to operate on an event manager if it's in the `Finished` or `Dying` states. A proper fix would require changing the API to reflect that in the return type of the operations using `wakeManager`. Is it worth the trouble? -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7653#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7653: incorrect handling of StackOverflow exception in the event manager --------------------------+------------------------------------------------- Reporter: nus | Owner: Type: bug | Status: patch Priority: normal | Component: libraries/base Version: 7.7 | Keywords: Os: Linux | Architecture: Unknown/Multiple Failure: Runtime crash | Blockedby: Blocking: | Related: --------------------------+------------------------------------------------- Changes (by nus): * status: new => patch -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7653#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7653: incorrect handling of StackOverflow exception in the event manager ---------------------------------+------------------------------------------ Reporter: nus | Owner: tibbe Type: bug | Status: patch Priority: normal | Milestone: Component: libraries/base | Version: 7.7 Keywords: | Os: Linux Architecture: Unknown/Multiple | Failure: Runtime crash Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Changes (by igloo): * cc: johan.tibell@… (added) * difficulty: => Unknown * owner: => tibbe Comment: Johan, could you take a look at these patches please? Thanks! -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7653#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7653: incorrect handling of StackOverflow exception in the event manager ---------------------------------+------------------------------------------ Reporter: nus | Owner: tibbe Type: bug | Status: patch Priority: normal | Milestone: Component: libraries/base | Version: 7.7 Keywords: | Os: Linux Architecture: Unknown/Multiple | Failure: Runtime crash Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Changes (by kazu-yamamoto): * cc: kazu@… (added) -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7653#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7653: incorrect handling of StackOverflow exception in the event manager ---------------------------------+------------------------------------------ Reporter: nus | Owner: tibbe Type: bug | Status: patch Priority: normal | Milestone: Component: libraries/base | Version: 7.7 Keywords: | Os: Linux Architecture: Unknown/Multiple | Failure: Runtime crash Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by kazu-yamamoto): First of all, I would like to discuss what is the best behavior when the stack of IO manager overflows. Note that in the current parallel IO manager, timer manager and IO managers are separated. We should apply this discussion to the timer manager. A comment to the first patch: even if we set manager's FD to -1, nobody refers to it. What is the purpose of this patch? -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7653#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7653: incorrect handling of StackOverflow exception in the event manager ---------------------------------+------------------------------------------ Reporter: nus | Owner: tibbe Type: bug | Status: patch Priority: normal | Milestone: Component: libraries/base | Version: 7.7 Keywords: | Os: Linux Architecture: Unknown/Multiple | Failure: Runtime crash Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by nus): Replying to [comment:6 kazu-yamamoto]:
First of all, I would like to discuss what is the best behavior when the stack of IO manager overflows. Perhaps the best would be not to cause stack overflows in the manager at all. At least the shutdown should be as gracious as possible.
Note that in the current parallel IO manager, timer manager and IO managers are separated. We should apply this discussion to the timer manager. Sure, the bug report and the patches were made before the new I/O manager merge. The code excerpts above that show how the thunk is accumulated are now only pertinent to 7.4 and 7.6 branches. While I'm not sure how (and if) the situation like this could be reproduced on the current HEAD, the concerns might still be applicable: 1. There're no counterparts for `c_setIOManagerControlFd` and `c_setIOManagerWakeupFd` of `newControl` in `closeControl`; 2. A situation might emerge (again, I'm not sure how, but still) when `wakeManager` would be passed an `EventManager` in the `Finished` state.
A comment to the first patch: even if we set manager's FD to -1, nobody
refers to it. What is the purpose of this patch? The RTS does, please have a look at `ioManagerWakeup` in `rts/posix/Signals.c` (and `wakeUpRts` in `rts/Schedule.c`). -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7653#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7653: incorrect handling of StackOverflow exception in the event manager ---------------------------------+------------------------------------------ Reporter: nus | Owner: tibbe Type: bug | Status: patch Priority: normal | Milestone: Component: libraries/base | Version: 7.7 Keywords: | Os: Linux Architecture: Unknown/Multiple | Failure: Runtime crash Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by kazu-yamamoto): Thank you for explanation. Your patch can be applied to GHC/Event/TimerManager.hs. Unfortunately, the following error happens with/without your patches: {{{ rplfrk: user error (Pattern match failure in do expression at libraries/base GHC/Event/Thread.hs:212:3-10) }}} The code is here: {{{ getSystemTimerManager :: IO TM.TimerManager getSystemTimerManager = do Just mgr <- readIORef timerManager return mgr }}} What should getSystemTimerManager do when readIORef returns Nothing? -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7653#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7653: incorrect handling of StackOverflow exception in the event manager ---------------------------------+------------------------------------------ Reporter: nus | Owner: tibbe Type: bug | Status: patch Priority: normal | Milestone: Component: libraries/base | Version: 7.7 Keywords: | Os: Linux Architecture: Unknown/Multiple | Failure: Runtime crash Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Comment(by nus): I should clarify that the patches were meant for the old I/O manager, that is the one that's currently in the 7.4 and 7.6 branches, and only the 7.6 and pre-merge HEAD builds were validated. The new I/O manager situation needs more insight. -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7653#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7653: incorrect handling of StackOverflow exception in the event manager ---------------------------------+------------------------------------------ Reporter: nus | Owner: tibbe Type: bug | Status: patch Priority: normal | Milestone: 7.8.1 Component: libraries/base | Version: 7.7 Keywords: | Os: Linux Architecture: Unknown/Multiple | Failure: Runtime crash Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ Changes (by igloo): * milestone: => 7.8.1 -- Ticket URL: http://hackage.haskell.org/trac/ghc/ticket/7653#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#7653: incorrect handling of StackOverflow exception in the event manager
-----------------------------+----------------------------------------------
Reporter: nus | Owner: tibbe
Type: bug | Status: closed
Priority: normal | Milestone: 7.8.1
Component: libraries/base | Version: 7.7
Resolution: fixed | Keywords:
Os: Linux | Architecture: Unknown/Multiple
Failure: Runtime crash | Difficulty: Unknown
Testcase: | Blockedby:
Blocking: | Related:
-----------------------------+----------------------------------------------
Changes (by igloo):
* status: patch => closed
* resolution: => fixed
Comment:
Fixed by
{{{
commit e843e73690f828498f6e33bb89f47a50c3ab2ac9
Author: Ian Lynagh
participants (2)
-
GHC
-
GHC