[GHC] #15136: High CPU when asynchronous exception and unblocking retry on TVar raced

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: (none) Type: bug | Status: new Priority: highest | Milestone: 8.6.1 Component: Runtime | Version: 8.4.2 System | Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: Runtime crash Unknown/Multiple | Test Case: | Blocked By: Blocking: | Related Tickets: Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- Detail: https://github.com/nshimaza/race-tmvar-async-exception Runtime falls into high CPU under racing condition between async exception and unblocking retry on TVar. * Reproduces with +RTS -Nx where x > 1 * Does NOT reproduce with +RTS -N1 * Program stalls at `killThread` * High CPU based on given -Nx * CPU won't be 100% if you gave x smaller than available hardware threads of your platform. * Does NOT reproduce if TVar/retry is replaced by MVar * Reproduced with GHC 8.4.2 (macOS High Sierra (10.13.4)) * Reproduced with GHC 8.4.2 (Docker for Mac Version 18.03.1-ce-mac65) * Reproduced with ghc-8.5.20180506 (Docker for Mac Version 18.03.1-ce- mac65) Minimal reproducing code here. (You can find more verbose code on the above github repo.) {{{#!hs main :: IO () main = do let volume = 1000 forM_ [1..1000] $ \i -> do putStrFlush $ show i ++ " " -- Spawn massive number of threads. threads <- replicateM volume $ do trigger <- newTVarIO False tid <- forkIO $ void $ atomically $ do t <- readTVar trigger if t then pure t else retry pure (trigger, tid) -- Make sure all threads are spawned. threadDelay 30000 -- Let threads start to exit normally. forkIO $ forM_ threads $ \(trigger, _) -> threadDelay 1 *> atomically (writeTVar trigger True) -- Concurrently kill threads in order to create race. -- TMVar operation and asynchronous exception can hit same thread simultaneously. -- Adjust threadDelay if you don't reproduce very well. threadDelay 1000 forM_ threads $ \(_, tid) -> do putCharFlush 'A' killThread tid -- When the issue reproduced, this killThread doesn't return. putCharFlush '\b' }}} This program intentionally creates race condition between asynchronous exception and unblocking operation of `retry` on TVar. From one side, a `writeTVar trigger True` is attempted from external thread while target thread is blocking at `retry` on the same `TVar`. On the other side, an asynchronous exception `ThreadKilled` is thrown by yet another external thread to the same target thread. In other word, it attempts to kill a thread about to unblock. I guess when the above two operation hit the same thread at the same time in parallel in SMP environment, GHC runtime falls into high CPU. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: (none) Type: bug | Status: new Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): Thanks for the small repro! Indeed I am able to easily reproduce this. The racing threads look like this, {{{ Thread 8 (LWP 25960): #0 0x000000000049c0df in stmCommitTransaction () #1 0x00000000004a9de4 in stg_atomically_frame_info () #2 0x0000000000000000 in ?? () Thread 1 (LWP 25948): #0 0x000000000049b290 in lock_tvar () #1 0x000000000049b559 in remove_watch_queue_entries_for_trec () #2 0x000000000049b95b in stmAbortTransaction () #3 0x000000000049984c in raiseAsync () #4 0x0000000000499ba3 in throwToMsg () #5 0x0000000000499e82 in throwTo () #6 0x00000000004a6aac in stg_killThreadzh () #7 0x0000000000000000 in ?? () }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: (none) Type: bug | Status: new Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by carlostome): After some investigation I found that indeed there is a deadlock between the mutexes of a TSO and a TVar. An excerpt of the debugging log is the follwing: {{{ Task 2: cap 1: running thread 7012 (ThreadRunGHC) Task 2: STM: 0xad6d08 : stmStartTransaction with 316 tokens Task 2: STM: 0xad6d08 : stmStartTransaction()=0x4200224000 Task 2: STM: 0x4200224000 : stmWriteTVar(0x4200225c68, 0xad5e72) Task 2: STM: 0x4200224000 : get_entry_for TVar 0x4200225c68 Task 2: STM: 0x4200224000 : FOR_EACH_ENTRY, current_chunk=0x420027e4d0 limit=0 Task 2: STM: 0x4200224000 : read_current_value(0x4200225c68)=0xad5e69 Task 2: STM: 0x4200224000 : stmWriteTVar done Task 2: STM: 0x4200224000 : stmCommitTransaction() Task 2: STM: 0x4200224000 : lock_stm() Task 2: STM: 0x4200224000 : FOR_EACH_ENTRY, current_chunk=0x420027e4d0 limit=1 Task 2: STM: 0x4200224000 : trying to acquire 0x4200225c68 Task 2: STM: 0x4200224000 : cond_lock_tvar(0x4200225c68, 0xad5e69) -- (1) Task 2: STM: 0x4200224000 : success Task 2: STM: 0x4200224000 : doing read check Task 2: STM: 0x4200224000 : FOR_EACH_ENTRY, current_chunk=0x420027e4d0 limit=1 Task 2: STM: 0x4200224000 : read-check succeeded Task 2: STM: 0x4200224000 : FOR_EACH_ENTRY, current_chunk=0x420027e4d0 limit=1 Task 2: STM: 0x4200224000 : writing 0xad5e72 to 0x4200225c68, waking waiters -- (2) Task 2: STM: unpark_waiters_on tvar=0x4200225c68 Task 2: STM: unpark_waiters_on tvar=0x4200225c68, status=0x4200223b80 Task 2: STM: cap 1: unpark_waiters_on tvar=0x4200225c68, status2=0x4200223b80 -- (3) Task 2: STM: cap 1: unpark_tso id 6866 -- (4) Task 1: BlockedOnSTM: lockTSO id 6866 Task 1: cap 0 BlockedOnSTM: raiseAsync id 6866 Task 1: cap 0: raising exception in thread 6866. Task 1: cap 0: raiseAsync: freezing atomically frame -- (5) Task 1: STM: 0x4200225f58 : stmAbortTransaction -- (6) Task 1: STM: 0x4200225f58 : lock_stm() Task 1: STM: 0x4200225f58 : aborting top-level transaction cap 0 Task 1: STM: 0x4200225f58 : aborting top-level transaction Task 1: STM: 0x4200225f58 : stmAbortTransaction aborting waiting transaction Task 1: STM: 0x4200225f58 : remove_watch_queue_entries_for_trec() -- (7) Task 1: STM: 0x4200225f58 : FOR_EACH_ENTRY, current_chunk=0x4200223cc8 limit=1 Task 1: STM: 0x4200225f58 : lock_tvar (0x4200225c68) -- (8) }}} What is happening here has to do with how async exceptions are implemented. The thread 6866 is waiting on the TVar 0x4200225c68. In the meantime the thread 7012 is able to change the value of such TVar to True. To do so, first it has to acquire the lock for the TVar (1), then actually change its value (2), and finally wake any thread that was waiting on this TVar (3). To wake up a concrete waiter it needs to hold its mutex through lock_tso() (4). The problem is that when an async exception is thrown to a thread whose status is BlockedOnSTM, the first thing the thread throwing the exception does is to grab the mutex of the target thread with lock_tso() (https://github.com/ghc/ghc/blob/93220d46fceabf3afeae36f1fda94e1698c3639a/rts...), then because the target thread was in the middle of an STM transaction (5) that dependes on the same TVar, it has to abort it (6) for what it has to be removed from the TVar watch queue (7) needing the mutex on the TVar to do so (8). Basically the thread completing the STM grabs first the TVar lock and then the TSO lock while for the async exception the locks are taken in reverse order. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: (none) Type: bug | Status: new Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by carlostome): * cc: carlostome (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: (none) Type: bug | Status: new Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by simonmar): * cc: simonmar (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: (none) Type: bug | Status: new Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonmar): Thanks for the excellent diagnosis @bgamari. I'm not sure what the best fix is yet, but there is something distinctly strange here. `lockTSO` and `unlockTSO` are only used to synchronise for this particular case, between `raiseAsync` and `unpark_tso`. In all other cases, a TSO has a clear owner - in the case of a blocked TSO, the owner is usually the object on which the TSO is blocked, e.g. an MVar. Perhaps we should switch to using an owner semantics for BlockedOnSTM too - that is, if we see `BlockedOnSTM` in `raiseAsync`, we attempt to lock the `TVar` pointed to by `tso->block_info`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: (none) Type: bug | Status: new Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): Just to make sure credit is given where credit is due, carlostome is the one responsible for the diagnosis (which is indeed excellent; great work carlostome!) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: (none) Type: bug | Status: new Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): We will make a best-effort to look at this for 8.6. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: osa1 Type: bug | Status: new Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by osa1): * owner: (none) => osa1 Comment: I'll try implementing comment:5 see how it goes. This code path is not tested enough so I'll also try to port some of concurrency tests that use MVar and async exceptions (exercising `raiseAsync()`) to STM. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: osa1 Type: bug | Status: new Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by osa1): Just to repeat comment:1 and comment:2 with my words: Thread 1 kills Thread 2 which is blocked on a TVar operation. For this it calls raiseAsync() and for that it has to lock Thread 2 (lockTSO). Then to abort the transaction it needs to lock the TVar (lock_tvar). At the same time Thread 3 succeeds to modify the TVar and to unblock any threads blocked on this TVar it needs to lock the TVar (lock_tvar), and then to actually unblock the thread it needs to lock the TSO (lockTSO). When the order of locking goes like this: - Thread 1 locks the TSO (lockTSO) - Thread 3 locks the TVar (lock_tvar) We get a deadlock because Thread 1 now wants to lock the TVar and Thread 3 wants to lock the TSO, both of which are locked already.
Perhaps we should switch to using an owner semantics for BlockedOnSTM too - that is, if we see BlockedOnSTM in raiseAsync, we attempt to lock the TVar pointed to by tso->block_info.
I only get `END_TSO_QUEUE` in `tso->block_info`. I think the TVar is only reachable from the array list `tso->trec->current_chunk`. I guess we could do this: - Lock the TSO - If BlockedOnSTM then check tso->trec entries. Expect to see only one TVar there (can we have more than on TVars here?). Lock the TVar and release the TSO. - Continue with raiseAsync() I don't know if we can see more than one TVar in tso->trec entries. Also, we need to modify stmAbortTransaction because we'll have the TVar locked already, but it still needs to lock it when it's called from other call sites (e.g. from `raise#`). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: osa1 Type: bug | Status: new Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by osa1): I documented why we can't apply ownership semantics of MVars to TVars (as described in comment:5) in Phab:D4954. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: osa1 Type: bug | Status: patch Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4956 Wiki Page: | -------------------------------------+------------------------------------- Changes (by osa1): * status: new => patch * differential: => Phab:D4956 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced
-------------------------------------+-------------------------------------
Reporter: nshimaza | Owner: osa1
Type: bug | Status: patch
Priority: highest | Milestone: 8.6.1
Component: Runtime System | Version: 8.4.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: Runtime crash | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D4956
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: osa1 Type: bug | Status: merge Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4956 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * status: patch => merge -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: osa1 Type: bug | Status: merge Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4956 Wiki Page: | -------------------------------------+------------------------------------- Comment (by osa1): Phab:D4961 is a continuation of the commit above and should probably be merged with it. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced
-------------------------------------+-------------------------------------
Reporter: nshimaza | Owner: osa1
Type: bug | Status: merge
Priority: highest | Milestone: 8.6.1
Component: Runtime System | Version: 8.4.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: Runtime crash | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D4956
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: osa1 Type: bug | Status: merge Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4956 Wiki Page: | -------------------------------------+------------------------------------- Comment (by nshimaza): Let me report that my repro code no longer reproduces the issue with ghc-8.7.20180715 in my test environment! (Platform is Docker for Mac Version 18.03.1-ce-mac65 (24312).) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#15136: High CPU when asynchronous exception and unblocking retry on TVar raced -------------------------------------+------------------------------------- Reporter: nshimaza | Owner: osa1 Type: bug | Status: closed Priority: highest | Milestone: 8.6.1 Component: Runtime System | Version: 8.4.2 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime crash | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D4956 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * status: merge => closed * resolution: => fixed Comment: Thanks for the confirmation! comment:15 merged with c15ba1fb83575081bb6a1d4955f9e13626ef8d51. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15136#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC