Increasing capabilities dramatically increases execution time

Hi friends,

I have encountered a situation in a concurrent program where I see an unexpected, dramatic increase in execution time when I add additional capabilities. On a multi-core CPU, "-N1" is fastest by an order of magnitude, and the program slows progressively as the number of capabilities increases (still fewer than the number of real cores, of course). My question is: why is this so, and what can I do to improve it?

Some details. I have a shared data structure that I call a "store", which is basically a strict HashMap String Value. This structure is shared between Haskell threads using an IORef/MVar pair:

    data Store = Store (IORef (HashMap Key Value)) (MVar ())

Focus on the IORef/MVar pair; the HashMap itself is not important. My intention is that read-only transactions can take the pure value straight from the IORef without blocking writers or other readers, whereas mutating transactions (those that will update the IORef when they complete) are serialized using the MVar. Alternatively, you can regard a read-only transaction as working with a private snapshot of the store that is discarded after it completes. It is my hope that this technique will allow my program to exploit a multi-core CPU by running several read-only transactions and at most one mutating transaction in parallel. I am also assuming that this technique is marginally more efficient than STM for this use case, especially for write-heavy loads, where I assume STM would waste some time on retries (I did not test this).

    -- | Execute a read-only transaction that returns a value.
    withStore :: Store -> (HashMap Key Value -> a) -> IO a
    withStore (Store ioref _) f = do
      store <- readIORef ioref
      return (f store)

    -- | Execute a transaction that updates the store and returns a value.
    modifyStore :: Store
                -> (HashMap Key Value -> (HashMap Key Value, a))
                -> IO a
    modifyStore (Store ioref mvar) f =
      modifyMVar mvar $ \z -> do
        store <- readIORef ioref
        let (store', x) = f store
        store' `seq` writeIORef ioref store'
        return (z, x)

Stop me right here if this is a silly way to do this.

I have a test that forks 4 threads, each of which increments a counter in the store 100000 times, with the expectation that the final answer is 400000. I use the "async" package for this. This test is not necessarily pathological: I expect simple operations like incrementing counters and concatenating text to be typical transactions.

    threads <- replicateM 4 $ async $
      replicateM_ 100000 $
        modifyStore store (...increment a counter...)
    forM_ threads wait

In this test, while any thread is busy modifying the store, the other three are blocked on the empty MVar, irrespective of how many capabilities I have. Therefore, I expect the execution time with -N4 to be similar to -N1, with the only difference attributable to the runtime's scheduling overheads, which I assume are relatively insignificant. Instead, I see a dramatic increase in execution time with increasing capabilities (typical measurements below).

By the way, I say I assume the scheduling overheads are "relatively insignificant" compared to incrementing the counter because in my program the counter is incremented by interpreting an imperative EDSL, which I assume is relatively inefficient compared to, say, "succ", but perhaps I am mistaken. In a way, my whole question probably centres around this assumption.

I would be grateful if anyone could illuminate the reason for this dramatic increase in execution time when I increase the number of capabilities, and offer any hints on how to mitigate it.
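In case it helps, here is a minimal, self-contained sketch of the benchmark. I am assuming Key = String and Value = Int purely for illustration; in my real program the Value is richer and the increment goes through the EDSL interpreter.

    import Control.Concurrent.Async (async, wait)
    import Control.Concurrent.MVar
    import Control.Monad (forM_, replicateM, replicateM_)
    import Data.IORef
    import qualified Data.HashMap.Strict as HM

    type Key   = String  -- illustrative assumption
    type Value = Int     -- illustrative assumption

    data Store = Store (IORef (HM.HashMap Key Value)) (MVar ())

    -- Create a store containing a single counter.
    newStore :: IO Store
    newStore = Store <$> newIORef (HM.singleton "counter" 0) <*> newMVar ()

    -- Mutating transactions are serialized by the MVar; the new map is
    -- forced to WHNF before being published via the IORef.
    modifyStore :: Store
                -> (HM.HashMap Key Value -> (HM.HashMap Key Value, a))
                -> IO a
    modifyStore (Store ioref mvar) f =
      modifyMVar mvar $ \z -> do
        store <- readIORef ioref
        let (store', x) = f store
        store' `seq` writeIORef ioref store'
        return (z, x)

    main :: IO ()
    main = do
      store@(Store ioref _) <- newStore
      -- Fork four threads, each incrementing the counter 100000 times.
      threads <- replicateM 4 $ async $
        replicateM_ 100000 $
          modifyStore store $ \hm -> (HM.adjust (+ 1) "counter" hm, ())
      forM_ threads wait
      final <- readIORef ioref
      print (HM.lookup "counter" final)  -- expected: Just 400000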
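For comparison, the STM formulation I decided against (and, as I said, did not test) would be roughly the following sketch, reusing the illustrative Key and Value types from above and holding the whole map in a single TVar:

    import Control.Concurrent.STM
      (TVar, atomically, newTVarIO, readTVar, readTVarIO, writeTVar)
    import qualified Data.HashMap.Strict as HM

    newtype StoreSTM = StoreSTM (TVar (HM.HashMap Key Value))

    newStoreSTM :: IO StoreSTM
    newStoreSTM = StoreSTM <$> newTVarIO (HM.singleton "counter" 0)

    -- | Read-only transaction: a snapshot read with no locking.
    withStoreSTM :: StoreSTM -> (HM.HashMap Key Value -> a) -> IO a
    withStoreSTM (StoreSTM tvar) f = f <$> readTVarIO tvar

    -- | Mutating transaction: conflicting writers are retried by STM
    -- rather than queued on an MVar.
    modifyStoreSTM :: StoreSTM
                   -> (HM.HashMap Key Value -> (HM.HashMap Key Value, a))
                   -> IO a
    modifyStoreSTM (StoreSTM tvar) f = atomically $ do
      store <- readTVar tvar
      let (store', x) = f store
      store' `seq` writeTVar tvar store'
      return x

My assumption was that the retries in modifyStoreSTM make this marginally more expensive than the MVar version under a write-heavy load, but again, I did not measure it.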
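The measurements below are the RTS statistics printed by the -s RTS option. A recipe along the following lines reproduces them, assuming the program lives in Main.hs (I add -rtsopts here so that -N and -s are accepted at run time):

    $ ghc -O -threaded -rtsopts Main.hs
    $ ./Main +RTS -N1 -s
    $ ./Main +RTS -N2 -s
    $ ./Main +RTS -N4 -s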
I am using GHC 7.10.2 and compile with -O -threaded. All library versions are as at Stackage LTS 3.2.

A typical measurement, with -N1:

                                      Tot time (elapsed)  Avg pause  Max pause
  Gen  0       516 colls,     0 par    0.000s   0.003s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.000s   0.001s     0.0003s    0.0003s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.136s  (  0.139s elapsed)
  GC      time    0.000s  (  0.003s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    0.136s  (  0.144s elapsed)

With -N2:

                                      Tot time (elapsed)  Avg pause  Max pause
  Gen  0       334 colls,   334 par    0.012s   0.006s     0.0000s    0.0001s
  Gen  1         2 colls,     1 par    0.000s   0.000s     0.0002s    0.0003s

  Parallel GC work balance: 39.33% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time    2.032s  (  2.456s elapsed)
  GC      time    0.012s  (  0.007s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    2.044s  (  2.465s elapsed)

With -N4:

                                      Tot time (elapsed)  Avg pause  Max pause
  Gen  0       133 colls,   133 par    0.032s   0.005s     0.0000s    0.0001s
  Gen  1         2 colls,     1 par    0.000s   0.001s     0.0003s    0.0003s

  Parallel GC work balance: 41.33% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time    3.516s  (  4.431s elapsed)
  GC      time    0.032s  (  0.005s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    3.548s  (  4.439s elapsed)

Thanks,
Thomas Koster

Hi Thomas

On 2016-01-22, Thomas Koster wrote:
> Hi friends,
>
> I have encountered a situation in a concurrent program where I see an
> unexpected, dramatic increase in execution time when I add additional
> capabilities. On a multi-core CPU, "-N1" is fastest by an order of
> magnitude, and the program slows progressively as the number of
> capabilities increases (still fewer than the number of real cores, of
> course).
<snip/>
Maybe post this to the café list at:
https://mail.haskell.org/mailman/listinfo/haskell-cafe

--
Greetings
Elias

On 2016-01-22, Thomas Koster wrote:
> I have encountered a situation in a concurrent program where I see an
> unexpected, dramatic increase in execution time when I add additional
> capabilities. On a multi-core CPU, "-N1" is fastest by an order of
> magnitude, and the program slows progressively as the number of
> capabilities increases (still fewer than the number of real cores, of
> course).
<snip/>

On 27 January 2016 at 18:07, Elias Diem wrote:
> Maybe post this to the café list at:

Thanks, Elias. This is being discussed there in the thread
"When are MVars better than STM?".

https://mail.haskell.org/pipermail/haskell-cafe/2016-January/122785.html

--
Thomas Koster