
Hi friends, Using Criterion, I have been running benchmarks to measure the relative performance of STM and MVars for some simple transactions that I expect will be typical in my application. I am using GHC 7.10.2 and libraries as at Stackage LTS 3.2. I have found that STM is faster than MVars in all my benchmarks, without exception. This seems to go against accepted wisdom [1][2][3]. I have not included my source code here to save space, but if you suspect that I am using MVars incorrectly, just say so and I will post my source code separately. I have two questions: 1. When are MVars faster than STM? If the answer is "never", then when are MVars "better" than STM? (Choose your own definition of "better".) 2. When given two capabilities (+RTS -N2), MVars are suddenly an order of magnitude slower than with just one capability. Why? For those who want details: My benchmark forks four Haskell threads. Each thread repeats a transaction that increments a shared counter many, many times. These transactions must be serialized. The counter is therefore highly contended. One version uses an MVar to store the counter in the obvious way. The other version uses a TVar instead. By the way, simply using "atomic-primops" to increment the counter won't do because the increment operation is actually a mock substitute for a more complex operation. I use the counter for my benchmarks because the real operation needs much more memory and I don't want the additional, unpredictable GC cost to affect my measurements. Typical measurements are: 1 capability, using MVar: 37.30 ms 1 capability, using TVar: 24.88 ms 2 capabilities, using MVar: 1.564 s 2 capabilities, using TVar: 80.09 ms 4 capabilities, using MVar: 2.890 s 4 capabilities, using TVar: 207.8 ms Notice that the MVar version suddenly slows by an order of magnitude when run with more than one capability. Why is this so? (This is question 2.) Despite the absolute time elapsed, I realize that the CPU usage characteristics of the two versions are also quite different. I realize that the MVar version interlocks the four threads so that only one capability is ever busy at a time, irrespective of the number of capabilities available, whereas the STM version allows up to four capabilities to be busy at once. However, I believe that the additional parallel transactions in the STM version would be mostly wasted, destined to be retried. Unless I am mistaken, this assumption appears to be consistent with the observation that the STM version with -N1 is the fastest of all. Despite all this wasted work by the thundering herd, the total CPU time (i.e. my power bill) for the STM version is still less than for the MVar version, because the MVar version so dramatically slow. Paradoxically, MVars seem to be the wrong tool for this job. So when are MVars faster than STM? (This is question 1.) [1] https://stackoverflow.com/questions/15439966/when-why-use-an-mvar-over-a-tva... [2] https://www.reddit.com/r/haskell/comments/39ef3y/ioref_vs_mvar_vs_tvar_vs_tm... [3] https://mail.haskell.org/pipermail/haskell-cafe/2014-January/112158.html Thanks, Thomas Koster

Could you post the code please?
On Sun, Jan 24, 2016 at 12:46 AM, Thomas Koster
Hi friends,
Using Criterion, I have been running benchmarks to measure the relative performance of STM and MVars for some simple transactions that I expect will be typical in my application. I am using GHC 7.10.2 and libraries as at Stackage LTS 3.2.
I have found that STM is faster than MVars in all my benchmarks, without exception. This seems to go against accepted wisdom [1][2][3]. I have not included my source code here to save space, but if you suspect that I am using MVars incorrectly, just say so and I will post my source code separately.
I have two questions:
1. When are MVars faster than STM? If the answer is "never", then when are MVars "better" than STM? (Choose your own definition of "better".)
2. When given two capabilities (+RTS -N2), MVars are suddenly an order of magnitude slower than with just one capability. Why?
For those who want details:
My benchmark forks four Haskell threads. Each thread repeats a transaction that increments a shared counter many, many times. These transactions must be serialized. The counter is therefore highly contended. One version uses an MVar to store the counter in the obvious way. The other version uses a TVar instead.
By the way, simply using "atomic-primops" to increment the counter won't do because the increment operation is actually a mock substitute for a more complex operation. I use the counter for my benchmarks because the real operation needs much more memory and I don't want the additional, unpredictable GC cost to affect my measurements.
Typical measurements are:
1 capability, using MVar: 37.30 ms 1 capability, using TVar: 24.88 ms 2 capabilities, using MVar: 1.564 s 2 capabilities, using TVar: 80.09 ms 4 capabilities, using MVar: 2.890 s 4 capabilities, using TVar: 207.8 ms
Notice that the MVar version suddenly slows by an order of magnitude when run with more than one capability. Why is this so? (This is question 2.)
Despite the absolute time elapsed, I realize that the CPU usage characteristics of the two versions are also quite different. I realize that the MVar version interlocks the four threads so that only one capability is ever busy at a time, irrespective of the number of capabilities available, whereas the STM version allows up to four capabilities to be busy at once. However, I believe that the additional parallel transactions in the STM version would be mostly wasted, destined to be retried. Unless I am mistaken, this assumption appears to be consistent with the observation that the STM version with -N1 is the fastest of all. Despite all this wasted work by the thundering herd, the total CPU time (i.e. my power bill) for the STM version is still less than for the MVar version, because the MVar version so dramatically slow.
Paradoxically, MVars seem to be the wrong tool for this job. So when are MVars faster than STM? (This is question 1.)
[1] https://stackoverflow.com/questions/15439966/when-why-use-an-mvar-over-a-tva... [2] https://www.reddit.com/r/haskell/comments/39ef3y/ioref_vs_mvar_vs_tvar_vs_tm... [3] https://mail.haskell.org/pipermail/haskell-cafe/2014-January/112158.html
Thanks, Thomas Koster _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
-- Chris Allen Currently working on http://haskellbook.com

On Sun, Jan 24, 2016 at 12:46 AM, Thomas Koster
Using Criterion, I have been running benchmarks to measure the relative performance of STM and MVars for some simple transactions that I expect will be typical in my application. I am using GHC 7.10.2 and libraries as at Stackage LTS 3.2.
I have found that STM is faster than MVars in all my benchmarks, without exception. This seems to go against accepted wisdom [1][2][3]. I have not included my source code here to save space, but if you suspect that I am using MVars incorrectly, just say so and I will post my source code separately.
I have two questions:
1. When are MVars faster than STM? If the answer is "never", then when are MVars "better" than STM? (Choose your own definition of "better".)
2. When given two capabilities (+RTS -N2), MVars are suddenly an order of magnitude slower than with just one capability. Why?
On 24 January 2016 at 17:55, Christopher Allen
Could you post the code please?
module Main (main) where import Control.Concurrent.Async import Control.Concurrent.MVar import Control.Concurrent.STM import Control.Monad import Criterion.Main main = defaultMain [ bgroup "thrash" [ bench "MVar" $ whnfIO (thrashTest mvarNew mvarInc mvarGet), bench "TVar" $ whnfIO (thrashTest tvarNew tvarInc tvarGet) ] ] thrashTest :: IO a -> (a -> IO ()) -> (a -> IO b) -> IO b thrashTest new inc get = do var <- new threads <- replicateM 4 (async (replicateM_ 100000 $ inc var)) forM_ threads wait get var mvarNew :: IO (MVar Int) mvarNew = newMVar 0 mvarInc :: MVar Int -> IO () mvarInc var = modifyMVar_ var $ \ i -> return $! succ i mvarGet :: MVar Int -> IO Int mvarGet = readMVar tvarNew :: IO (TVar Int) tvarNew = newTVarIO 0 tvarInc :: TVar Int -> IO () tvarInc var = atomically $ do i <- readTVar var writeTVar var $! succ i tvarGet :: TVar Int -> IO Int tvarGet = readTVarIO -- Thomas Koster

Hi Thomas,
I'm sorry I don't have time right now for a proper response (buried under
paper deadlines). There are certainly times when one will be faster then
the other and the reasons are quite complicated. To complicate matters
further it is very difficult to get benchmarks that don't lie about
performance in this space. There are also alternative implementations that
change the balance drastically. The only broad advice I can give is to
benchmark the target application with both implementations to see how all
the implications fall out. A broad description of the differences in
implementation would be that MVars have a fairness guarantee (that does not
come for free) for waking waiting threads. STM does not have this fairness
which can lead to problems for programs that have quick transactions that
always win over occasional long transactions (there are ways to avoid with
a different implementation or with the cost of shifted to the programmer).
My guess is in your particular benchmark the unfairness of STM works to
your advantage and all the work is happening sequentially while the MVar
version's fairness incurs frequent cache misses.
Ryan
On Sun, Jan 24, 2016 at 2:13 AM, Thomas Koster
On Sun, Jan 24, 2016 at 12:46 AM, Thomas Koster
wrote: Using Criterion, I have been running benchmarks to measure the relative performance of STM and MVars for some simple transactions that I expect will be typical in my application. I am using GHC 7.10.2 and libraries as at Stackage LTS 3.2.
I have found that STM is faster than MVars in all my benchmarks, without exception. This seems to go against accepted wisdom [1][2][3]. I have not included my source code here to save space, but if you suspect that I am using MVars incorrectly, just say so and I will post my source code separately.
I have two questions:
1. When are MVars faster than STM? If the answer is "never", then when are MVars "better" than STM? (Choose your own definition of "better".)
2. When given two capabilities (+RTS -N2), MVars are suddenly an order of magnitude slower than with just one capability. Why?
On 24 January 2016 at 17:55, Christopher Allen
wrote: Could you post the code please?
module Main (main) where
import Control.Concurrent.Async import Control.Concurrent.MVar import Control.Concurrent.STM import Control.Monad import Criterion.Main
main = defaultMain [ bgroup "thrash" [ bench "MVar" $ whnfIO (thrashTest mvarNew mvarInc mvarGet), bench "TVar" $ whnfIO (thrashTest tvarNew tvarInc tvarGet) ] ]
thrashTest :: IO a -> (a -> IO ()) -> (a -> IO b) -> IO b thrashTest new inc get = do var <- new threads <- replicateM 4 (async (replicateM_ 100000 $ inc var)) forM_ threads wait get var
mvarNew :: IO (MVar Int) mvarNew = newMVar 0
mvarInc :: MVar Int -> IO () mvarInc var = modifyMVar_ var $ \ i -> return $! succ i
mvarGet :: MVar Int -> IO Int mvarGet = readMVar
tvarNew :: IO (TVar Int) tvarNew = newTVarIO 0
tvarInc :: TVar Int -> IO () tvarInc var = atomically $ do i <- readTVar var writeTVar var $! succ i
tvarGet :: TVar Int -> IO Int tvarGet = readTVarIO
-- Thomas Koster _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

Ryan,
On Sun, Jan 24, 2016 at 12:46 AM, Thomas Koster
Using Criterion, I have been running benchmarks to measure the relative performance of STM and MVars for some simple transactions that I expect will be typical in my application. I am using GHC 7.10.2 and libraries as at Stackage LTS 3.2.
I have found that STM is faster than MVars in all my benchmarks, without exception. This seems to go against accepted wisdom [1][2][3]. I have not included my source code here to save space, but if you suspect that I am using MVars incorrectly, just say so and I will post my source code separately.
I have two questions:
1. When are MVars faster than STM? If the answer is "never", then when are MVars "better" than STM? (Choose your own definition of "better".)
2. When given two capabilities (+RTS -N2), MVars are suddenly an order of magnitude slower than with just one capability. Why?
On 25 January 2016 at 01:04, Ryan Yates
I'm sorry I don't have time right now for a proper response (buried under paper deadlines). There are certainly times when one will be faster then the other and the reasons are quite complicated. To complicate matters further it is very difficult to get benchmarks that don't lie about performance in this space. There are also alternative implementations that change the balance drastically. The only broad advice I can give is to benchmark the target application with both implementations to see how all the implications fall out.
That is fair. From what I can tell, the time spent in the runtime dominates my user time, so I am basically benchmarking the GHC runtime, which I am not qualified to do :) I had only hoped to be able to decide on MVar vs STM before getting into the nitty gritty.
A broad description of the differences in implementation would be that MVars have a fairness guarantee (that does not come for free) for waking waiting threads. STM does not have this fairness which can lead to problems for programs that have quick transactions that always win over occasional long transactions (there are ways to avoid with a different implementation or with the cost of shifted to the programmer). My guess is in your particular benchmark the unfairness of STM works to your advantage and all the work is happening sequentially while the MVar version's fairness incurs frequent cache misses.
Fairness may actually be very important to my application. Unlike my benchmark, the complexity of real transactions can vary enormously. Let me think about this. Thanks for your response. -- Thomas Koster

On Sun, 2016-01-24 at 17:46 +1100, Thomas Koster wrote:
2. When given two capabilities (+RTS -N2), MVars are suddenly an order of magnitude slower than with just one capability. Why?
One possible explanation is closure locking which is not performed when there is only one capability. In my quick measurements it gives 40% speedup: https://ghc.haskell.org/trac/ghc/ticket/693#comment:9

Yuras, On Sun, 2016-01-24 at 17:46 +1100, Thomas Koster wrote:
2. When given two capabilities (+RTS -N2), MVars are suddenly an order of magnitude slower than with just one capability. Why?
On 25 January 2016 at 02:09, Yuras Shumovich
One possible explanation is closure locking which is not performed when there is only one capability. In my quick measurements it gives 40% speedup: https://ghc.haskell.org/trac/ghc/ticket/693#comment:9
This makes sense. After all, why bother with locks and barriers when the process is single-threaded anyway? Thanks for your response. -- Thomas Koster

Well, there are cases where even with single-threading you want a memory
barrier to prevent the CPU reordering instructions, but the shift to a
single-threaded runtime should elide _some_ locks expressly designed to
cope with multithreading.
On Sun, Jan 24, 2016 at 5:04 PM, Thomas Koster
Yuras,
On Sun, 2016-01-24 at 17:46 +1100, Thomas Koster wrote:
2. When given two capabilities (+RTS -N2), MVars are suddenly an order of magnitude slower than with just one capability. Why?
On 25 January 2016 at 02:09, Yuras Shumovich
wrote: One possible explanation is closure locking which is not performed when there is only one capability. In my quick measurements it gives 40% speedup: https://ghc.haskell.org/trac/ghc/ticket/693#comment:9
This makes sense. After all, why bother with locks and barriers when the process is single-threaded anyway?
Thanks for your response.
-- Thomas Koster _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
-- Chris Allen Currently working on http://haskellbook.com

This has nothing to do with your questions, but are you sure that mvarInc
is sufficiently strict?
On 15:17, Sun, Jan 24, 2016 Christopher Allen
Well, there are cases where even with single-threading you want a memory barrier to prevent the CPU reordering instructions, but the shift to a single-threaded runtime should elide _some_ locks expressly designed to cope with multithreading.
On Sun, Jan 24, 2016 at 5:04 PM, Thomas Koster
wrote: Yuras,
On Sun, 2016-01-24 at 17:46 +1100, Thomas Koster wrote:
2. When given two capabilities (+RTS -N2), MVars are suddenly an order of magnitude slower than with just one capability. Why?
On 25 January 2016 at 02:09, Yuras Shumovich
wrote: One possible explanation is closure locking which is not performed when there is only one capability. In my quick measurements it gives 40% speedup: https://ghc.haskell.org/trac/ghc/ticket/693#comment:9
This makes sense. After all, why bother with locks and barriers when the process is single-threaded anyway?
Thanks for your response.
-- Thomas Koster _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
-- Chris Allen Currently working on http://haskellbook.com _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

On 24 January 2016 at 17:46, Thomas Koster
I have found that STM is faster than MVars in all my benchmarks, without exception. This seems to go against accepted wisdom.
On 24 January 2016 at 18:13, Thomas Koster
module Main (main) where
import Control.Concurrent.Async import Control.Concurrent.MVar import Control.Concurrent.STM import Control.Monad import Criterion.Main
main = defaultMain [ bgroup "thrash" [ bench "MVar" $ whnfIO (thrashTest mvarNew mvarInc mvarGet), bench "TVar" $ whnfIO (thrashTest tvarNew tvarInc tvarGet) ] ]
thrashTest :: IO a -> (a -> IO ()) -> (a -> IO b) -> IO b thrashTest new inc get = do var <- new threads <- replicateM 4 (async (replicateM_ 100000 $ inc var)) forM_ threads wait get var
mvarNew :: IO (MVar Int) mvarNew = newMVar 0
mvarInc :: MVar Int -> IO () mvarInc var = modifyMVar_ var $ \ i -> return $! succ i
mvarGet :: MVar Int -> IO Int mvarGet = readMVar
tvarNew :: IO (TVar Int) tvarNew = newTVarIO 0
tvarInc :: TVar Int -> IO () tvarInc var = atomically $ do i <- readTVar var writeTVar var $! succ i
tvarGet :: TVar Int -> IO Int tvarGet = readTVarIO
On 28 January 2016 at 16:26, John Lato
This has nothing to do with your questions, but are you sure that mvarInc is sufficiently strict?
I think so. If you think it isn't, I would love to know why, since strictness and correct use of seq are still a bit of a black art for me. The strictness characteristics of the MVar version and the STM version as written ought to be identical. If not, I would love to know why as well. -- Thomas Koster
participants (5)
-
Christopher Allen
-
John Lato
-
Ryan Yates
-
Thomas Koster
-
Yuras Shumovich