
Dear list,

I create four Haskell threads, each performing disjoint STM transactions. The more system threads I add, the slower the program runs. With four system threads the performance drops to 14% of the non-parallel run. What is going on?

$ ghc -v
Glasgow Haskell Compiler, Version 7.10.3, stage 2 booted by GHC version 7.8.4

$ ghc -O2 -threaded test-stm.hs

$ time ./test-stm +RTS -N1
real 1.843s
$ time ./test-stm +RTS -N2
real 7.469s
$ time ./test-stm +RTS -N3
real 9.640s
$ time ./test-stm +RTS -N4
real 13.144s

$ cat test-stm.hs
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Monad
import Control.Concurrent
import Control.Concurrent.STM
import Control.Concurrent.STM.Stats
import System.IO.Unsafe

-- Copied from GHC docs
children :: MVar [MVar ()]
children = unsafePerformIO (newMVar [])

waitForChildren :: IO ()
waitForChildren = do
  cs <- takeMVar children
  case cs of
    []   -> return ()
    m:ms -> do
      putMVar children ms
      takeMVar m
      waitForChildren

forkChild :: IO () -> IO ThreadId
forkChild io = do
  mvar <- newEmptyMVar
  childs <- takeMVar children
  putMVar children (mvar:childs)
  forkFinally io (\_ -> putMVar mvar ())

-- Test case
main = do
  forkChild $ newTVarIO 0 >>= incrManyTimes "thread1"
  forkChild $ newTVarIO 0 >>= incrManyTimes "thread2"
  forkChild $ newTVarIO 0 >>= incrManyTimes "thread3"
  forkChild $ newTVarIO 0 >>= incrManyTimes "thread4"
  waitForChildren
  dumpSTMStats -- Confirms no conflicts

incrManyTimes :: String -> TVar Int -> IO ()
incrManyTimes l = incrRec (1000000 :: Int)
  where
    incrRec n v
      | n == 0    = pure ()
      | otherwise = trackNamedSTM l (modifyTVar v (+1)) >> incrRec (n-1) v

-- Wojtek

On 02/18/2016 03:25 PM, Wojtek Narczyński wrote:
Dear list,
I create four haskell threads, each performs disjoint STM transactions. The more system threads I add, the slower the program runs. With four system threads the performance is 14% of the nonparallel run.
What is going on?
While your code does not contain any conflicts, the stm-stats library seems to introduce some. Replacing trackNamedSTM with atomically speeds things up quite a bit, so most of the time is lost in the library.

A short look at the code of the library shows that it uses atomicModifyIORef on a global IORef holding a map of the statistics, and your STM transactions are really short, so that IORef is under high pressure. atomicModifyIORef works by reading the old value, computing the new one, and then doing an atomic swap only if the IORef still contains the old value. If the value has changed in the meantime, it has to read the new value, recompute, and try again. So you have four threads doing tiny transactions and then all updating the same IORef, interfering with each other.

If I make each transaction longer by doing 1000 increments in one transaction, and then do only 1000 transactions per thread, the code scales like one would expect.
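For concreteness, here is a rough sketch of that batched variant. It assumes the imports from the original test-stm.hs, and the name incrManyTimesBatched is only illustrative, not code from the thread:

incrManyTimesBatched :: String -> TVar Int -> IO ()
incrManyTimesBatched l = go (1000 :: Int)
  where
    go n v
      | n == 0    = pure ()
      | otherwise = do
          -- 1000 increments inside a single tracked transaction, so the
          -- stats IORef is only updated once per 1000 increments
          trackNamedSTM l (replicateM_ 1000 (modifyTVar v (+1)))
          go (n - 1) v

The total number of increments stays the same (1000 x 1000 per thread), but the per-transaction bookkeeping in stm-stats is amortized over 1000 increments instead of being paid on every single one.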

On 18.02.2016 16:00, Jonas Scholl wrote:
While your code does not contain any conflicts, the stm-stats library seems to contain some. Replacing trackNamedSTM with atomically speeds things up quite a bit, so most of the time is lost in the library. A short look at the code of the library shows that it uses atomicModifyIORef on some global IORef holding a map of the statistics.
Yes, the instrumentation was the culprit. Thank you!

I'm trying to come up with a (Ord a, Hashable a) => STM (Set a), internally partitioned for reduced contention. It looks like my code for the Set itself was right, but my code for testing it was... four transactions, inserting one million elements each. This couldn't have worked well; I don't know why it worked at all. Now that I have switched to four million transactions, inserting one element each, things work much better. Perhaps not great, as the 1 OS thread case is still the fastest, but 4 OS threads are much faster on the partitioned set than on its unpartitioned counterpart.

Overall, transactionally inserting four million Ints into a Set in one second ain't bad at all. Counting them afterwards takes much longer...

-- Wojtek
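For illustration, a minimal sketch of what "internally partitioned for reduced contention" could look like, assuming a fixed number of partitions selected by hash. All names here (PartitionedSet, newPartitionedSet, insert) are hypothetical and not the actual code discussed in the thread:

import Control.Concurrent.STM
import Data.Array (Array, listArray, (!))
import Data.Hashable (Hashable, hash)
import qualified Data.Set as Set

data PartitionedSet a = PartitionedSet
  { partitions :: Array Int (TVar (Set.Set a))
  , partCount  :: Int
  }

newPartitionedSet :: Int -> IO (PartitionedSet a)
newPartitionedSet n = do
  tvars <- mapM (const (newTVarIO Set.empty)) [1 .. n]
  pure (PartitionedSet (listArray (0, n - 1) tvars) n)

-- Each insert touches only the TVar of the element's partition, so
-- transactions hitting different partitions never conflict with each other.
insert :: (Ord a, Hashable a) => PartitionedSet a -> a -> STM ()
insert (PartitionedSet parts n) x =
  modifyTVar' (parts ! (hash x `mod` n)) (Set.insert x)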
participants (2):
- Jonas Scholl
- Wojtek Narczyński