
On 02/18/2016 03:25 PM, Wojtek Narczyński wrote:
Dear list,
I create four Haskell threads, each performing disjoint STM transactions. The more system threads I add, the slower the program runs. With four system threads the performance is about 14% of the non-parallel run.
What is going on?
$ ghc -v
Glasgow Haskell Compiler, Version 7.10.3, stage 2 booted by GHC version 7.8.4
$ ghc -O2 -threaded test-stm.hs
$ time ./test-stm +RTS -N1
real    1.843s
$ time ./test-stm +RTS -N2
real    7.469s
$ time ./test-stm +RTS -N3
real    9.640s
$ time ./test-stm +RTS -N4
real    13.144s
$ cat test-stm.hs
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Monad
import Control.Concurrent
import Control.Concurrent.STM
import Control.Concurrent.STM.Stats
import System.IO.Unsafe

-- Copied from GHC docs

children :: MVar [MVar ()]
children = unsafePerformIO (newMVar [])

waitForChildren :: IO ()
waitForChildren = do
  cs <- takeMVar children
  case cs of
    []   -> return ()
    m:ms -> do
      putMVar children ms
      takeMVar m
      waitForChildren

forkChild :: IO () -> IO ThreadId
forkChild io = do
  mvar <- newEmptyMVar
  childs <- takeMVar children
  putMVar children (mvar:childs)
  forkFinally io (\_ -> putMVar mvar ())

-- Test case

main = do
  forkChild $ newTVarIO 0 >>= incrManyTimes "thread1"
  forkChild $ newTVarIO 0 >>= incrManyTimes "thread2"
  forkChild $ newTVarIO 0 >>= incrManyTimes "thread3"
  forkChild $ newTVarIO 0 >>= incrManyTimes "thread4"
  waitForChildren
  dumpSTMStats -- Confirms no conflicts

incrManyTimes :: String -> TVar Int -> IO ()
incrManyTimes l = incrRec (1000000 :: Int)
  where
    incrRec n v
      | n == 0    = pure ()
      | otherwise = trackNamedSTM l (modifyTVar v (+1)) >> incrRec (n-1) v
While your code does not contain any conflicts, the stm-stats library seems to introduce some. Replacing trackNamedSTM with atomically speeds things up quite a bit, so most of the time is lost in the library.

A short look at the library's code shows that it uses atomicModifyIORef on a global IORef holding a map of the statistics, and your STM transactions are really short, so that IORef is under heavy contention. atomicModifyIORef works by reading the old value, computing the new one, and then atomically swapping it in only if the IORef still contains the old value; if another thread changed it in the meantime, the value has to be read and recomputed again, and so on. With four threads running tiny transactions and then all updating this one IORef, they constantly interfere with each other.

If I make the transactions longer by doing 1000 increments in one transaction and then doing only 1000 transactions per thread, the code scales like one would expect.
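For reference, a minimal sketch of that batched variant (the helper name incrManyTimes' and the exact batching are mine; the original post does not contain this code):

incrManyTimes' :: String -> TVar Int -> IO ()
incrManyTimes' l = incrRec (1000 :: Int)
  where
    incrRec n v
      | n == 0    = pure ()
      | otherwise = do
          -- one transaction now performs 1000 increments, so the global
          -- statistics IORef in stm-stats is updated only 1000 times per
          -- thread instead of 1000000 times
          trackNamedSTM l (replicateM_ 1000 (modifyTVar v (+1)))
          incrRec (n - 1) v

And if you do not need the statistics at all, replacing trackNamedSTM l with atomically here removes the shared IORef from the picture entirely.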