IORef vs TVar performance: 6 seconds versus 4 minutes

I decided to try to implement a graph algorithm using STM. Each node in the graph has a set of TVar-protected lists of the nodes it links to and the nodes that link to it. Also, there is a global TVar-protected Data.Map that contains all the nodes in the graph, indexed by name (which is polymorphic):

data Node k r = Node
  { fwdPos  :: TVar [Node k r]  -- forward links (nodes we like)
  , fwdNeg  :: TVar [Node k r]  -- we allow "negative" links, too (nodes we don't like)
  , revPos  :: TVar [Node k r]  -- backlinks (nodes that like us)
  , revNeg  :: TVar [Node k r]  -- negative backlinks (nodes that don't like us)
  , currRep :: r                -- extra user-defined data
  , name    :: k                -- node's unique identifier
  } deriving Show

data Network k r = Network
  { node    :: TVar (M.Map k (Node k r))  -- map of nodes by name
  , trusted :: TVar [Node k r]            -- a list of nodes we need to iterate over occasionally
  } deriving Show

I tried loading a datafile of about 20,000 nodes into the graph in one big transaction and found that it takes about 4 minutes. This seemed rather slow, so I replaced all the TVars with IORefs (and substituted IO for STM in the type signatures), and the same operation with the new version took about 6 seconds!

This is all with one thread, so there should be no contention for the TVars. Is there something about STM that makes it scale worse than linearly with respect to the number of mutations in a transaction?

The performance numbers above are for ghc-6.10.1. With ghc-6.8.3, the STM version takes more than 9 minutes.

According to profiling, one of my trouble spots is this function, which just conses an entry onto a TVar [a]:

stmcons :: k -> TVar [k] -> STM ()
stmcons x tv = do
  xs <- readTVar tv
  writeTVar tv (x:xs)

This seems like it ought to be pretty innocuous, unless the whole list is getting evaluated each time I cons a new entry, or readTVar and writeTVar are much more expensive than they appear.

-jim

Both readTVar and writeTVar are worse than O(1); they have to look up the TVar in the transaction log to see if you have made local changes to it. Right now it looks like that operation is O(n) where n is the number of TVars accessed by a transaction, so your big transaction which is just accessing a ton of TVars is likely O(n^2).
From ghc/HEAD/rts/STM.c:
static TRecEntry *get_entry_for(StgTRecHeader *trec, StgTVar *tvar,
                                StgTRecHeader **in) {
  TRecEntry *result = NULL;

  TRACE("%p : get_entry_for TVar %p", trec, tvar);
  ASSERT(trec != NO_TREC);

  do {
    FOR_EACH_ENTRY(trec, e, {
      if (e -> tvar == tvar) {
        result = e;
        if (in != NULL) {
          *in = trec;
        }
        BREAK_FOR_EACH;
      }
    });
    trec = trec -> enclosing_trec;
  } while (result == NULL && trec != NO_TREC);

  return result;
}
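
To see the quadratic cost concretely, here is a minimal timing sketch (an illustration, assuming the stm and time packages; exact numbers will vary): one transaction performing N writes forces each writeTVar to scan the growing log, while N one-write transactions each keep the log at a single entry.

import Control.Concurrent.STM
import Control.Monad (forM, forM_)
import Data.Time.Clock (getCurrentTime, diffUTCTime)

timeIt :: String -> IO () -> IO ()
timeIt label act = do
  t0 <- getCurrentTime
  act
  t1 <- getCurrentTime
  putStrLn (label ++ ": " ++ show (diffUTCTime t1 t0))

main :: IO ()
main = do
  let n = 20000 :: Int
  tvs <- atomically (forM [1..n] (\_ -> newTVar (0 :: Int)))
  -- one big transaction: every writeTVar first scans the whole log, ~O(n^2)
  timeIt "one big transaction" $
    atomically (forM_ tvs (\tv -> writeTVar tv 1))
  -- n tiny transactions: each log holds a single entry, ~O(n)
  timeIt "n small transactions" $
    forM_ tvs (\tv -> atomically (writeTVar tv 1))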
STM performance is not really geared towards "big" transactions right now, in large part because big transactions are likely to starve under any real workload anyway. If you have a single-threaded startup phase to populate your data, followed by small concurrent mutations, you can run the startup in IO, using many small transactions to populate the data.
-- ryan
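
Jim's follow-up below does exactly this. As a minimal sketch of the pattern, reusing Jim's Node and Network declarations from the first message (insertNode and loadGraph are hypothetical names; wiring up the link lists would happen in further small transactions):

import Control.Concurrent.STM
import qualified Data.Map as M

-- One node per transaction, so each transaction log stays tiny.
insertNode :: Ord k => Network k r -> k -> r -> STM ()
insertNode (Network nodesVar _) key rep = do
  fp <- newTVar []
  fn <- newTVar []
  rp <- newTVar []
  rn <- newTVar []
  m  <- readTVar nodesVar
  writeTVar nodesVar (M.insert key (Node fp fn rp rn rep key) m)

loadGraph :: Ord k => Network k r -> [(k, r)] -> IO ()
loadGraph net = mapM_ (\(key, rep) -> atomically (insertNode net key rep))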

Thanks, that's good to know. I tried incrementally loading the graph, one node per transaction. It's faster: about 38 seconds instead of 4 minutes, but I think I'll stick with IORefs and one thread for the present. -jim

On Mon, Dec 29, 2008 at 1:15 PM, Ryan Ingram wrote:
Both readTVar and writeTVar are worse than O(1); they have to look up the TVar in the transaction log to see if you have made local changes to it.
Right now it looks like that operation is O(n) where n is the number of TVars accessed by a transaction, so your big transaction which is just accessing a ton of TVars is likely O(n^2).
So this actually brings up a tangential question I've wondered about for a while. The lock-free data structure paper at http://research.microsoft.com/users/Cambridge/simonpj/papers/stm/lock-free-f... shows locks vs. STM with very similar performance on a single processor, with STM quickly winning on multiprocessors.

I understand the basic idea as being that the STM versions are optimistic, in that they only pay a cost on a transaction collision, while locks have to go frob some bit of memory on every read or write. I'm guessing that STM would also show less contention on a read-heavy load, since you don't need to block anyone if you are just reading (though I guess you can get the same effect with shared-read locks in a locking style?). But every STM operation has to modify a transaction log, which seems like it should be even more expensive than frobbing a lock bit. So it seems like if the per-operation STM overhead is higher, and blocking contention is the same (assuming the lock implementation uses shared locks for reads), I don't see how the STM implementation could be faster.

I'm sure it's all more complicated than this... but where exactly is the STM performance coming from in those papers? Also, they disclaim in the testing section that the STM implementation was immature; now that there's parallel GC, and perhaps STM doesn't do so much dynamic allocation (?), shouldn't the STM numbers be even better?

On Sun, Dec 28, 2008 at 11:08 PM, Evan Laforge wrote:
But every STM operation has to modify a transaction log, which seems like it should be even more expensive than frobbing a lock bit. So it seems like if the per-operation STM overhead is higher, and blocking contention is the same (assuming the lock implementation uses shared locks for reads), I don't see how the STM implementation could be faster.
Absolutely, although theoretically, for lots of repeated operations your transaction log remains in cache, so it shouldn't be too bad.

When I talked to Simon Peyton-Jones at MSR this fall, he talked about locks being like "the assembly language of concurrency"; STM is then a high-level language. You pay some amount in performance to get abstraction & the ability to write composable programs that remain correct, a problem that locking hasn't been able to solve despite decades of research.

It's always possible to decompose the final program down into one that could be based on locks. But this is often at the cost of maintainability; the code becomes one mess of spaghetti locking in order to maintain whatever invariants need to be maintained to prevent deadlock and compose code together.
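
The classic illustration of the composability point, as a minimal sketch: two independently correct STM actions combine into one atomic action, with no lock-ordering discipline required.

import Control.Concurrent.STM

type Account = TVar Integer

withdraw :: Account -> Integer -> STM ()
withdraw acc amount = do
  bal <- readTVar acc
  writeTVar acc (bal - amount)

deposit :: Account -> Integer -> STM ()
deposit acc amount = do
  bal <- readTVar acc
  writeTVar acc (bal + amount)

-- The composition is atomic as a whole; with locks, composing two
-- correct operations is exactly where deadlock and ordering bugs creep in.
transfer :: Account -> Account -> Integer -> IO ()
transfer from to amount = atomically (withdraw from amount >> deposit to amount)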
I'm sure it's all more complicated than this... but where exactly is the STM performance coming from in those papers? Also, they disclaim in the testing section that the STM implementation was immature; now that there's parallel GC, and perhaps STM doesn't do so much dynamic allocation (?), shouldn't the STM numbers be even better?
The paper probably got its performance from the "optimistic concurrency" that STM allows; a transaction that reads many shared parts of a data structure but only modifies a small amount, likely far away from what other transactions modify, as in a binary tree, is ideal for STM performance.

You are correct that multiple-reader locks could get some of this benefit back. But those systems have serious drawbacks when a read-only lock needs to be changed into a read-write lock in the middle of an operation. There's also the same deadlock/lock-ordering problems as in any lock-based solution. STM cuts this Gordian knot by forcing all lock-taking actions to be otherwise pure, besides the mutation to the data structures protected by the locks. This way, a failing transaction can always just be killed and restarted if something goes wrong. It might be possible to enforce the same sort of purity on top of a lock-based system.

I don't think the STM runtime has gotten a lot of love since that paper; given that TVar operations are using a linear search over the transaction log, I suspect it could go a long way if it had a capable volunteer. This is part of the motivation behind trying to port the concurrency substrate from C into Haskell [1]; getting STM out of the GHC RTS and into the hands of library writers will put a lot more eyeballs on the performance problems.

-- ryan

[1] Li, Peng. Programmable Concurrency in a Pure and Lazy Language. http://www.seas.upenn.edu/~lipeng/homepage/
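
A minimal sketch of the binary-tree case (illustrative types, not from the paper): each child pointer lives behind its own TVar, so two inserts that descend into different subtrees write disjoint TVars and share only a few reads near the root, which is why such transactions rarely invalidate each other.

import Control.Concurrent.STM

data Tree k = Leaf | Branch (TVar (Tree k)) k (TVar (Tree k))

insert :: Ord k => TVar (Tree k) -> k -> STM ()
insert tv x = do
  t <- readTVar tv
  case t of
    Leaf -> do
      l <- newTVar Leaf
      r <- newTVar Leaf
      writeTVar tv (Branch l x r)
    Branch l k r
      | x < k     -> insert l x   -- descend left; only this path is logged
      | x > k     -> insert r x   -- descend right
      | otherwise -> return ()    -- key already present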

It's always possible to decompose the final program down into one that could be based on locks. But this is often at the cost of maintainability; the code becomes one mess of spaghetti locking in order to maintain whatever invariants need to be maintained to prevent deadlock and compose code together.
Agreed, I don't need any convincing that STM is 100x more fun than locks.
You are correct that multiple-reader locks could get some of this benefit back. But those systems have serious drawbacks when a read-only lock needs to be changed into a read-write lock in the middle of an operation. There's also the same deadlock/lock-ordering problems as in any lock-based solution. STM cuts this Gordian knot by forcing all lock-taking actions to be otherwise pure, besides the mutation to the data structures protected by the locks. This way, a failing transaction can always just be killed and restarted if something goes wrong. It might be possible to enforce the same sort of purity on top of a lock-based system.
I see, so would it be safe to say that the simple but reasonable locking implementations in the paper perform poorly compared to STM because of read contention, and while read contention can be eliminated with read locks, it's too much of a complicated hassle, so most locking data structures won't do it? It seems, naively, like relatively simple but popular structures like queues shouldn't have too much trouble taking a read lock for peekQueue but a read-write lock for takeQueue.
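
In STM terms, that queue might look like the following sketch (peekQueue and takeQueue are illustrative names, not a real library API): the peek only reads its TVar, so concurrent peeks never conflict with each other, while the take writes, behaving like acquiring the lock in write mode.

import Control.Concurrent.STM

newtype Queue a = Queue (TVar [a])

-- Read-only: concurrent peeks don't invalidate each other.
peekQueue :: Queue a -> STM a
peekQueue (Queue tv) = do
  xs <- readTVar tv
  case xs of
    []      -> retry
    (x : _) -> return x

-- Writes the TVar: conflicts with any overlapping peek or take.
takeQueue :: Queue a -> STM a
takeQueue (Queue tv) = do
  xs <- readTVar tv
  case xs of
    []         -> retry
    (x : rest) -> do
      writeTVar tv rest
      return x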
I don't think the STM runtime has gotten a lot of love since that paper; given that TVar operations are using a linear search over the transaction log, I suspect it could go a long way if it had a capable volunteer. This is part of the motivation behind trying to port the concurrency substrate from C into Haskell [1]; getting STM out of the GHC RTS and into the hands of library writers will put a lot more eyeballs on the performance problems.
But I thought a Haskell RTS was a win for correctness and ease of modification, but a loss for performance? Thanks for the paper link, I'll check it out in a bit. There's a seemingly endless supply of interesting Haskell papers out there...

On Mon, Dec 29, 2008 at 7:56 AM, Evan Laforge wrote:
But I thought a Haskell RTS was a win for correctness and ease of modification, but a loss for performance? Thanks for the paper link, I'll check it out in a bit. There's a seemingly endless supply of interesting Haskell papers out there...
In the short term, yes. But it's my opinion that most anything that's a win for ease of modification is a win for performance in the long run, if it gets more people writing code. -- ryan

In the short term, yes. But it's my opinion that most anything that's a win for ease of modification is a win for performance in the long run, if it gets more people writing code.
Also agreed, in theory. The language is definitely also in line with this philosophy. In practice, though, it's also nice for low-level stuff to be fast right now. However, I expect that if the Haskell RTS gets implemented, it'll be behind a switch at least until it's as fast as or faster than the C version, so it's all good. Anyway, thanks for the clarification on the STM performance. I guess it means I can't brag too much about pure performance, but there's plenty else to brag about.

Evan Laforge wrote:
So this actually brings up a tangential question I've wondered about for a while.
The lock-free data structure paper at http://research.microsoft.com/users/Cambridge/simonpj/papers/stm/lock-free-f... shows locks vs. STM with very similar performance on a single processor, with STM quickly winning on multiprocessors.
I have not verified this, but a possible cause is that Control.Concurrent.QSem isn't efficient if there are many waiters. It should use two lists for managing the waiters (a la Okasaki). But why does it manually manage the waiters at all? MVars are fair, in GHC at least. So this should work:

newtype Sem = Sem (MVar Int) (MVar Int)

newSem :: Int -> IO Sem
newSem initial = liftM2 Sem (newMVar initial) newEmptyMVar

-- | Wait for a unit to become available
waitSem :: Sem -> IO ()
waitSem (Sem sem wakeup) = do
  avail' <- modifyMVar sem (\avail -> return (avail-1, avail-1))
  when (avail' < 0) $ takeMVar wakeup >>= putMVar sem

-- | Signal that a unit of the 'Sem' is available
signalSem :: Sem -> IO ()
signalSem (Sem sem wakeup) = do
  avail <- takeMVar sem
  if avail < 0
    then putMVar wakeup (avail+1)
    else putMVar sem (avail+1)

(I should turn this into a library proposal.)

Bertram
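
A possible usage sketch for this API (hypothetical, and note it needs the data fix from the next message to compile, since a newtype cannot have two fields):

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar

-- Three tokens, ten workers: at most three run their critical section at once.
main :: IO ()
main = do
  sem  <- newSem 3
  done <- newEmptyMVar
  mapM_ (\i -> forkIO $ do
            waitSem sem
            putStrLn ("worker " ++ show i)
            signalSem sem
            putMVar done ()) [1 .. 10 :: Int]
  mapM_ (const (takeMVar done)) [1 .. 10 :: Int]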

I think I can improve on your code. Bertram Felgenhauer wrote:
But why does it manually manage the waiters at all? MVars are fair, in GHC at least. So this should work:
data Sem = Sem (MVar Int) (MVar Int)
I changed the above to be a data declaration; a newtype can have only one field, so Sem must be a data type here.
newSem :: Int -> IO Sem
newSem initial = liftM2 Sem (newMVar initial) newEmptyMVar

-- | Wait for a unit to become available
waitSem :: Sem -> IO ()
waitSem (Sem sem wakeup) = do
  avail' <- modifyMVar sem (\avail -> return (avail-1, avail-1))
Threads can get out of order at this point. This "order bug" may be undesirable. Also, killing the thread while it waits for "wakeup" below would be bad. You need an exception handler and some kind of cleanup.
when (avail' < 0) $ takeMVar wakeup >>= putMVar sem
-- | Signal that a unit of the 'Sem' is available
signalSem :: Sem -> IO ()
signalSem (Sem sem wakeup) = do
  avail <- takeMVar sem
  if avail < 0
    then putMVar wakeup (avail+1)
    else putMVar sem (avail+1)
You should change this from "= do" to "= block $ do".
(I should turn this into a library proposal.)
Bertram
If you do not need to take N at a time then the untested code below has no "order bug" and is fair.
module Sem where
import Control.Concurrent.MVar
import Control.Monad (when, liftM2)
import Control.Exception (block)

data Sem = Sem
  { avail :: MVar Int -- ^ provides fast path and fair queue
  , lock  :: MVar ()  -- ^ held while signalling the queue
  }

-- It makes no sense here to initialize with a negative number, so
-- this is treated the same as initializing with 0.
newSem :: Int -> IO Sem
newSem init | init < 1  = liftM2 Sem newEmptyMVar (newMVar ())
            | otherwise = liftM2 Sem (newMVar init) (newMVar ())

waitSem :: Sem -> IO ()
waitSem s@(Sem sem _) = block $ do
  avail <- takeMVar sem
  when (avail > 1) (signalSemN (pred avail) s)

signalSem :: Sem -> IO ()
signalSem = signalSemN 1

signalSemN :: Int -> Sem -> IO ()
signalSemN i (Sem sem lock)
  | i <= 0    = return ()
  | otherwise = withMVar lock $ \_ -> block $ do
      old <- tryTakeMVar sem
      case old of
        Nothing -> putMVar sem i
        Just v  -> putMVar sem $! v + i
All waiters block in arrival order on the takeMVar in waitSem. Calls to signalSemN avoid conflicting with each other by serializing on the MVar () lock. The above is quite fast so long as the semaphore holds no more than the value 1. Once it holds more than 1, the waiter must take time to add back the remaining value. Note that even though threads are woken up in order, they may still go out of order while blocking for the () lock when adding back the remaining value (in the presence of other signalers). The above is also exception safe. The only place it can die is during the takeMVar, and this merely removes a blocked waiter. I see no way to add a fair waitSemN without changing Sem. But if I change Sem then I can make a fair waitSemN. The untested code is below:
module Sem where
import Control.Concurrent.MVar
import Control.Monad (when, liftM3)
import Control.Exception.Base

data Sem = Sem
  { semWait   :: MVar ()  -- for serializing waiting threads
  , semAvail  :: MVar Int -- positive quantity available
  , semSignal :: MVar ()  -- for serializing signaling threads
  }

newSem :: Int -> IO Sem
newSem i | i <= 0    = liftM3 Sem (newMVar ()) newEmptyMVar (newMVar ())
         | otherwise = liftM3 Sem (newMVar ()) (newMVar i) (newMVar ())

waitSem :: Sem -> IO ()
waitSem = waitSemN 1

waitSemN :: Int -> Sem -> IO ()
waitSemN i sem@(Sem w a s)
  | i <= 0    = return ()
  | otherwise = withMVar w $ \_ -> block $ do
      let go n = do
            avail <- onException (takeMVar a) (signalSemN (i-n) sem)
            case compare avail n of
              LT -> go $! n - avail
              EQ -> return ()
              GT -> signalSemN (avail - n) sem
      go i

signalSem :: Sem -> IO ()
signalSem = signalSemN 1

signalSemN :: Int -> Sem -> IO ()
signalSemN i (Sem _ a s)
  | i <= 0    = return ()
  | otherwise = withMVar s $ \_ -> block $ do
      ma <- tryTakeMVar a
      case ma of
        Nothing -> putMVar a i
        Just v  -> putMVar a $! v + i
Trying for exception safety makes the above slightly tricky. It works by allowing only a single thread to get the semWait lock. This keeps all the arriving threads in the fair blocking queue for the semWait lock. The holder of the semWait lock then nibbles at semAvail's positive value until it is satisfied. Excess value is added back safely with signalSemN. Cheers, Chris Kuklewicz

jsnow:
This seems like it ought to be pretty innocuous, unless the whole list is getting evaluated each time I cons a new entry, or if readTVar or writeTVar are much more expensive than they appear.
For general details on the complexity of IORef, MVar, or TVar structures, see:

Comparing the Performance of Concurrent Linked-List Implementations in Haskell.
Martin Sulzmann, Edmund S. L. Lam, Simon Marlow.
Submitted to DAMP'09, October 2008.
http://www.haskell.org/~simonmar/papers/concurrent-data.pdf
participants (6): Bertram Felgenhauer, ChrisK, Don Stewart, Evan Laforge, Jim Snow, Ryan Ingram