
Hi,

I'm looking at benchmarking several different concurrency libraries against each other. The benchmarks involve things like repeatedly sending values between two threads. I'm likely to use Criterion for the task. However, one thing I've found is that the libraries have noticeably different behaviour in terms of the amount of garbage created. Criterion has an option to perform GC between each benchmark, but I think that the benchmark is only fair if it takes into account the GC time for each system; it doesn't seem right for two systems to be counted as equal if the times to get the results are the same, but one then has to spend twice as long as the other in GC afterwards.

Here are some options I've considered:

* I could make a small change to Criterion to include the time for performing GC in each benchmark run, but I worry that running the GC so often is also misrepresentative (might 100 small GCs take a lot longer than one large GC of the same data?) -- it may also add a lot of overhead to quick benchmarks, but I can avoid that problem.

* Alternatively, I could run the GC once at the end of all the runs, then apportion the cost equally to each of the benchmark times (so if 100 benchmarks require 0.7s of GC, add 0.007s to each time) -- but if GC is triggered somewhere in the middle of the runs, that upsets the strategy a little.

* A further alternative is to make each benchmark a whole program (and decently long), then just time the whole thing, rather than using Criterion.

Has anyone run into these issues before, and can anyone offer an opinion on what the best option is?

Thanks,

Neil.
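[Editor's note: the first option above could be sketched roughly as follows. `timeWithGC` is a hypothetical helper, not part of Criterion's API; it forces a major GC after each run via `System.Mem.performGC` so that the cost of collecting that run's garbage is charged to the run itself.]

```haskell
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)
import System.Mem (performGC)

-- Hypothetical helper (not Criterion's API): run an action, then force a
-- major GC so this run's garbage is collected, and charge both to the run.
timeWithGC :: IO a -> IO (a, Double)
timeWithGC act = do
  t0 <- getCPUTime
  x  <- act
  performGC            -- include the cost of collecting this run's garbage
  t1 <- getCPUTime
  return (x, fromIntegral (t1 - t0) / 1e12)   -- picoseconds to seconds

main :: IO ()
main = do
  (_, t) <- timeWithGC (evaluate (sum [1 .. 1000000 :: Integer]))
  print (t >= 0)
```

As the email notes, forcing a GC per run may itself distort the numbers, since many small collections need not cost the same as one large one.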

On Thu, Mar 4, 2010 at 7:16 PM, Neil Brown wrote:
However, one thing I've found is that the libraries have noticeably different behaviour in terms of the amount of garbage created.
In fact, CML relies on the garbage collector for some implementation constructions. John H. Reppy's "Concurrent Programming in ML" is worth a read if you haven't. My guess is that the Haskell implementation of CML is bloody expensive. It is based on the paper http://www.cs.umd.edu/~avik/projects/cmllch/ where Chaudhuri first constructs an abstract machine for CML and then binds this to the Haskell MVar and forkIO constructions.

In any case, implementing sorting networks, power series multiplication (Doug McIlroy, Rob Pike -- "Squinting at power series") or the good old prime sieve are also interesting benchmarks.

-- J.
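[Editor's note: for reference, the classic concurrent prime sieve mentioned above can be written against plain Control.Concurrent channels rather than CML -- this sketch uses the standard `Chan` API, not the cml package, so it exercises only the base concurrency primitives.]

```haskell
import Control.Concurrent
import Control.Monad

-- One filter thread per prime: emit the head of its input as a prime,
-- then pass along only the numbers that prime does not divide.
sieve :: Chan Int -> Chan Int -> IO ()
sieve input output = do
  p <- readChan input
  writeChan output p
  rest <- newChan
  _ <- forkIO (sieve rest output)
  forever $ do
    n <- readChan input
    when (n `mod` p /= 0) (writeChan rest n)

main :: IO ()
main = do
  nums   <- newChan
  primes <- newChan
  _ <- forkIO (sieve nums primes)
  _ <- forkIO (mapM_ (writeChan nums) [2 ..])
  ps <- replicateM 10 (readChan primes)
  print ps
```

Each discovered prime spawns a new filter thread, so as a benchmark it stresses both channel traffic and thread creation.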

Jesper Louis Andersen wrote:
On Thu, Mar 4, 2010 at 7:16 PM, Neil Brown
wrote: However, one thing I've found is that the libraries have noticeably different behaviour in terms of the amount of garbage created.
In fact, CML relies on the garbage collector for some implementation constructions. John H. Reppy's "Concurrent Programming in ML" is worth a read if you haven't. My guess is that the Haskell implementation of CML is bloody expensive. It is based on the paper http://www.cs.umd.edu/~avik/projects/cmllch/ where Chaudhuri first constructs an abstract machine for CML and then binds this to the Haskell MVar and forkIO constructions.
CML is indeed the library that has the most markedly different behaviour. In Haskell, the CML package manages to produce timings like this for fairly simple benchmarks:

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    2.47s  (  2.49s elapsed)
  GC    time   59.43s  ( 60.56s elapsed)
  EXIT  time    0.00s  (  0.01s elapsed)
  Total time   61.68s  ( 63.07s elapsed)

  %GC time      96.3%  (96.0% elapsed)

  Alloc rate    784,401,525 bytes per MUT second

  Productivity   3.7% of total user, 3.6% of total elapsed

I knew from reading the code that CML's implementation would do something like this, although I do wonder if it triggers some pathological case in the GC. The problem is that when I benchmark the program, it seems to finish in decent time; then it spends 60 seconds doing GC before finally terminating! So I need some way of timing that will reflect this; I wonder if just timing the entire run-time (and making the benchmarks long enough to not be swallowed by program start-up times, etc) is the best thing to do.

A secondary issue is whether I should even include CML at all, considering the timings!

Thanks,

Neil

On Thu, Mar 4, 2010 at 8:35 PM, Neil Brown wrote:
CML is indeed the library that has the most markedly different behaviour. In Haskell, the CML package manages to produce timings like this for fairly simple benchmarks:
%GC time 96.3% (96.0% elapsed)
I knew from reading the code that CML's implementation would do something like this, although I do wonder if it triggers some pathological case in the GC.
That result is peculiar. What are you doing to the library, and what do you expect happens? Since I have some code invested on top of CML, I'd like to gain a little insight if possible. -- J.

Jesper Louis Andersen wrote:
On Thu, Mar 4, 2010 at 8:35 PM, Neil Brown
wrote: CML is indeed the library that has the most markedly different behaviour. In Haskell, the CML package manages to produce timings like this for fairly simple benchmarks:
%GC time 96.3% (96.0% elapsed)
I knew from reading the code that CML's implementation would do something like this, although I do wonder if it triggers some pathological case in the GC.
That result is peculiar. What are you doing to the library, and what do you expect happens? Since I have some code invested on top of CML, I'd like to gain a little insight if possible.
In trying to simplify my code, the added time has moved from GC time to EXIT time (and increased!). This shift isn't too surprising -- I believe the time is really spent trying to kill lots of threads. Here's my very simple benchmark; the main thread repeatedly chooses between receiving from two threads that are sending to it:

====
import Control.Concurrent
import Control.Concurrent.CML
import Control.Monad

main :: IO ()
main = do
  let numChoices = 2
  cs <- replicateM numChoices channel
  mapM_ forkIO [replicateM_ (100000 `div` numChoices) $ sync $ transmit c () | c <- cs]
  replicateM_ 100000 $ sync $ choose [receive c (const True) | c <- cs]
====

Compiling with -threaded, and running with +RTS -s, I get:

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    2.68s  (  3.56s elapsed)
  GC    time    1.84s  (  1.90s elapsed)
  EXIT  time   89.30s  ( 90.71s elapsed)
  Total time   93.82s  ( 96.15s elapsed)

I think the issue with the CML library is that it spawns a lot of threads (search the source for forkIO: http://hackage.haskell.org/packages/archive/cml/0.1.3/doc/html/src/Control-C...). Presumably the Haskell RTS isn't optimised for this approach (maybe the ML RTS was, from what you said?), and at the end of the program it spends a lot of time reaping the threads.

The problem isn't nearly as bad if you don't use choose, though:

====
import Control.Concurrent
import Control.Concurrent.CML
import Control.Monad

main :: IO ()
main = do
  c <- channel
  forkIO $ replicateM_ 100000 $ sync $ transmit c ()
  replicateM_ 100000 $ sync $ receive c (const True)
====

I get:

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.92s  (  2.65s elapsed)
  GC    time    0.92s  (  0.93s elapsed)
  EXIT  time    0.00s  (  0.02s elapsed)
  Total time    2.65s  (  3.59s elapsed)

  %GC time      34.6%  (25.8% elapsed)

Hope that helps,

Neil.
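[Editor's note: a useful baseline when comparing these numbers -- my addition, not from the thread -- is the same ping traffic over a bare MVar, with no CML machinery and only one extra thread, which isolates the cost of the primitives themselves.]

```haskell
import Control.Concurrent
import Control.Monad

-- Baseline sketch: 100,000 unit values handed between two threads over a
-- single MVar, no CML events, no choose, no thread-per-synchronisation.
main :: IO ()
main = do
  mv <- newEmptyMVar
  _ <- forkIO $ replicateM_ 100000 $ putMVar mv ()
  replicateM_ 100000 $ takeMVar mv
  putStrLn "done"
```

Any EXIT-time blow-up in the CML version but not here points at the library's extra threads rather than the RTS's basic handover cost.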

On 04/03/2010 22:01, Neil Brown wrote:
Jesper Louis Andersen wrote:
On Thu, Mar 4, 2010 at 8:35 PM, Neil Brown
wrote: CML is indeed the library that has the most markedly different behaviour. In Haskell, the CML package manages to produce timings like this for fairly simple benchmarks:
%GC time 96.3% (96.0% elapsed)
I knew from reading the code that CML's implementation would do something like this, although I do wonder if it triggers some pathological case in the GC.
That result is peculiar. What are you doing to the library, and what do you expect happens? Since I have some code invested on top of CML, I'd like to gain a little insight if possible.
In trying to simplify my code, the added time has moved from GC time to EXIT time (and increased!). This shift isn't too surprising -- I believe the time is really spent trying to kill lots of threads. Here's my very simple benchmark; the main thread repeatedly chooses between receiving from two threads that are sending to it:
====
import Control.Concurrent
import Control.Concurrent.CML
import Control.Monad

main :: IO ()
main = do
  let numChoices = 2
  cs <- replicateM numChoices channel
  mapM_ forkIO [replicateM_ (100000 `div` numChoices) $ sync $ transmit c () | c <- cs]
  replicateM_ 100000 $ sync $ choose [receive c (const True) | c <- cs]
====
Compiling with -threaded, and running with +RTS -s, I get:
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    2.68s  (  3.56s elapsed)
  GC    time    1.84s  (  1.90s elapsed)
  EXIT  time   89.30s  ( 90.71s elapsed)
  Total time   93.82s  ( 96.15s elapsed)
Good grief. Can I get a copy of this program? It might be something simple that we can fix. Just having lots of threads shouldn't be a performance problem per se; we have benchmarks that create millions of threads without any problems.

Cheers,
Simon

Simon Marlow wrote:
import Control.Concurrent
import Control.Concurrent.CML
import Control.Monad

main :: IO ()
main = do
  let numChoices = 2
  cs <- replicateM numChoices channel
  mapM_ forkIO [replicateM_ (100000 `div` numChoices) $ sync $ transmit c () | c <- cs]
  replicateM_ 100000 $ sync $ choose [receive c (const True) | c <- cs]
Good grief. Can I get a copy of this program? It might be something simple that we can fix. Just having lots of threads shouldn't be a performance problem per se, we have benchmarks that create millions of threads without any problems.

That's all the code you need, along with the cml package from Hackage. Put the above few lines into GoodGrief.hs (the reply has munged the indentation slightly), and do:
  cabal install cml
  ghc --make -threaded GoodGrief.hs
  ./GoodGrief +RTS -s

That got me the listed results on GHC 6.12.1 (I did use -threaded but not -N, as I was on a single-core machine; I believe the same problem occurs without -threaded). The problem is in the CML library that the above code uses.

Neil.

On 05/03/2010 13:45, Neil Brown wrote:
Simon Marlow wrote:
import Control.Concurrent
import Control.Concurrent.CML
import Control.Monad

main :: IO ()
main = do
  let numChoices = 2
  cs <- replicateM numChoices channel
  mapM_ forkIO [replicateM_ (100000 `div` numChoices) $ sync $ transmit c () | c <- cs]
  replicateM_ 100000 $ sync $ choose [receive c (const True) | c <- cs]
Good grief. Can I get a copy of this program? It might be something simple that we can fix. Just having lots of threads shouldn't be a performance problem per se, we have benchmarks that create millions of threads without any problems.

That's all the code you need, along with the cml package from Hackage. Put the above few lines into GoodGrief.hs (the reply has munged the indentation slightly), and do:
  cabal install cml
  ghc --make -threaded GoodGrief.hs
  ./GoodGrief +RTS -s
That got me the listed results on GHC 6.12.1 (I did use -threaded but not -N as I was on a single-core machine; I believe the same problem occurs without -threaded). The problem is in the CML library that the above code uses.
Yes, I see what the problem is. Removing a thread from the queue attached to an MVar is an O(n) operation, and in this case you have thousands of threads attached to MVars and the system wants to clean them all up before exiting, hence things go O(n^2).

Threads only need to be removed from the queue when:

 - they are the target of throwTo from another thread
 - the system is shutting down, and deleting threads
 - the thread is unreachable and needs to be woken up to be sent the BlockedIndefinitelyOnMVar exception (this is probably why you sometimes get the 60s GC just after your benchmark has finished, but before shutdown).

So as it happens, Simon PJ and I were discussing some changes yesterday that will eventually make removing a thread from a queue an O(1) operation. Hopefully 6.14.1 will be better in this regard.

Cheers,
Simon
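[Editor's note: the O(1)-removal idea can be illustrated in miniature. The real fix lives in GHC's RTS (in C); this is just a sketch of the underlying data-structure point: in a doubly-linked queue, a node holding its own prev/next references can unlink itself without scanning the queue.]

```haskell
import Data.IORef

-- A node in a mutable doubly-linked queue. Given the node itself,
-- unlinking is O(1): just splice its neighbours together.
data Node a = Node
  { value :: a
  , prev  :: IORef (Maybe (Node a))
  , next  :: IORef (Maybe (Node a))
  }

unlink :: Node a -> IO ()
unlink n = do
  mp <- readIORef (prev n)
  mn <- readIORef (next n)
  maybe (return ()) (\p -> writeIORef (next p) mn) mp
  maybe (return ()) (\q -> writeIORef (prev q) mp) mn

main :: IO ()
main = do
  -- Build a two-node queue a <-> b, then unlink a.
  pa <- newIORef Nothing; na <- newIORef Nothing
  pb <- newIORef Nothing; nb <- newIORef Nothing
  let a = Node 'a' pa na
      b = Node 'b' pb nb
  writeIORef na (Just b)
  writeIORef pb (Just a)
  unlink a
  mp <- readIORef (prev b)
  print (maybe True (const False) mp)   -- b's prev pointer is now cleared
```

With a singly-linked queue, by contrast, removal must walk from the head to find the predecessor, which is what makes mass cleanup quadratic.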

On 04/03/2010 22:01, Neil Brown wrote:
Jesper Louis Andersen wrote:
On Thu, Mar 4, 2010 at 8:35 PM, Neil Brown
wrote: CML is indeed the library that has the most markedly different behaviour. In Haskell, the CML package manages to produce timings like this for fairly simple benchmarks:
%GC time 96.3% (96.0% elapsed)
I knew from reading the code that CML's implementation would do something like this, although I do wonder if it triggers some pathological case in the GC.
That result is peculiar. What are you doing to the library, and what do you expect happens? Since I have some code invested on top of CML, I'd like to gain a little insight if possible.
In trying to simplify my code, the added time has moved from GC time to EXIT time (and increased!). This shift isn't too surprising -- I believe the time is really spent trying to kill lots of threads. Here's my very simple benchmark; the main thread repeatedly chooses between receiving from two threads that are sending to it:
====
import Control.Concurrent
import Control.Concurrent.CML
import Control.Monad

main :: IO ()
main = do
  let numChoices = 2
  cs <- replicateM numChoices channel
  mapM_ forkIO [replicateM_ (100000 `div` numChoices) $ sync $ transmit c () | c <- cs]
  replicateM_ 100000 $ sync $ choose [receive c (const True) | c <- cs]
====
Compiling with -threaded, and running with +RTS -s, I get:
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    2.68s  (  3.56s elapsed)
  GC    time    1.84s  (  1.90s elapsed)
  EXIT  time   89.30s  ( 90.71s elapsed)
  Total time   93.82s  ( 96.15s elapsed)
FYI, I've now fixed this in my working branch:

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.88s  (  0.88s elapsed)
  GC    time    0.85s  (  1.04s elapsed)
  EXIT  time    0.05s  (  0.07s elapsed)
  Total time    1.78s  (  1.97s elapsed)

Cheers,
Simon
participants (3)
 - Jesper Louis Andersen
 - Neil Brown
 - Simon Marlow