Right approach to profiling and optimizing a concurrent data structure?

Happy New Year, all, I started what I thought would be a pretty straightforward project to implement a concurrent queue (with semantics like Chan) which I hoped would be faster, but the process of trying to measure and test performance has been super frustrating. I started with a really big criterion benchmark suite that ran through a bunch of Chan-like implementations as well as comparing different var primitives; I was compiling that with `-O2 -threaded` and running with +RTS -N (as that seemed realistic, and results were very consistent). Short version: at some point I realized I had (in my cabal config) enabled executable-profiling, which when disabled completely changed all timing and actually *hurt* performance. Then after a lot more head-banging I realized that +RTS -N seems to run on only one core when compiled with -prof (I didn't see that documented anywhere) although I could *force* the -prof version to use more with -N2, and so apparently for my tests[1], running on a single core just *happened* to be faster (I can see why it might; I probably can't expect a speedup when I'm just measuring throughput). I'd be interested in any comments on above, but mostly I'm trying to understand what my approach should be at this point; should I be benchmarking on 1 core and trying to maximize throughput? Should I also profile on just 1 core? How should I benchmark the effects of lots of contention and interpret the results? How can I avoid benchmarking arbitrary decisions of the thread scheduler, while still having my benchmarks be realistic? Are there any RTS flags or compile-time settings that I should *definitely* have on? Thanks for any clarity on this, Brandon http://brandon.si [1] Here's the test I used while most of the forehead-bloodying occurred, here using `Control.Concurrent.Chan`; for no combination of readers/writers/messages could I manage to get this going as fast on 2 cores as on the single-core bound -prof version runC :: Int -> Int -> Int -> IO () runC writers readers n = do let nNice = n - rem n (lcm writers readers) perReader = nNice `quot` readers perWriter = (nNice `quot` writers) vs <- replicateM readers newEmptyMVar c <- C.newChan let doRead = replicateM_ perReader $ theRead theRead = C.readChan c doWrite = replicateM_ perWriter $ theWrite theWrite = C.writeChan c (1 :: Int) mapM_ (\v-> forkIO (doRead >> putMVar v ())) vs replicateM writers $ forkIO $ doWrite mapM_ takeMVar vs -- await readers

First, be aware of https://ghc.haskell.org/trac/ghc/ticket/8453, which
causes programs compiled with -threaded and -prof to occasionally die with
an assertion failure (there are a few other, possibly related, tickets
about rts problems with -threaded and non-vanilla ways).
Next, define what you mean by "faster": more throughput? Lower latency?
Something else?
One approach is to build with profiling and try to optimize the functions
exposed by your API. You could do this on one core. The optimizations
you'd get from this would be generally useful, but they wouldn't be
optimizations to reduce contention.
To look into contention issues, I think the best way is to build with the
eventlog enabled and use threadscope. This will show pretty clearly where
threads are blocked, for how long, etc. I've also had success with timing
actions within my test executable and adding that information to the
eventlog with Debug.Trace.traceEventIO. Then you can see that information
within threadscope, or grep it out of the eventlog for extra processing
(min/max/mean, that sort of thing).
Running with -N1 can be faster because there essentially is no contention:
only a single Haskell thread will be executing at any given time. If -N1
is markedly faster than -N2 (as in, the runtime is longer to complete the
same amount of work), I would try debugging with Threadscope first.
One example of a test driver I used is
https://github.com/JohnLato/kickchan/blob/master/bench/bench_t3.hs .
(KickChan is similar to a bounded Chan, but heavily biased towards fast
writes)
I'd appreciate any further suggestions also.
John L.
On Tue, Jan 7, 2014 at 5:19 PM, Brandon Simmons wrote: Happy New Year, all, I started what I thought would be a pretty straightforward project to
implement a concurrent queue (with semantics like Chan) which I hoped would
be faster, but the process of trying to measure and test performance has
been super frustrating. I started with a really big criterion benchmark suite that ran through a
bunch of Chan-like implementations as well as comparing different var
primitives; I was compiling that with `-O2 -threaded` and running with
+RTS -N (as that seemed realistic, and results were very consistent). Short version: at some point I realized I had (in my cabal config) enabled
executable-profiling, which when disabled completely changed all timing and
actually *hurt* performance. Then after a lot more head-banging I realized
that +RTS -N seems to run on only one core when compiled with -prof (I
didn't see that documented anywhere) although I could *force* the -prof
version to use more with -N2, and so apparently for my tests[1], running on
a single core just *happened* to be faster (I can see why it might; I
probably can't expect a speedup when I'm just measuring throughput). I'd be interested in any comments on above, but mostly I'm trying to
understand what my approach should be at this point; should I be
benchmarking on 1 core and trying to maximize throughput? Should I also
profile on just 1 core? How should I benchmark the effects of lots of
contention and interpret the results? How can I avoid benchmarking
arbitrary decisions of the thread scheduler, while still having my
benchmarks be realistic? Are there any RTS flags or compile-time settings
that I should *definitely* have on? Thanks for any clarity on this,
Brandon
http://brandon.si [1] Here's the test I used while most of the forehead-bloodying occurred,
here using `Control.Concurrent.Chan`; for no combination of
readers/writers/messages could I manage to get this going as fast on 2
cores as on the single-core bound -prof version runC :: Int -> Int -> Int -> IO ()
runC writers readers n = do
let nNice = n - rem n (lcm writers readers)
perReader = nNice `quot` readers
perWriter = (nNice `quot` writers)
vs <- replicateM readers newEmptyMVar
c <- C.newChan
let doRead = replicateM_ perReader $ theRead
theRead = C.readChan c
doWrite = replicateM_ perWriter $ theWrite
theWrite = C.writeChan c (1 :: Int)
mapM_ (\v-> forkIO (doRead >> putMVar v ())) vs
replicateM writers $ forkIO $ doWrite
mapM_ takeMVar vs -- await readers _______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (2)
-
Brandon Simmons
-
John Lato