New subject: Right approach to profiling and optimizing a concurrent data structure?

7 Jan 2014

Happy New Year, all,

I started what I thought would be a pretty straightforward project to
implement a concurrent queue (with semantics like Chan) which I hoped would
be faster, but the process of trying to measure and test performance has
been super frustrating.

I started with a really big criterion benchmark suite that ran through a
bunch of Chan-like implementations as well as comparing different var
primitives; I was compiling that with `-O2 -threaded` and running with
`+RTS -N` (as that seemed realistic, and results were very consistent).

Short version: at some point I realized I had enabled executable-profiling
in my cabal config. Disabling it completely changed all the timings and
actually *hurt* performance. After a lot more head-banging I realized that
`+RTS -N` seems to run on only one core when compiled with -prof (I didn't
see that documented anywhere), although I could *force* the -prof version
to use more with -N2. So apparently, for my tests[1], running on a single
core just *happened* to be faster (I can see why it might; I probably can't
expect a speedup when I'm just measuring throughput).
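
One way to confirm how many cores a given build is really using is to have the program report it at startup. The following is only a minimal sketch: `describeRts` and `reportRts` are hypothetical helper names, not part of any library, while `getNumCapabilities`, `rtsSupportsBoundThreads` and `getNumProcessors` come from base.

```haskell
import Control.Concurrent (getNumCapabilities, rtsSupportsBoundThreads)
import GHC.Conc (getNumProcessors)

-- | Format the RTS settings that matter for a concurrency benchmark.
-- (Hypothetical helper, kept pure so it is easy to check.)
describeRts :: Bool -> Int -> Int -> String
describeRts threaded caps procs =
  "threaded RTS: " ++ show threaded
    ++ ", capabilities: " ++ show caps
    ++ ", processors: " ++ show procs

-- | Print what the running program actually got from the RTS.
reportRts :: IO ()
reportRts = do
  caps  <- getNumCapabilities   -- what -N resolved to at run time
  procs <- getNumProcessors     -- processors the RTS can see
  -- rtsSupportsBoundThreads is True when linked with -threaded
  putStrLn (describeRts rtsSupportsBoundThreads caps procs)
```

Calling `reportRts` as the first action of `main` in both the -prof and non-prof builds should show whether the two are running with the same number of capabilities before any timings are compared.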

I'd be interested in any comments on the above, but mostly I'm trying to
understand what my approach should be at this point; should I be
benchmarking on 1 core and trying to maximize throughput? Should I also
profile on just 1 core? How should I benchmark the effects of lots of
contention and interpret the results? How can I avoid benchmarking
arbitrary decisions of the thread scheduler, while still having my
benchmarks be realistic? Are there any RTS flags or compile-time settings
that I should *definitely* have on?

Thanks for any clarity on this,
Brandon
http://brandon.si

[1] Here's the test I used while most of the forehead-bloodying occurred,
using `Control.Concurrent.Chan`. For no combination of
readers/writers/messages could I get this going as fast on 2 cores as the
-prof version bound to a single core:

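import Control.Concurrent (forkIO)
import qualified Control.Concurrent.Chan as C
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (replicateM, replicateM_)

runC :: Int -> Int -> Int -> IO ()
runC writers readers n = do
  -- Round n down so it divides evenly among both readers and writers.
  let nNice = n - rem n (lcm writers readers)
      perReader = nNice `quot` readers
      perWriter = nNice `quot` writers
  vs <- replicateM readers newEmptyMVar -- one "done" signal per reader
  c <- C.newChan
  let doRead = replicateM_ perReader theRead
      theRead = C.readChan c
      doWrite = replicateM_ perWriter theWrite
      theWrite = C.writeChan c (1 :: Int)
  mapM_ (\v -> forkIO (doRead >> putMVar v ())) vs
  replicateM_ writers $ forkIO doWrite
  mapM_ takeMVar vs -- await readers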
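
To compare one core against several without rebuilding or changing RTS flags, the same action could be timed under different capability counts from inside one process. This is a rough sketch under several assumptions: `pingChan`, `timed` and `sweepCaps` are hypothetical names; `pingChan` is a simplified one-reader, one-writer stand-in for the test above; `GHC.Clock` needs a base newer than the one current when this was written; and `setNumCapabilities` only has an effect when linked with -threaded. A single wall-clock sample is noisy, so a real comparison would still want repeated runs (for example under criterion).

```haskell
import Control.Concurrent (forkIO, getNumCapabilities, setNumCapabilities)
import qualified Control.Concurrent.Chan as C
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM, replicateM_)
import GHC.Clock (getMonotonicTime)

-- | Hypothetical stand-in workload: one writer and one reader
-- exchanging n messages over a Chan.
pingChan :: Int -> IO ()
pingChan n = do
  c <- C.newChan
  done <- newEmptyMVar
  _ <- forkIO (replicateM_ n (C.readChan c) >> putMVar done ())
  _ <- forkIO (replicateM_ n (C.writeChan c (1 :: Int)))
  takeMVar done -- wait until the reader has consumed everything

-- | Wall-clock seconds taken by one run of an action.
timed :: IO () -> IO Double
timed act = do
  t0 <- getMonotonicTime
  act
  t1 <- getMonotonicTime
  return (t1 - t0)

-- | Time the action once per requested capability count, then
-- restore whatever count the program started with.
sweepCaps :: [Int] -> IO () -> IO [(Int, Double)]
sweepCaps counts act = do
  original <- getNumCapabilities
  results <- forM counts $ \k -> do
    setNumCapabilities k
    secs <- timed act
    return (k, secs)
  setNumCapabilities original
  return results
```

For example, `sweepCaps [1, 2] (pingChan 100000) >>= mapM_ print` prints one (capabilities, seconds) pair per setting, which makes a single-core versus two-core gap visible within one build.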
