Happy New Year, all,
I started what I thought would be a pretty straightforward project: implement a concurrent queue (with the same semantics as Chan) that I hoped would be faster. But the process of trying to measure and test its performance has been super frustrating.
I started with a really big criterion benchmark suite that ran through a bunch of Chan-like implementations, as well as comparing different var primitives; I was compiling it with `-O2 -threaded` and running with +RTS -N (that seemed realistic, and the results were very consistent).
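(For the curious, a stripped-down sketch of what one benchmark in that suite might look like, wired up to the runC test from [1] below; the group/bench names and message counts here are just illustrative, not my actual suite:)

import Criterion.Main (defaultMain, bgroup, bench, whnfIO)

main :: IO ()
main = defaultMain
    [ bgroup "Chan"
        [ bench "2 writers / 2 readers / 100k msgs" $ whnfIO $ runC 2 2 100000
        , bench "1 writer / 1 reader / 100k msgs"   $ whnfIO $ runC 1 1 100000
        ]
    ]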
Short version: at some point I realized I had enabled executable-profiling in my cabal config. Disabling it completely changed all my timings and actually *hurt* performance. Then, after a lot more head-banging, I realized that +RTS -N seems to run on only one core when the program is compiled with -prof (I didn't see that documented anywhere), although I could *force* the -prof version to use more cores with -N2. So apparently, for my tests[1], running on a single core just *happened* to be faster (I can see why it might; I probably can't expect a speedup when I'm just measuring throughput).
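(A cheap sanity check here, assuming a GHC new enough to export getNumCapabilities from Control.Concurrent: print it at startup so you can see how many capabilities the RTS actually got, rather than how many you asked for:)

import Control.Concurrent (getNumCapabilities)

main :: IO ()
main = do
    -- report how many capabilities the RTS is really running with
    caps <- getNumCapabilities
    putStrLn $ "RTS capabilities: " ++ show caps
    -- ... run the actual benchmark here ...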
I'd be interested in any comments on the above, but mostly I'm trying to figure out what my approach should be at this point. Should I be benchmarking on 1 core and trying to maximize throughput? Should I also profile on just 1 core? How should I benchmark the effects of lots of contention and interpret the results? How can I avoid benchmarking arbitrary decisions of the thread scheduler, while still keeping my benchmarks realistic? Are there any RTS flags or compile-time settings that I should *definitely* have on?
Thanks for any clarity on this,
Brandon
[1] Here's the test I used while most of the forehead-bloodying occurred, using `Control.Concurrent.Chan`; for no combination of readers/writers/messages could I get it to run as fast on 2 cores as the single-core-bound -prof version did:
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (replicateM, replicateM_)
import qualified Control.Concurrent.Chan as C

-- Push ~n messages through one Chan shared by 'writers' writer threads and
-- 'readers' reader threads, and block until the readers have drained it.
runC :: Int -> Int -> Int -> IO ()
runC writers readers n = do
    -- round n down so it divides evenly among both readers and writers
    let nNice     = n - rem n (lcm writers readers)
        perReader = nNice `quot` readers
        perWriter = nNice `quot` writers
    vs <- replicateM readers newEmptyMVar
    c  <- C.newChan
    let doRead   = replicateM_ perReader theRead
        theRead  = C.readChan c
        doWrite  = replicateM_ perWriter theWrite
        theWrite = C.writeChan c (1 :: Int)
    mapM_ (\v -> forkIO (doRead >> putMVar v ())) vs
    replicateM_ writers $ forkIO doWrite
    mapM_ takeMVar vs -- await readers
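I've been driving that with a trivial wrapper, something like the sketch below (the particular numbers are just an example), compiled with -O2 -threaded and then run with +RTS -N1 vs +RTS -N2 to compare:

main :: IO ()
main = runC 2 2 1000000  -- 2 writers, 2 readers, ~1M messages (example numbers)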