
Happy New Year, all,

I started what I thought would be a pretty straightforward project: implement a concurrent queue (with semantics like Chan) that I hoped would be faster. But the process of trying to measure and test its performance has been super frustrating.

I started with a really big criterion benchmark suite that ran through a bunch of Chan-like implementations, as well as comparing different var primitives. I was compiling with `-O2 -threaded` and running with `+RTS -N` (that seemed realistic, and the results were very consistent).

Short version: at some point I realized I had enabled executable-profiling in my cabal config, and disabling it completely changed all the timings and actually *hurt* performance. Then, after a lot more head-banging, I realized that `+RTS -N` seems to run on only one core when the executable is compiled with -prof (I didn't see that documented anywhere), although I could *force* the -prof version onto more cores with `-N2`. So apparently, for my tests[1], running on a single core just *happened* to be faster (I can see why it might; I probably can't expect a speedup when I'm just measuring throughput).

I'd be interested in any comments on the above, but mostly I'm trying to understand what my approach should be at this point:

- Should I be benchmarking on 1 core and trying to maximize throughput?
- Should I also profile on just 1 core?
- How should I benchmark the effects of lots of contention, and how should I interpret the results?
- How can I avoid benchmarking arbitrary decisions of the thread scheduler, while still having my benchmarks be realistic?
- Are there any RTS flags or compile-time settings that I should *definitely* have on?
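(Editor's note: one sanity check that would have surfaced the -prof/-N surprise early, sketched here as an assumption rather than something from the original post: base's `Control.Concurrent` exports `getNumCapabilities`, so a benchmark driver can report how many capabilities the RTS actually started with before any numbers are collected.)

```haskell
import Control.Concurrent (getNumCapabilities)

-- Print the number of RTS capabilities (i.e. what +RTS -N actually gave
-- us) before benchmarking, so a silently single-core run is obvious.
main :: IO ()
main = do
  caps <- getNumCapabilities
  putStrLn $ "Running with " ++ show caps ++ " capabilities"
```

A run that was compiled with -prof and launched with a bare `+RTS -N` would then announce "Running with 1 capabilities" instead of quietly producing single-core timings.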
Thanks for any clarity on this,
Brandon
http://brandon.si

[1] Here's the test I used while most of the forehead-bloodying occurred, here using `Control.Concurrent.Chan`; for no combination of readers/writers/messages could I manage to get this going as fast on 2 cores as on the single-core-bound -prof version:

```haskell
import Control.Concurrent
import Control.Monad
import qualified Control.Concurrent.Chan as C

runC :: Int -> Int -> Int -> IO ()
runC writers readers n = do
  -- round n down so it divides evenly among both readers and writers
  let nNice     = n - rem n (lcm writers readers)
      perReader = nNice `quot` readers
      perWriter = nNice `quot` writers
  vs <- replicateM readers newEmptyMVar
  c  <- C.newChan
  let doRead   = replicateM_ perReader theRead
      theRead  = C.readChan c
      doWrite  = replicateM_ perWriter theWrite
      theWrite = C.writeChan c (1 :: Int)
  mapM_ (\v -> forkIO (doRead >> putMVar v ())) vs
  replicateM_ writers $ forkIO doWrite
  mapM_ takeMVar vs -- await readers
```
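(Editor's note: for readers who want to reproduce the throughput numbers without the full criterion suite, here is a minimal, base-only wall-clock sketch in the same spirit as runC; the one-writer/one-reader shape and the names `msgs`/`done` are mine, not Brandon's. `GHC.Clock.getMonotonicTime` requires base >= 4.11.)

```haskell
import Control.Concurrent
import Control.Concurrent.Chan
import Control.Monad
import GHC.Clock (getMonotonicTime)

-- Push msgs Ints through a Chan with one writer and one reader,
-- and report wall-clock throughput in messages per second.
main :: IO ()
main = do
  let msgs = 100000 :: Int
  c    <- newChan
  done <- newEmptyMVar
  t0 <- getMonotonicTime
  _ <- forkIO $ replicateM_ msgs (writeChan c (1 :: Int))
  _ <- forkIO $ replicateM_ msgs (readChan c) >> putMVar done ()
  takeMVar done
  t1 <- getMonotonicTime
  putStrLn $ "throughput: "
          ++ show (fromIntegral msgs / (t1 - t0) :: Double)
          ++ " msgs/sec"
```

Compiled `-O2 -threaded` and run with and without `+RTS -N2`, this gives a quick way to compare the single-core and multi-core throughput the post describes.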