First, be aware of https://ghc.haskell.org/trac/ghc/ticket/8453, which causes programs compiled with -threaded and -prof to occasionally die with an assertion failure (there are a few other, possibly related, tickets about RTS problems with -threaded and non-vanilla ways).
Next, define what you mean by "faster": higher throughput? Lower latency? Something else?
One approach is to build with profiling and try to optimize the functions exposed by your API. You can do this on a single core. The optimizations you'd get from this are generally useful, but they won't do anything to reduce contention between threads.
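As a rough sketch of that profiling workflow (the function names here are illustrative, not from any particular API): build with -prof, mark or auto-generate cost centres, and run with +RTS -p to get a time/allocation report for the exposed functions.

```haskell
-- A minimal profiling sketch. Build and run with something like:
--   ghc -O2 -prof -fprof-auto Main.hs
--   ./Main +RTS -p        -- writes a Main.prof cost-centre report
-- With -fprof-auto, GHC adds cost centres to all top-level bindings;
-- an explicit SCC pragma lets you name a specific hot spot, as below.
import Data.List (foldl')

-- A stand-in for a function your API exposes.
sumSquares :: [Int] -> Int
sumSquares xs = {-# SCC "sumSquares" #-} foldl' (\acc x -> acc + x * x) 0 xs

main :: IO ()
main = print (sumSquares [1 .. 100000])
```

The .prof output then attributes time and allocation to "sumSquares", which tells you where single-threaded optimization effort is best spent.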
To look into contention issues, I think the best approach is to build with the eventlog enabled and use ThreadScope. This shows pretty clearly where threads are blocked, for how long, and so on. I've also had success timing actions within my test executable and adding that information to the eventlog with Debug.Trace.traceEventIO. You can then see that information within ThreadScope, or grep it out of the eventlog for extra processing (min/max/mean, that sort of thing).
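A small sketch of the traceEventIO technique (the wrapper name is my own, not a library function): emit a marker before and after the action you care about, and the pair shows up as custom events in the eventlog.

```haskell
-- Build and run with something like:
--   ghc -O2 -threaded -eventlog Main.hs
--   ./Main +RTS -N2 -l     -- writes Main.eventlog for ThreadScope
import Debug.Trace (traceEventIO)

-- Bracket an IO action with START/STOP markers; the label makes the
-- events easy to grep out of the eventlog afterwards.
timedAction :: String -> IO a -> IO a
timedAction label act = do
  traceEventIO ("START " ++ label)
  r <- act
  traceEventIO ("STOP " ++ label)
  return r

main :: IO ()
main = timedAction "demo-work" (print (sum [1 .. 1000000 :: Int]))
```

Pairing each START with its STOP gives you per-action durations, from which you can compute min/max/mean outside ThreadScope if you want.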
Running with -N1 can be faster because there is essentially no contention: only a single Haskell thread executes at any given time. If -N1 is markedly faster than -N2 (as in, the same amount of work takes longer to complete with two capabilities), I would try debugging with ThreadScope first.
I'd appreciate any further suggestions also.
John L.