
jed:
On Sun 2008-08-24 11:03, Thomas M. DuBuisson wrote:
Yay, the multicore version pays off when the workload is non-trivial. CPU utilization is still rather low for the -N2 case (70%). I think the Haskell threads have an affinity for certain OS threads (and thus a CPU). Perhaps it results in a CPU having both tokens of work and the other having none?
This must be obvious to everyone, but the original thread-ring cannot possibly be faster with multiple OS threads, since a thread can only run while it holds the token; otherwise it is just blocked waiting for the token. If threads execute simultaneously on different CPUs, the token must at least be written to the shared cache, if not to main memory. With the single-threaded runtime, the token may never leave L1. The difference between -threaded -N1 and the non-threaded runtime may be influenced by how effectively the next thread is prefetched (since presumably not all 503 threads can reside in L1).
Simon Marlow sez: The thread-ring benchmark needs careful scheduling to get a speedup on multiple CPUs. I was only able to get a speedup by explicitly locking half of the ring onto each CPU. You can do this using GHC.Conc.forkOnIO in GHC 6.8.x, and you'll also need +RTS -qm -qw. Also make sure that you're not using the main thread for any part of the main computation, because the main thread is a bound thread and runs in its own OS thread, so communication between the main thread and any other thread is slow.
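For anyone who wants to try this, here is a minimal sketch of the pinning described above. It is not the benchmark entry itself, and it makes some assumptions: it uses forkOn from Control.Concurrent, which is the name newer GHC releases give to GHC.Conc.forkOnIO, and the names runRing, worker, and capFor are made up for illustration. The first half of the ring is pinned to capability 0 and the second half to capability 1, and the main thread only injects the token and waits for the result.

```haskell
import Control.Concurrent (forkOn)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (replicateM)

-- | Pass a token around a ring of 'ringSize' threads (assumed >= 1),
-- decrementing it on every hop. Returns the 1-based index of the thread
-- that receives the token once it has reached 0.
runRing :: Int -> Int -> IO Int
runRing ringSize hops = do
  boxes <- replicateM ringSize newEmptyMVar   -- one inbox per ring thread
  done  <- newEmptyMVar                       -- carries the final answer to main
  let outboxes = drop 1 boxes ++ take 1 boxes -- thread i writes to thread i+1's inbox
      half     = ringSize `div` 2
      -- Assumption: two capabilities (+RTS -N2). forkOn interprets the
      -- number modulo the capability count, so -N1 still works.
      capFor i = if i <= half then 0 else 1 :: Int
      worker i inbox outbox = do
        n <- takeMVar inbox
        if n == 0
          then putMVar done i
          else putMVar outbox (n - 1) >> worker i inbox outbox
  -- Every ring thread is forked and pinned; none of them is the main thread.
  sequence_
    [ forkOn (capFor i) (worker i inbox outbox)
    | (i, inbox, outbox) <- zip3 [1 :: Int ..] boxes outboxes
    ]
  mapM_ (\first -> putMVar first hops) (take 1 boxes)  -- hand the token to thread 1
  takeMVar done

main :: IO ()
main = runRing 503 1000000 >>= print
```

Build with -threaded and run with +RTS -N2 to get two capabilities. The -qm -qw flags are the ones given for 6.8.x; check the RTS options of your GHC before relying on them. With this layout the token crosses between CPUs only twice per lap, at the two boundaries between the halves.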