
Hmm thanks, that's interesting -- I was thinking it was probably caused by OS X, but it appears to happen on Linux too. Could you try running the old code as well, and see whether you get the same order-of-magnitude slowdown?
The original program on my Linux 2.6.26 Core2 Duo:

[tom@myhost Test]$ time ./tr-threaded 1000000
37

real 0m0.635s
user 0m0.530s
sys 0m0.077s
[tom@myhost Test]$ time ./tr-nothreaded 1000000
37

real 0m0.352s
user 0m0.350s
sys 0m0.000s
[tom@myhost Test]$ time ./tr-threaded 1000000 +RTS -N2
37

real 0m13.954s
user 0m4.333s
sys 0m5.736s

--------------------------

Seeing as there was obviously still not enough computation to justify the OS threads in my last example, I made a test where it hashed a 32 byte string (show . md5 . encode $ val):

[tom@myhost Test]$ time ./threadring-nothreaded 1000000
50
552

real 0m1.408s
user 0m1.323s
sys 0m0.083s
[tom@myhost Test]$ time ./threadring-threaded 1000000
50
552

real 0m1.948s
user 0m1.807s
sys 0m0.143s
[tom@myhost Test]$ time ./threadring-threaded 1000000 +RTS -N2
552
50

real 0m1.663s
user 0m1.427s
sys 0m0.237s
[tom@myhost Test]$

---------------------------

Seeing as this still doesn't beat the old RTS, I decided to increase the per-unit work a little more. This code will hash 10KB every time the token is passed / decremented.

[tom@myhost Test]$ time ./threadring-nothreaded 100000
(308,77851ef5e9e781c04850a7df9cc855d2)

real 2m56.453s
user 2m55.399s
sys 0m0.457s
[tom@myhost Test]$ time ./threadring-threaded 100000
(308,77851ef5e9e781c04850a7df9cc855d2)

real 3m6.430s
user 3m5.868s
sys 0m0.460s
[tom@myhost Test]$ time ./threadring-threaded 100000 +RTS -N2
(810,77851ef5e9e781c04850a7df9cc855d2)
(308,77851ef5e9e781c04850a7df9cc855d2)

real 1m55.616s
user 2m47.982s
sys 0m3.586s

* Yes, I notice it's exiting before the output gets printed a couple of times, oh well.

-------------------------

REFLECTION

Yay, the multicore version pays off when the workload is non-trivial. CPU utilization is still rather low for the -N2 case (70%). I think the Haskell threads have an affinity for certain OS threads (and thus a CPU). Perhaps that leaves one CPU holding both tokens of work while the other has none?
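
For anyone who wants to reproduce the shape of this experiment, here is a rough sketch of a token ring with a tunable amount of work per hop. It is not the code measured above: the names (work, member, runRing), the ring size and the hop count are all made up for illustration, it circulates a single token rather than two, and the per-hop work is a cheap xor/multiply fold standing in for md5 so that it builds with base alone.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)
import Control.Monad (forM_, forever, replicateM)
import Data.Bits (xor)
import Data.List (foldl')

-- Stand-in for the per-hop hashing. This is NOT md5, just a strict
-- xor/multiply fold over `size` integers so the cost per hop is tunable.
work :: Int -> Int -> Int
work size seed = foldl' step seed [1 .. size]
  where
    step h x = (h `xor` x) * 16777619

-- One ring member: take the token, do the work, then either report the
-- result (token exhausted) or pass the decremented token to the neighbour.
member :: Int -> Int -> MVar (Int, Int) -> MVar (Int, Int) -> MVar (Int, Int) -> IO ()
member size ident inbox next done = forever $ do
  (token, acc) <- takeMVar inbox
  acc' <- evaluate (work size acc) -- force the work in this thread
  if token == 0
    then putMVar done (ident, acc')
    else putMVar next (token - 1, acc')

-- Build a ring of n threads and send one token around for `hops` hops.
-- Returns the id of the thread that received token 0 and the final value.
runRing :: Int -> Int -> Int -> IO (Int, Int)
runRing n size hops = do
  boxes <- replicateM n newEmptyMVar
  done <- newEmptyMVar
  forM_ (zip3 [1 ..] boxes (drop 1 (cycle boxes))) $ \(i, inbox, next) ->
    forkIO (member size i inbox next done)
  case boxes of
    [] -> error "ring needs at least one thread"
    (first : _) -> putMVar first (hops, 0)
  takeMVar done -- main waits here, so the result is always printed

main :: IO ()
main = runRing 50 10000 1000 >>= print -- arbitrary illustrative parameters
```

With a recent GHC this should build with something like `ghc -O2 -threaded -rtsopts Ring.hs` and run with `./Ring +RTS -N2`; raising the second argument of runRing makes each hop more expensive, which is the knob the experiments above were turning. Because main blocks on the done MVar, this version should not show the early-exit behaviour noted above.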