
Lo guys, I thought you'd like to know about this result. I've been playing with the Debian language shootout programs under OS X, looking at how fast Haskell code is compared to C on OS X rather than Linux. Interestingly, Haskell comes out rather better on OS X than it did on Linux. Here are my results (times in seconds):

                     C       Haskell   Haskell/C   C/Haskell
Binary Trees       6.842      1.228      0.179       5.572
Fannkuch           5.683     15.730      2.768       0.361
Mandelbrot         1.183      2.287      1.933       0.517
nbody             10.275     16.219      1.578       0.634
nsieve             0.167      0.253      1.515       0.660
nsieve-bits        0.471      0.713      1.514       0.661
partial sums       1.047      1.313      1.254       0.797
pidigits           1.238      1.400      1.131       0.884
recursive          1.554      3.594      2.313       0.432
spectral-norm     27.939     19.165      0.686       1.458
threadring        91.284      1.389      0.015      65.719
----------------------------------------------------------
Averages                                 1.353       0.739

Some notes:

Hardware: 2GHz Core 2 Duo, enough RAM not to worry about paging.

Some programs are not included, because the C code produced compile errors. The Haskell code appears to be portable in *all* cases.

The average slowdown for running Haskell is only about 1.35x on OS X! That's pretty damn good. I'm sure some people will say "yeah, but you have to optimise your code pretty heavily to get that kind of result". Interestingly, the programs with the biggest speed advantage over C here are also the most "naïvely" written ones.

The thing that seems to make C slower is the implementation of malloc on OS X. This has a couple of implications. First, if Apple replaced the malloc library, it would probably push the result back to the 1.7x slowdown we see under Linux. Second, Fannkuch can probably be optimised differently for Haskell to look even better; at the moment, the Haskell code actually uses malloc!

Finally, that threading example... WOW! 65 times faster, and the code is *really* simple. The C, on the other hand, is a massive mess.

Bob

2008/8/23 Thomas Davie
Finally, that threading example... WOW! 65 times faster, and the code is *really* simple. The C on the other hand is a massive mess.
I've been wondering about this, but I can't check because I don't have a multi-core CPU. I've heard GHC's single-threaded runtime is very, very good. What are the results for the threading example when compiled with -threaded and run with at least +RTS -N2?

Luke

On 23 Aug 2008, at 20:01, Luke Palmer wrote:
2008/8/23 Thomas Davie
: Finally, that threading example... WOW! 65 times faster, and the code is *really* simple. The C on the other hand is a massive mess.
I've been wondering about this, but I can't check because I don't have a multi core cpu. I've heard GHC's single threaded runtime is very very good. What are the results for the threading example when compiled with -threaded and run with at least +RTS -N2?
That's really interesting -- I just tried this.

Compiled without -threaded: 1.289 seconds
Compiled with -threaded, but run without -N2: 3.403 seconds
Compiled with -threaded, run with -N2: 55.072 seconds

Wow! Haskell's runtime really is a *lot* better than trying to use operating system threads. I wonder if there's a point at which it becomes better to use both CPUs, or if the overhead of using OS threads for this problem is just too high.

Bob
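For readers who want a feel for the pattern being timed, here is a miniature two-thread version of the token-passing idiom. This is a hedged sketch, not the shootout program itself; the names `runBounce` and `bounce` are mine:

```haskell
import Control.Concurrent
import Control.Concurrent.MVar

-- Two threads pass a decrementing token back and forth; whoever sees 0
-- wins. This is the thread-ring benchmark shrunk to a ring of size two.
runBounce :: Int -> IO String
runBounce n = do
    a    <- newMVar n      -- thread A's inbox, seeded with the token
    b    <- newEmptyMVar   -- thread B's inbox
    done <- newEmptyMVar   -- the winner announces itself here
    _ <- forkIO (bounce "A" a b done)
    _ <- forkIO (bounce "B" b a done)
    takeMVar done

bounce :: String -> MVar Int -> MVar Int -> MVar String -> IO ()
bounce name inBox outBox done = loop
  where
    loop = do
        m <- takeMVar inBox
        if m == 0
            then putMVar done name
            else putMVar outBox (m - 1) >> loop

main :: IO ()
main = runBounce 10 >>= putStrLn   -- after 10 hops, "A" sees the 0
```

Because every hop is serialised through an MVar, the result is deterministic regardless of scheduling, which is also why extra OS threads cannot help: at most one thread is ever runnable.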

Recently I wrote a computation-intensive program that could easily utilize both cores. However, there was overhead just from compiling with -threaded and making some forkIOs. Still, the overhead was no larger than 50%, and with 4 cores I would probably still get the results faster; I didn't experience an order-of-magnitude slowdown. Perhaps it's an issue with OS X.
On Sat, Aug 23, 2008 at 21:19, Thomas Davie
On 23 Aug 2008, at 20:01, Luke Palmer wrote:
2008/8/23 Thomas Davie
: Finally, that threading example... WOW! 65 times faster, and the code is *really* simple. The C on the other hand is a massive mess.
I've been wondering about this, but I can't check because I don't have a multi core cpu. I've heard GHC's single threaded runtime is very very good. What are the results for the threading example when compiled with -threaded and run with at least +RTS -N2?
That's really interesting -- I just tried this.
Compiled without -threaded: 1.289 seconds
Compiled with -threaded, but run without -N2: 3.403 seconds
Compiled with -threaded, run with -N2: 55.072 seconds
Wow! Haskell's runtime really is a *lot* better than trying to use operating system threads. I wonder if there's a point at which it becomes better to use both CPUs, or if the overhead of using OS threads for that problem is just too high.
Bob

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

On 2008 Aug 23, at 18:34, Krzysztof Skrzętnicki wrote:
Recently I wrote a computation-intensive program that could easily utilize both cores. However, there was overhead just from compiling with -threaded and making some forkIOs. Still, the overhead was no larger than 50%, and with 4 cores I would probably still get the results faster; I didn't experience an order-of-magnitude slowdown. Perhaps it's an issue with OS X.
All that's needed for multicore to be a *lot* slower is doing it wrong. Make sure you're forcing the right things in the right places, or you could quietly be building up thunks on both cores that will cause lots of cross-core signaling or locking. And, well, make sure the generated code isn't stupid. Quite possibly the PPC code is an order of magnitude worse than the better-tested Intel code.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell]     allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats]     allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university        KF8NH
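Brandon's "force the right things in the right places" advice can be illustrated with base-only primitives. A hedged sketch (the `sendSum` name is mine, not from the thread), using `$!` so the producing thread evaluates the result before handing it over:

```haskell
import Control.Concurrent
import Control.Concurrent.MVar
import Data.List (foldl')

-- Without the $!, putMVar would hand the consumer an unevaluated thunk,
-- quietly moving the real work (and its heap traffic) onto the reader's
-- core -- exactly the cross-core churn Brandon warns about.
sendSum :: MVar Int -> [Int] -> IO ()
sendSum box xs = putMVar box $! foldl' (+) 0 xs

main :: IO ()
main = do
    box <- newEmptyMVar
    _ <- forkIO (sendSum box [1 .. 1000000])
    takeMVar box >>= print   -- 500000500000
```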

On 24 Aug 2008, at 01:26, Brandon S. Allbery KF8NH wrote:
On 2008 Aug 23, at 18:34, Krzysztof Skrzętnicki wrote:
Recently I wrote a computation-intensive program that could easily utilize both cores. However, there was overhead just from compiling with -threaded and making some forkIOs. Still, the overhead was no larger than 50%, and with 4 cores I would probably still get the results faster; I didn't experience an order-of-magnitude slowdown. Perhaps it's an issue with OS X.
All that's needed for multicore to be a *lot* slower is doing it wrong. Make sure you're forcing the right things in the right places, or you could quietly be building up thunks on both cores that will cause lots of cross-core signaling or locking. And, well, make sure the generated code isn't stupid. Quite possibly the PPC code is an order of magnitude worse than the better-tested Intel code.
Except that the test was running on a Core 2 Duo, and it runs very fast when GHC does the threading on one core. My personal guess is that doing it properly threaded requires *lots* of kernel boundary crossings to do the locking etc. on OS X (being a nearly-micro-kernel). The test program was almost 100% made up of thread-locking code.

Bob

That's really interesting -- I just tried this.
Compiled without -threaded: 1.289 seconds
Compiled with -threaded, but run without -N2: 3.403 seconds
Compiled with -threaded, run with -N2: 55.072 seconds
I was hoping to see a relative improvement when introducing an opportunity for parallelism in the program, so I made a version with two MVars filled at the start. This didn't work out, though; perhaps some performance stands to be gained by improving the GHC scheduler wrt CPU / OS thread affinity for the Haskell threads?

For the curious:

-O2: 7.3 seconds (CPU: 99.7% user)
-O2 -threaded: 11.5 seconds (CPU: 95% user, 5% system)
-O2 -threaded ... +RTS -N2: killed after 3 minutes (CPUs: 15% user, 20% system)

That's quite a lot of CPU time going to the system.

Specs: Linux 2.6.26 (Arch) x86_64, Intel Core 2 Duo 2.5GHz

SOURCE

Notice the threads still exist when a value of 0 is seen. This is OK, as the other value will be terminating ringsize threads earlier and this thread will never be needed again.

import Control.Monad
import Control.Concurrent
import System.Environment

ring = 503

new l i = do
    r <- newEmptyMVar
    forkIO (thread i l r)
    return r

thread :: Int -> MVar Int -> MVar Int -> IO ()
thread i l r = go
  where
    go = do
        m <- takeMVar l
        when (m == 1) (print i)
        putMVar r $! m - 1
        when (m > 0) go

main = do
    val <- return . read . head =<< getArgs
    a <- newMVar val
    b <- newMVar val
    y <- foldM new a [2..ring]
    forkIO $ thread (ring+1) y b
    z <- foldM new b [(ring + 2)..(ring + ring)]
    thread 1 z a

On 24 Aug 2008, at 06:31, Thomas M. DuBuisson wrote:
That's really interesting -- I just tried this.
Compiled without -threaded: 1.289 seconds
Compiled with -threaded, but run without -N2: 3.403 seconds
Compiled with -threaded, run with -N2: 55.072 seconds
I was hoping to see a relative improvement when introducing an opportunity for parallelism in the program, so I made a version with two MVars filled at the start. This didn't work out, though; perhaps some performance stands to be gained by improving the GHC scheduler wrt CPU / OS thread affinity for the Haskell threads?
For the curious:
-O2: 7.3 seconds (CPU: 99.7% user)
-O2 -threaded: 11.5 seconds (CPU: 95% user, 5% system)
-O2 -threaded ... +RTS -N2: killed after 3 minutes (CPUs: 15% user, 20% system)
That's quite a lot of CPU time going to the system.
Specs: Linux 2.6.26 (Arch) x86_64 Intel Core 2 Duo 2.5GHz
Hmm, thanks, that's interesting -- I was thinking it was probably caused by OS X, but it appears to happen on Linux too. Could you try running the old code as well, and see if you experience the order-of-magnitude slowdown too?

Bob

Hmm, thanks, that's interesting -- I was thinking it was probably caused by OS X, but it appears to happen on Linux too. Could you try running the old code as well, and see if you experience the order-of-magnitude slowdown too?
The original program on my Linux 2.6.26 Core 2 Duo:

[tom@myhost Test]$ time ./tr-threaded 1000000
37

real    0m0.635s
user    0m0.530s
sys     0m0.077s

[tom@myhost Test]$ time ./tr-nothreaded 1000000
37

real    0m0.352s
user    0m0.350s
sys     0m0.000s

[tom@myhost Test]$ time ./tr-threaded 1000000 +RTS -N2
37

real    0m13.954s
user    0m4.333s
sys     0m5.736s

--------------------------

Seeing as there still was obviously not enough computation to justify the OS threads in my last example, I made a test where it hashed a 32-byte string (show . md5 . encode $ val):

[tom@myhost Test]$ time ./threadring-nothreaded 1000000
50
552

real    0m1.408s
user    0m1.323s
sys     0m0.083s

[tom@myhost Test]$ time ./threadring-threaded 1000000
50
552

real    0m1.948s
user    0m1.807s
sys     0m0.143s

[tom@myhost Test]$ time ./threadring-threaded 1000000 +RTS -N2
552
50

real    0m1.663s
user    0m1.427s
sys     0m0.237s

---------------------------

Seeing as this still doesn't beat the old RTS, I decided to increase the per-unit work a little more. This code will hash 10KB every time the token is passed / decremented.

[tom@myhost Test]$ time ./threadring-nothreaded 100000
(308,77851ef5e9e781c04850a7df9cc855d2)

real    2m56.453s
user    2m55.399s
sys     0m0.457s

[tom@myhost Test]$ time ./threadring-threaded 100000
(308,77851ef5e9e781c04850a7df9cc855d2)

real    3m6.430s
user    3m5.868s
sys     0m0.460s

[tom@myhost Test]$ time ./threadring-threaded 100000 +RTS -N2
(810,77851ef5e9e781c04850a7df9cc855d2)
(308,77851ef5e9e781c04850a7df9cc855d2)

real    1m55.616s
user    2m47.982s
sys     0m3.586s

* Yes, I notice it's exiting before the output gets printed a couple of times, oh well.

-------------------------

REFLECTION

Yay, the multicore version pays off when the workload is non-trivial. CPU utilization is still rather low for the -N2 case (70%). I think the Haskell threads have an affinity for certain OS threads (and thus a CPU). Perhaps it results in one CPU having both tokens of work and the other having none?
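Tom's per-token hashing relies on an MD5 package; as a self-contained illustration of the same idea, here is a hedged sketch where a cheap pure `work` function (my invention, not Tom's code) plays the role of the hash, forced on the sender's side so each hop really carries its computation:

```haskell
import Control.Concurrent
import Control.Concurrent.MVar
import Data.List (foldl')

-- Cheap deterministic stand-in for hashing: enough arithmetic per token
-- pass that synchronisation cost stops dominating the benchmark.
work :: Int -> Int
work n = foldl' (\acc i -> (acc * 31 + i) `mod` 1000003) n [1 .. 10000]

-- Force the work on the sending thread's core before passing the token on.
passToken :: MVar Int -> MVar Int -> IO ()
passToken inBox outBox = do
    m <- takeMVar inBox
    putMVar outBox $! work m

main :: IO ()
main = do
    a <- newMVar 1
    b <- newEmptyMVar
    _ <- forkIO (passToken a b)
    takeMVar b >>= print
```

The bigger the per-hop `work`, the smaller the relative cost of the MVar handoff, which matches the pattern in the timings above.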

On Sun 2008-08-24 11:03, Thomas M. DuBuisson wrote:
Yay, the multicore version pays off when the workload is non-trivial. CPU utilization is still rather low for the -N2 case (70%). I think the Haskell threads have an affinity for certain OS threads (and thus a CPU). Perhaps it results in a CPU having both tokens of work and the other having none?
This must be obvious to everyone, but the original thread-ring cannot possibly be faster with multiple OS threads, since a thread can only be running if it has the token; otherwise it is just blocked on the token. If there are threads executing simultaneously, the token must at least be written to the shared cache, if not to main memory. With the single-threaded runtime, the token may never leave L1. The difference between -threaded -N1 and -nothreaded may be influenced by the effectiveness of prefetching the next thread (since presumably not all 503 threads can reside in L1).

Jed

jed:
On Sun 2008-08-24 11:03, Thomas M. DuBuisson wrote:
Yay, the multicore version pays off when the workload is non-trivial. CPU utilization is still rather low for the -N2 case (70%). I think the Haskell threads have an affinity for certain OS threads (and thus a CPU). Perhaps it results in a CPU having both tokens of work and the other having none?
This must be obvious to everyone, but the original thread-ring cannot possibly be faster with multiple OS threads, since a thread can only be running if it has the token; otherwise it is just blocked on the token. If there are threads executing simultaneously, the token must at least be written to the shared cache, if not to main memory. With the single-threaded runtime, the token may never leave L1. The difference between -threaded -N1 and -nothreaded may be influenced by the effectiveness of prefetching the next thread (since presumably not all 503 threads can reside in L1).
Simon Marlow sez: The thread-ring benchmark needs careful scheduling to get a speedup on multiple CPUs. I was only able to get a speedup by explicitly locking half of the ring onto each CPU. You can do this using GHC.Conc.forkOnIO in GHC 6.8.x, and you'll also need +RTS -qm -qw. Also make sure that you're not using the main thread for any part of the main computation, because the main thread is a bound thread and runs in its own OS thread, so communication between the main thread and any other thread is slow.

dons:
Simon Marlow sez:
The thread-ring benchmark needs careful scheduling to get a speedup on multiple CPUs. I was only able to get a speedup by explicitly locking half of the ring onto each CPU. You can do this using GHC.Conc.forkOnIO in GHC 6.8.x, and you'll also need +RTS -qm -qw.
Also make sure that you're not using the main thread for any part of the main computation, because the main thread is a bound thread and runs in its own OS thread, so communication between the main thread and any other thread is slow.
I had to see the results for myself :-)

old RTS: 0m54.296s
threaded RTS (-N1): 0m56.839s
threaded RTS (-N2): 0m52.623s

Wow! 3x the performance for a simple change. Frustrating that there isn't a portable/standard way to express this. Also frustrating that the threaded version doesn't improve on the situation (utilization is back at 50%).

Anyway, that was a fun micro-benchmark to play with.

Tom

thomas.dubuisson:
dons:
Simon Marlow sez:
The thread-ring benchmark needs careful scheduling to get a speedup on multiple CPUs. I was only able to get a speedup by explicitly locking half of the ring onto each CPU. You can do this using GHC.Conc.forkOnIO in GHC 6.8.x, and you'll also need +RTS -qm -qw.
Also make sure that you're not using the main thread for any part of the main computation, because the main thread is a bound thread and runs in its own OS thread, so communication between the main thread and any other thread is slow.
I had to see the results for myself :-)
old RTS: 0m54.296s
threaded RTS (-N1): 0m56.839s
threaded RTS (-N2): 0m52.623s
Wow! 3x the performance for a simple change. Frustrating that there isn't a portable/standard way to express this. Also frustrating that the threaded version doesn't improve on the situation (utilization is back at 50%).
Anyway, that was a fun micro-benchmark to play with.
Did we gain any insights for submitting to the multicore shootout, http://shootout.alioth.debian.org/u64q/benchmark.php?test=all&lang=all (Where I note GHC is currently in second place, though we've not submitted any parallel programs yet). Also CC'd Isaac, Mr. Shootout. Isaac, is the quad core shootout open for business? Should we rally the troops? -- Don

Wow! 3x the performance for a simple change. Frustrating that there isn't a portable/standard way to express this. Also frustrating that the threaded version doesn't improve on the situation (utilization is back at 50%).

GRrrrr, retraction, retraction!
I was obviously too tired when I posted this. In generalizing the system to take a run-time-specified number of CPUs (for forkOnIO) and tokens, the behavior changed from my 3-minute runs. I'll play with it more tonight.

Thomas
participants (7)

- Brandon S. Allbery KF8NH
- Don Stewart
- Jed Brown
- Krzysztof Skrzętnicki
- Luke Palmer
- Thomas Davie
- Thomas M. DuBuisson