No "last core parallel slowdown" on OS X

I'm a huge fan of the recent paper

    http://ghcmutterings.wordpress.com/2009/03/03/new-paper-runtime-support-for-...

which put me over the top to get started writing parallel code in Haskell. Parallel code is now integral to my and my Ph.D. students' research. For example, we recently checked an assertion for the roughly 69 billion atomic lattices on six atoms, in a day rather than a week, using perhaps 6 lines of parallel code in otherwise sequential code. When you're anxiously waiting for the answer, a day is a lot better than a week. (The enumeration itself is down to two hours on 7 cores, which astounds me. I see no reason to ever use another language.)

In that paper, they routinely benchmark N-1 cores on an N core Linux box, because of a noticeable falloff using the last core, which can do more harm than good. I had confirmed this on my four core Linux box, but was puzzled that my two core MacBook showed no such falloff. Hey, two cores isn't representative of many cores, cache issues yada yada, so I waited.

I just got an EFi-X "boot processor" (efi-x.com) working on a nearly identical quad core box that I built, and I tested the same computations with OS X. For my test case, there's a mild cost to moving to parallel at all, but...

Compared to 2 cores, using 3, 4 cores on a four core Linux box gives speedups of 1.37x, 1.38x. Compared to 2 cores, using 3, 4 cores on an equivalent four core box running OS X gives speedups of 1.45x, 1.9x. Here 1.5x, 2.0x is ideal, so I'm thrilled.

If we can't shame Linux into fixing this, I'm never looking back. How true is this for other parallel languages? Haskell alone is perhaps too fringe to cause a Linux scandal over this, even if it should...

The EFi-X boot processor itself is rather expensive ($240 now), and there's sticking to a specific hardware compatibility list, and I needed to update my motherboard BIOS and the EFi-X firmware, but no other fiddling for me. These boxes are just compute servers for me; I would have been ok returning to Linux, but not if it means giving up a core. People worry about compatibility ("I sensed a softness in the surround sound in game X..."), but for me the above numbers put all this in perspective.

Another way to put this, especially for those who don't have a strong preference for building their own machines, and can't wait for Linux to get its act together: If you're serious about parallel Haskell, buy a Mac Pro.
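The post doesn't show the actual code, but a minimal sketch of the "few lines of parallel code in otherwise sequential code" style it describes might look like the following. Everything here is illustrative (the name `countAll` and the chunk function are invented, not the author's code); it uses only `par` and `pseq` from base, so it compiles with a stock GHC:

```haskell
import GHC.Conc (par, pseq)

-- Hypothetical sketch: split the search space into independent chunks,
-- count each chunk with an ordinary sequential function, and spark the
-- chunk counts in parallel. `par` sparks evaluation of `rest` on
-- another core while `pseq` forces this chunk's count first.
countAll :: (Integer -> Integer) -> [Integer] -> Integer
countAll countChunk = go . map countChunk
  where
    go []     = 0
    go (c:cs) = let rest = go cs
                in rest `par` (c `pseq` (c + rest))

main :: IO ()
main = print (countAll (\n -> sum [1 .. n]) [1 .. 1000])
```

Compiled with `ghc -threaded` and run with `+RTS -N4`, the sparks spread across cores; without `-threaded` the same program simply runs sequentially.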

That looks great! I wonder what about Mac OS leads to such good performance...
Now if only we could get a nice x86_64-producing GHC for Mac OS too, I
could use all my RAM and the extra registers my Mac Pro gives me :)
On Sat, Apr 18, 2009 at 2:39 PM, Dave Bayer wrote:
I'm a huge fan of the recent paper
http://ghcmutterings.wordpress.com/2009/03/03/new-paper-runtime-support-for-...
which put me over the top to get started writing parallel code in Haskell. Parallel code is now integral to my and my Ph.D. students' research. For example, we recently checked an assertion for the roughly 69 billion atomic lattices on six atoms, in a day rather than a week, using perhaps 6 lines of parallel code in otherwise sequential code. When you're anxiously waiting for the answer, a day is a lot better than a week. (The enumeration itself is down to two hours on 7 cores, which astounds me. I see no reason to ever use another language.)
In that paper, they routinely benchmark N-1 cores on an N core Linux box, because of a noticeable falloff using the last core, which can do more harm than good. I had confirmed this on my four core Linux box, but was puzzled that my two core MacBook showed no such falloff. Hey, two cores isn't representative of many cores, cache issues yada yada, so I waited.
I just got an EFi-X "boot processor" (efi-x.com) working on a nearly identical quad core box that I built, and I tested the same computations with OS X. For my test case, there's a mild cost to moving to parallel at all, but...
Compared to 2 cores, using 3, 4 cores on a four core Linux box gives speedups of
1.37x, 1.38x
Compared to 2 cores, using 3, 4 cores on an equivalent four core box running OS X gives speedups of
1.45x, 1.9x
Here 1.5x, 2.0x is ideal, so I'm thrilled. If we can't shame Linux into fixing this, I'm never looking back. How true is this for other parallel languages? Haskell alone is perhaps too fringe to cause a Linux scandal over this, even if it should...
The EFi-X boot processor itself is rather expensive ($240 now), and there's sticking to a specific hardware compatibility list, and I needed to update my motherboard BIOS and the EFi-X firmware, but no other fiddling for me. These boxes are just compute servers for me, I would have been ok returning to Linux, but not if it means giving up a core. People worry about compatibility, "I sensed a softness in the surround sound in game X...", but for me the above numbers put all this in perspective.
Another way to put this, especially for those who don't have a strong preference for building their own machines, and can't wait for Linux to get its act together:
If you're serious about parallel Haskell, buy a Mac Pro.

_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

Yikes! You're right. I never noticed, but I never had an 8 GB Mac before.

I looked at ./configure for the GHC 6.10.2 source, and realized there was already something there. I tried

    ./configure --build=x86_64-apple-darwin

and it didn't work. However, it did give me something to Google, leading me to

    Ticket #2965 (new feature request)
    GHC on OS X does not compile 64-bit
    1/19/09
    http://hackage.haskell.org/trac/ghc/ticket/2965

Apparently this isn't a one-liner. Once my semester ends, I'll see if I can help. I've got to wonder if it would be less work to sneak a gift Mac Pro past Microsoft security, and just wait? ;-)

The GHC team has been busting their humps on parallel code lately, and OS X does so much better... They should stop having to apologize in papers for the poor parallel performance of Linux itself. A decked-out Mac Pro should be the flagship platform for 64 bit, parallel GHC.

On Apr 18, 2009, at 1:46 PM, Daniel Peebles wrote:
That looks great! I wonder what about Mac OS leads to such good performance...
Now if only we could get a nice x86_64-producing GHC for Mac OS too, I could use all my RAM and the extra registers my Mac Pro gives me :)

Excerpts from Dave Bayer's message of Sat Apr 18 19:05:34 -0500 2009:
Yikes! You're right. I never noticed, but I never had an 8 GB Mac before.
I looked at ./configure for the GHC 6.10.2 source, and realized there was already something there. I tried
./configure --build=x86_64-apple-darwin
and it didn't work. However, it did give me something to Google, leading me to
Ticket #2965 (new feature request)
GHC on OS X does not compile 64-bit
1/19/09
http://hackage.haskell.org/trac/ghc/ticket/2965
Apparently this isn't a one-liner. Once my semester ends, I'll see if I can help.
I've got to wonder if it would be less work to sneak a gift Mac Pro past Microsoft security, and just wait? ;-)
The GHC team has been busting their humps on parallel code lately, and OS X does so much better... They should stop having to apologize in papers for the poor parallel performance of Linux itself. A decked-out Mac Pro should be the flagship platform for 64 bit, parallel GHC.
On Apr 18, 2009, at 1:46 PM, Daniel Peebles wrote:
That looks great! I wonder what about Mac OS leads to such good performance...
Now if only we could get a nice x86_64-producing GHC for Mac OS too, I could use all my RAM and the extra registers my Mac Pro gives me :)
Please add yourself to the CC list of the bug - more people need to show they care! I'm currently the owner of the bug, so if you bother me enough I'll get to working on it quicker (once I have more time...)

At the very least, once the build system fixes are in place to allow hc-bootstrapping, it should happen fairly quickly. Right now I'm just not quite sure what the full path necessary is for getting a copy of GHC head to be 64-bit on OS X.

Daniel, have you gotten anywhere with your version of GHC 6.6 on OS X?

Austin

On April 18, 2009 16:46:44 Daniel Peebles wrote:
That looks great! I wonder what about Mac OS leads to such good performance...
Now if only we could get a nice x86_64-producing GHC for Mac OS too, I could use all my RAM and the extra registers my Mac Pro gives me :)
I was a bit surprised when I read the initial report because (1) I thought GHC had a hard time with 32-bit x86 code due to the integer register pressure and hacking around the stack-based FPU, and (2) I thought OS X had multithreading performance issues (or at least that is what I had read in various reports regarding using it as a server).

This leaves me wondering: how do the absolute numbers compare? Could the extra overhead due to the various 32-bit issues be giving more room for better threading performance? What do you get if you use 32-bit GHC with Linux?

Cheers! -Tyson

On Apr 19, 2009, at 9:59 PM, Tyson Whitehead wrote:
This leaves me wondering: how do the absolute numbers compare? Could the extra overhead due to the various 32-bit issues be giving more room for better threading performance? What do you get if you use 32-bit GHC with Linux?
Oddly enough, these are 32 bit GHC implementations in both cases. Our departmental sys admin has stayed with 32 bit Linux. Real and user times are in minutes; the ratio is user over real time.

Linux, 2 x 3.16 GHz Xeon X5460:

    cores:    1      2      3      4
    real:   466.7  250.8  183.7  149.3
    user:   466.4  479.0  505.2  528.1
    ratio:   1.00   1.91   2.75   3.54

OS X, 2.4 GHz Q6600:

    cores:    1      2      3      4
    real:   676.9  359.4  246.7  191.4
    user:   673.4  673.7  675.9  674.8
    ratio:   0.99   1.87   2.74   3.53

Dave Bayer:
In that paper, they routinely benchmark N-1 cores on an N core Linux box, because of a noticeable falloff using the last core, which can do more harm than good. I had confirmed this on my four core Linux box, but was puzzled that my two core MacBook showed no such falloff. Hey, two cores isn't representative of many cores, cache issues yada yada, so I waited. [..] Compared to 2 cores, using 3, 4 cores on an equivalent four core box running OS X gives speedups of
1.45x, 1.9x
As another data point, in our work on Data Parallel Haskell, we ran benchmarks on an 8-core Xserve (OS X) and an 8-core Sun T2 (Solaris). On both machines, we had no problem using all 8 cores. Manuel

Manuel M T Chakravarty wrote:
Dave Bayer:
In that paper, they routinely benchmark N-1 cores on an N core Linux box, because of a noticeable falloff using the last core, which can do more harm than good. I had confirmed this on my four core Linux box, but was puzzled that my two core MacBook showed no such falloff. Hey, two cores isn't representative of many cores, cache issues yada yada, so I waited. [..] Compared to 2 cores, using 3, 4 cores on an equivalent four core box running OS X gives speedups of
1.45x, 1.9x
As another data point, in our work on Data Parallel Haskell, we ran benchmarks on an 8-core Xserve (OS X) and an 8-core Sun T2 (Solaris). On both machines, we had no problem using all 8 cores.
I suspect some scheduling weirdness in Linux, at least in the kernel we're using here (2.6.25). Traces appeared to show that one of our threads was being descheduled for a few ms, and this can be particularly severe in GHC since our stop-the-world GC needs frequent synchronisations.

One advantage of moving to processor-independent GCs would be that we could degrade more gracefully if the CPUs are contended, or the OS scheduler just decides to use a core for something else for a while.

Cheers,
Simon
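A back-of-the-envelope model (purely illustrative, not GHC's actual scheduler) shows why a few-ms descheduling stall is so costly under stop-the-world GC: collection can only begin once every capability reaches the synchronisation point, so the pause is the maximum of the per-thread delays, and one parked thread stalls all the cores.

```haskell
-- Toy model: each capability takes some delay (in ms) to reach the
-- GC sync point; the whole program waits for the slowest one.
syncPause :: [Double] -> Double
syncPause = maximum

main :: IO ()
main =
  -- three capabilities arrive promptly; one was descheduled for 5 ms
  print (syncPause [0.1, 0.1, 0.1, 5.0])
```

Because the GC needs frequent synchronisations, as Simon notes, even occasional stalls of this size accumulate into a visible slowdown.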

I ran some longer trials, and noticed a further pattern I wish I could explain:

I'm comparing the enumeration of the roughly 69 billion atomic lattices on six atoms, on my four core, 2.4 GHz Q6600 box running OS X, against an eight core, 2 x 3.16 GHz Xeon X5460 box at my department running Linux. Note that my processor now costs $200 (it's the venerable "Dodge Dart" of quad core chips), while the pair of Xeon processors cost $2400. The Haskell code is straightforward; it uses bit fields and reverse search, but it doesn't take advantage of symmetry, so it must "touch" every lattice to complete the enumeration. Its memory footprint is insignificant.

Never mind 7 cores, Linux performs worse before it runs out of cores. Comparing 1, 2, 3, 4 cores on each machine, look at "real" and "user" time in minutes, and the ratio of user to real time:

Linux, 2 x 3.16 GHz Xeon X5460:

    cores:    1      2      3      4
    real:   466.7  250.8  183.7  149.3
    user:   466.4  479.0  505.2  528.1
    ratio:   1.00   1.91   2.75   3.54

OS X, 2.4 GHz Q6600:

    cores:    1      2      3      4
    real:   676.9  359.4  246.7  191.4
    user:   673.4  673.7  675.9  674.8
    ratio:   0.99   1.87   2.74   3.53

These ratios match up like physical constants, or at least invariants of my Haskell implementation. However, the user time is constant on OS X, so these ratios reflect the actual parallel speedup on OS X. The user time climbs steadily on Linux, significantly diluting the parallel speedup on Linux. Somehow, whatever is going wrong in the interaction between Haskell and Linux is being captured in this increase in user time.

I love how my cheap little box comes close to pulling even with a departmental compute server I can't afford, because of this difference in operating systems.
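For anyone checking the arithmetic, the "ratio" figures above are user time divided by real (wall-clock) time, which is why a flat user time makes the ratio equal the true speedup. A small script (illustrative only) reproduces the Linux row:

```haskell
-- Reproduce the "ratio" row of the Linux table above:
-- ratio = user time / real (wall-clock) time, per core count.
ratios :: [Double] -> [Double] -> [Double]
ratios user real = zipWith (/) user real

main :: IO ()
main = mapM_ print (ratios [466.4, 479.0, 505.2, 528.1]   -- user, minutes
                           [466.7, 250.8, 183.7, 149.3])  -- real, minutes
```

On Linux the ratio climbs toward 3.54 while the wall-clock speedup is only 466.7/149.3 ≈ 3.13, exactly the dilution described above; on OS X the wall-clock speedup 676.9/191.4 ≈ 3.54 matches its ratio.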

2009/4/20 Dave Bayer
I ran some longer trials, and noticed a further pattern I wish I could explain:
I'm comparing the enumeration of the roughly 69 billion atomic lattices on six atoms, on my four core, 2.4 GHz Q6600 box running OS X, against an eight core, 2 x 3.16 GHz Xeon X5460 box at my department running Linux. Note that my processor now costs $200 (it's the venerable "Dodge Dart" of quad core chips), while the pair of Xeon processors cost $2400. The Haskell code is straightforward; it uses bit fields and reverse search, but it doesn't take advantage of symmetry, so it must "touch" every lattice to complete the enumeration. Its memory footprint is insignificant.
Never mind 7 cores, Linux performs worse before it runs out of cores. Comparing 1, 2, 3, 4 cores on each machine, look at "real" and "user" time in minutes, and the ratio:
Linux, 2 x 3.16 GHz Xeon X5460:

    cores:    1      2      3      4
    real:   466.7  250.8  183.7  149.3
    user:   466.4  479.0  505.2  528.1
    ratio:   1.00   1.91   2.75   3.54

OS X, 2.4 GHz Q6600:

    cores:    1      2      3      4
    real:   676.9  359.4  246.7  191.4
    user:   673.4  673.7  675.9  674.8
    ratio:   0.99   1.87   2.74   3.53
These ratios match up like physical constants, or at least invariants of my Haskell implementation. However, the user time is constant on OS X, so these ratios reflect the actual parallel speedup on OS X. The user time climbs steadily on Linux, significantly diluting the parallel speedup on Linux. Somehow, whatever is going wrong in the interaction between Haskell and Linux is being captured in this increase in user time.
We can't necessarily blame this on Linux: the two machines have different hardware. There could be cache effects at play, for example.

Maybe you could try the new affinity options (+RTS -qa) and see if that makes any difference? That would reduce the effect of scheduling decisions by the OS (although when the number of cores you use is less than the real number of cores in the machine, the OS is still free to move threads around; to get reliable numbers you should really disable some of the cores at boot-time).

Cheers,
Simon

marlowsd:
2009/4/20 Dave Bayer
: I ran some longer trials, and noticed a further pattern I wish I could explain:
I'm comparing the enumeration of the roughly 69 billion atomic lattices on six atoms, on my four core, 2.4 GHz Q6600 box running OS X, against an eight core, 2 x 3.16 GHz Xeon X5460 box at my department running Linux. Note that my processor now costs $200 (it's the venerable "Dodge Dart" of quad core chips), while the pair of Xeon processors cost $2400. The Haskell code is straightforward; it uses bit fields and reverse search, but it doesn't take advantage of symmetry, so it must "touch" every lattice to complete the enumeration. Its memory footprint is insignificant.
Never mind 7 cores, Linux performs worse before it runs out of cores. Comparing 1, 2, 3, 4 cores on each machine, look at "real" and "user" time in minutes, and the ratio:
Linux, 2 x 3.16 GHz Xeon X5460:

    cores:    1      2      3      4
    real:   466.7  250.8  183.7  149.3
    user:   466.4  479.0  505.2  528.1
    ratio:   1.00   1.91   2.75   3.54

OS X, 2.4 GHz Q6600:

    cores:    1      2      3      4
    real:   676.9  359.4  246.7  191.4
    user:   673.4  673.7  675.9  674.8
    ratio:   0.99   1.87   2.74   3.53
These ratios match up like physical constants, or at least invariants of my Haskell implementation. However, the user time is constant on OS X, so these ratios reflect the actual parallel speedup on OS X. The user time climbs steadily on Linux, significantly diluting the parallel speedup on Linux. Somehow, whatever is going wrong in the interaction between Haskell and Linux is being captured in this increase in user time.
We can't necessarily blame this on Linux: the two machines have different hardware. There could be cache-effects at play, for example.
Maybe you could try the new affinity options (+RTS -qa) and see if that makes any difference? That would reduce the effect of scheduling effects due to the OS (although when the number of cores you use is less than the real number of cores in the machine, the OS is still free to move threads around. To get reliable numbers you should really disable some of the cores at boot-time).
Little advice and tidbits are creeping out of Simon's head. Is it time for a parallel performance wiki, where every question that becomes an FAQ gets documented live?

    http://haskell.org/haskellwiki/Performance/Parallel

Maybe put details on the wiki so we can grow a large FAQ to capture this "oral tradition".

-- Don

2009/4/21 Don Stewart
Little advice and tidbits are creeping out of Simon's head.
Is it time for a parallel performance wiki, where every question that becomes an FAQ gets documented live?
http://haskell.org/haskellwiki/Performance/Parallel
Maybe put details on the wiki so we can grow a large FAQ to capture this "oral tradition".
Absolutely. One reservation I have is that advice is likely to go out of date quite quickly; for example I'm planning to change the RTS options again before we release 6.12.1 to improve the default behaviour. Another reservation I have is that it's very difficult to pin down techniques that work consistently over different OSs and hardware. The best we can do is to document the techniques we know about, and advise people to try a variety of things to see which works best. Even that would be better than nothing, of course.

Does anyone feel able to make a start setting up a wiki tree for parallel performance? I'd be more than happy to contribute and review content.

Cheers,
Simon

On April 21, 2009 04:39:40 Simon Marlow wrote:
These ratios match up like physical constants, or at least invariants of my Haskell implementation. However, the user time is constant on OS X, so these ratios reflect the actual parallel speedup on OS X. The user time climbs steadily on Linux, significantly diluting the parallel speedup on Linux. Somehow, whatever is going wrong in the interaction between Haskell and Linux is being captured in this increase in user time.
We can't necessarily blame this on Linux: the two machines have different hardware. There could be cache-effects at play, for example.
Why not try booting a CD or thumb-drive Linux distro (e.g., Ubuntu Live) on your 2.4 GHz Q6600 OS X box and see how things stack up? It would certainly eliminate any questions of hardware differences.

Cheers! -Tyson

My first post was comparing almost identical machines: different Q6600 steppings (the earlier chip makes a good space heater!) on different motherboards, same memory, both stock speeds. In a few weeks when the semester ends, I'll be able to try Linux -vs- BSD -vs- OS X on identical hardware, and try Simon's settings.

(I do love overclocking, but five minutes improving Haskell code is generally more effective than a day tweaking motherboard voltages. We're too "green" to use A/C in the hot California summer, and this computer exhausts through a dryer hose out my office window as it is. I don't want it any hotter, I just want more cores!)

I do have some experience comparing this code on four different Linux boxes, and three different Macs, and Linux does consistently worse. I waited to post until I could compare 4 cores against 4 cores on nearly identical hardware. Also, I tried many approaches to this code, and what I've been testing is my best version, which also happens to be one of the simplest approaches to parallelism. (It so often works that way with Haskell.) In fairness, I should also run the standard test suite used in the paper.

On Apr 21, 2009, at 10:14 AM, Tyson Whitehead wrote:
Why not try booting a CD or thumb-drive linux distro (e.g., ubuntu live) on your 2.4 GHz Q6600 OS X box and see how things stack up. It would certainly eliminate any questions of hardware differences.
Cheers! -Tyson
I can do even better: this $65 bay device takes four 2.5" SATA or SAS drives:

    http://addonics.com/products/raid_system/ae4rcs25nsa.asp

It has surprisingly good build quality and makes it trivial to juggle 2.5" SATA drives. Removing the high-low jumper disables the loud fan, which is probably only needed for SAS drives. My primary drive is an OCZ Vertex SSD, for which this is perfect. I also have an assortment of spare laptop drives I can use, so an OS survey will be easy.

participants (7)
- Austin Seipp
- Daniel Peebles
- Dave Bayer
- Don Stewart
- Manuel M T Chakravarty
- Simon Marlow
- Tyson Whitehead