ANNOUNCE: Sun Microsystems and Haskell.org joint project on OpenSPARC

http://haskell.org/opensparc/

I am very pleased to announce a joint project between Sun Microsystems and the Haskell.org community to exploit the high performance capabilities of Sun's latest multi-core OpenSPARC systems via Haskell!

http://opensparc.net/

Sun has donated a powerful 8-core SPARC Enterprise T5120 Server to the Haskell community, and $10,000 to fund a student, to further develop support for high performance Haskell on the SPARC. The aim of the project is to improve the SPARC native code generator in GHC and to demonstrate and improve the results of parallel Haskell benchmarks. The student will work with a mentor from Haskell.org and an adviser from Sun's SPARC compiler team.

** We are now inviting applications from students **

Please forward this announcement to any and all mailing lists where you think interested students might be lurking. Further details for students may be found below, and on the project website.

Haskell and Multi-core Systems
------------------------------

The latest generation of multi-core machines poses a number of problems for traditional languages and parallel programming techniques. Haskell, in contrast, supports a wealth of approaches for writing correct parallel programs: traditional explicit threads and locks (forkIO and MVars), pure parallel evaluation strategies (par), and also Software Transactional Memory (STM).

GHC has supported lightweight preemptable threads for a long time, and for the last couple of years it has been able to take advantage of machines with multiple CPUs or CPU cores. The GHC runtime has also recently gained a parallel garbage collector.

OpenSPARC
---------

We think the UltraSPARC T1/T2 architecture is a very interesting platform for Haskell, in particular the way that each core multiplexes many threads as a way of hiding memory latency. Memory latency is a performance bottleneck for Haskell code because the execution model uses a lot of memory indirections.
Essentially, when one thread blocks due to a main memory read, the next thread is able to continue. This is in contrast to traditional architectures, where the CPU core would stall until the result of the memory read was available. This approach can achieve high utilisation as long as there is enough parallelism available.

The Project
-----------

GHC is increasingly relying on its native code backend for high performance. Respectable single-threaded performance is a prerequisite for decent parallel performance. The first stage of the project is therefore to implement a new SPARC native code generator, taking advantage of the recent and ongoing infrastructure improvements in the C-- and native layers of the GHC backend. There is some existing support for SPARC in the native code generator, but it has not kept up with changes in the GHC backend over the last few years.

Once the code generator is working we will want to get a range of single-threaded and parallel benchmarks running and look for opportunities for improvement. There is plenty of ongoing work on the generic parts of the GHC backend and run-time system, so the project will focus on SPARC-specific aspects.

The UltraSPARC T1/T2 architecture supports very fast thread synchronisation (by taking advantage of the fact that all threads share the same L2 cache). We would like to optimise the synchronisation primitives in the GHC libraries and run-time system to take advantage of this. This should provide the basis for exploring whether the lower synchronisation costs make it advantageous to use more fine-grained parallelism.

The Server
----------

The T5120 server has an UltraSPARC T2 processor with 8 cores running at 1.2GHz. Each core multiplexes 8 threads, giving 64 hardware threads overall. It comes equipped with 32GB of memory and two 146GB 10k RPM SAS disks.
http://www.sun.com/servers/coolthreads/t5120/
http://www.sun.com/processors/UltraSPARC-T2/

This server is a donation to the whole Haskell community. We will make accounts available on the same basis as the existing community server as soon as is practical. Our friends at Chalmers University of Technology are kindly hosting the server on our behalf. We will encourage people to use the server for building, testing and benchmarking their Haskell software on SPARC, under both Solaris and Linux.

Student applications
--------------------

This is a challenging and exciting project and will need a high-calibre student. Familiarity with Haskell is obviously important, as is some experience with code generation for RISC instruction sets.

The summer is now upon us, so we do not expect students to be able to work 3 months all in one go. We are inviting students to suggest their own schedule when they apply. This may involve blocks of time in the next 9 months or so; it should add up to the equivalent of 3 months of full-time work.

The application process is relatively informal. Students should send their application to: opensparc@community.haskell.org

The deadline for applications is Friday 5th September 2008. If that deadline is likely to be a problem for you then do get in touch. The application should detail skills and experience.

Applications will be reviewed by a panel including the mentor, the adviser from Sun and a number of other Haskell.org community members who have helped with reviewing student projects in the past. The review will be partly interactive; students can expect to get questions and feedback from the reviewers. Students are welcome to contact me or anyone else to help improve the quality of their application, or if they have any questions.

The $10k student funding will be paid in three phases: at the beginning ($3k), at an intermediate point ($3k) and at the end ($4k). The exact timing will depend on the agreed schedule.
The intermediate and final payments will be subject to positive reviews from the mentor.

Duncan (project coordinator)
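The three approaches to parallel programming listed in the announcement (explicit threads with MVars, pure `par` sparks, and STM) can be sketched minimally using only GHC's `base` library. This is an illustrative sketch, not project code; the function names (`mvarExample` etc.) are invented for this example:

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import GHC.Conc (par, pseq, atomically, newTVarIO, readTVar, writeTVar)

-- 1. Explicit threads and locks: fork a worker thread and
--    communicate its result back through an MVar.
mvarExample :: IO Int
mvarExample = do
  box <- newEmptyMVar
  _ <- forkIO (putMVar box (sum [1 .. 100 :: Int]))
  takeMVar box

-- 2. Pure parallel evaluation: 'par' sparks x for evaluation in
--    parallel while we evaluate y, then combine the results.
parExample :: Int
parExample = x `par` (y `pseq` (x + y))
  where
    x = sum [1 .. 100]
    y = product [1 .. 10]

-- 3. Software Transactional Memory: an atomic read-modify-write
--    on a shared transactional variable.
stmExample :: IO Int
stmExample = do
  counter <- newTVarIO (0 :: Int)
  atomically (readTVar counter >>= writeTVar counter . (+ 1))
  atomically (readTVar counter)
```

All three compile with a stock GHC; the `par` variant only actually runs in parallel when built with `-threaded` and run with `+RTS -N`.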

On 24 Jul 2008, at 3:52 am, Duncan Coutts wrote: [Sun have donated a T5120 server + USD10k to develop support for Haskell on the SPARC.]

This is wonderful news.

I have a 500MHz UltraSPARC II on my desktop running Solaris 2.10. Some time ago I tried to install GHC 6.6.1 on it, but ended up with something that compiles to C ok, but then invokes some C compiler with the option "-fwrapv", which no compiler on that machine accepts, certainly none that was present when I installed it. I would really love to be able to use GHC on that machine.

I also have an account on a T1 server, but the research group Sun gave it to chose to run Linux on it, of all things.

So binary distributions for SPARC/Solaris and SPARC/Linux would be very, very nice things for this new project to deliver early. (Or some kind of source distribution that doesn't need a working GHC to start with.)

On 2008 Jul 24, at 0:43, Richard A. O'Keefe wrote:
So binary distributions for SPARC/Solaris and SPARC/Linux would be very, very nice things for this new project to deliver early. (Or some kind of source distribution that doesn't need a working GHC to start with.)
I'm still working on SPARC/Solaris here as well. (Still trying to get a build that doesn't produce executables that throw "schedule re-entered unsafely" immediately on startup.)

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

On Thu, 2008-07-24 at 16:43 +1200, Richard A. O'Keefe wrote:
On 24 Jul 2008, at 3:52 am, Duncan Coutts wrote: [Sun have donated a T5120 server + USD10k to develop support for Haskell on the SPARC.]
This is wonderful news.
I have a 500MHz UltraSPARC II on my desktop running Solaris 2.10.
I have a 500MHz UltraSPARC II on my desktop running Gentoo Linux. :-)
Some time ago I tried to install GHC 6.6.1 on it, but ended up with something that compiles to C ok, but then invokes some C compiler with option "-fwrapv", which no compiler on that machine accepts, certainly none that was present when I installed it.
I've got ghc 6.8.2 working, but only with -fvia-C and only unregisterised. "-fwrapv" is an option to some version of gcc, but I couldn't tell you which.
I would really love to be able to use GHC on that machine.
Me too :-), or in my case use it a bit quicker. Unregisterised ghc builds are pretty slow.
I also have an account on a T1 server, but the research group who Sun gave it to chose to run Linux on it, of all things.
Our tentative plan is to partition our T2 server using logical domains and run both Solaris and Linux. We'd like to set up ghc build bots on both OSs.
So binary distributions for SPARC/Solaris and SPARC/Linux would be very very nice things for this new project to deliver early.
I guess this is something that anyone with an account on the box could do, so once we get to the stage where we're handing out accounts then hopefully this will follow.

The project isn't aiming to get the registerised C backend working nicely; we're aiming to get a decent native backend. That should also be much less fragile, by not depending so closely on the version of gcc.
(Or some kind of source distribution that doesn't need a working GHC to start with.)
That's a tad harder. It needs a lot of build system hacking.

Duncan

Neat stuff. I used to work at Sun in the Solaris kernel group; the SPARC architecture is quite elegant. I wonder if we can find an interesting use for the register windows in a Haskell compiler. Many compilers for non-C-like languages (such as Boquist's, which jhc is based on - in spirit, if not code) just ignore the windows and treat the architecture as having a flat 32-register file.

John

--
John Meacham - ⑆repetae.net⑆john⑈

On Thu, 2008-07-24 at 14:38 -0700, John Meacham wrote:
Neat stuff. I used to work at Sun in the Solaris kernel group; the SPARC architecture is quite elegant. I wonder if we can find an interesting use for the register windows in a Haskell compiler. Many compilers for non-C-like languages (such as Boquist's, which jhc is based on - in spirit, if not code) just ignore the windows and treat the architecture as having a flat 32-register file.
Right. GHC on SPARC has also always disabled the register window when running Haskell code (at least for registerised builds) and only uses it when using the C stack and calling C functions. We should discuss this with our project adviser from the SPARC compiler group.

The problem, of course, is recursion and deeply nested call stacks, which don't make good use of register windows because they keep having to interrupt to spill them to the save area. I vaguely wondered if they might be useful for leaf calls, or more generally where we can see statically that the call depth is small (and we can see all callers of said function, since it'd change the calling convention).

But now you mention it, I wonder if there is anything even more cunning we could do, perhaps with lightweight threads or something. Or perhaps an area to quickly save registers at GC safe points.

Duncan

On 25/07/2008, at 8:55 AM, Duncan Coutts wrote:
Right. GHC on SPARC has also always disabled the register window when running Haskell code (at least for registerised builds) and only uses it when using the C stack and calling C functions.
I'm not sure whether register windows and continuation-based back-ends are ever going to be very good matches - I don't remember the last time I saw a 'ret' instruction in the generated code :). If there's a killer application for register windows in GHC it'd be something tricky.

I'd be more interested in the 8 hardware threads per core; [1] suggests that (single-threaded) GHC code spends over half its time stalled due to L2 data cache misses. 64 threads per machine is a good incentive for trying out a few `par` calls..

Ben.

[1] http://www.cl.cam.ac.uk/~am21/papers/msp02.ps.gz

... The UltraSPARC T1/T2 architecture supports very fast thread synchronisation (by taking advantage of the fact that all threads share the same L2 cache). ...

Ah, scratch that second part then - though this is perhaps less of an issue when you have 4MB of L2 cache, vs the 256k cache for the machine in the paper.

Ben.

On 25/07/2008, at 10:38 AM, Ben Lippmeier wrote:
On 25/07/2008, at 8:55 AM, Duncan Coutts wrote:
Right. GHC on SPARC has also always disabled the register window when running Haskell code (at least for registerised builds) and only uses it when using the C stack and calling C functions.
I'm not sure whether register windows and continuation-based back-ends are ever going to be very good matches - I don't remember the last time I saw a 'ret' instruction in the generated code :). If there's a killer application for register windows in GHC it'd be something tricky.
I'd be more interested in the 8 x hardware threads per core, [1] suggests that (single threaded) GHC code spends over half its time stalled due to L2 data cache miss. 64 threads per machine is a good incentive for trying out a few `par` calls..
Ben.

On Fri, 2008-07-25 at 10:38 +1000, Ben Lippmeier wrote:
I'd be more interested in the 8 x hardware threads per core, [1] suggests that (single threaded) GHC code spends over half its time stalled due to L2 data cache miss.
Right, that's what I think is most interesting, and why I wanted to get this project going in the first place. If we spend so long blocked on memory reads that we're only utilising 50% of a core's time, then there's lots of room for improvement if we can fill in that wasted time by running another thread. That's the supposed advantage of multiplexing several threads per core.

If Haskell is suffering more than other languages from memory latency and low utilisation, then we've also got the most to gain from this multiplexing approach.
64 threads per machine is a good incentive for trying out a few `par` calls..
Of course it then means we need to have enough work to do. Indeed we need quite a bit just to break even, because each core is relatively stripped down, without all the out-of-order execution etc.

Duncan
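The granularity point above - that each spark must carry enough work to pay for itself on these simpler in-order cores - can be sketched with a hand-rolled chunked parallel sum using only `par` and `pseq` from `base`. This is an illustrative sketch (the names `chunkedParSum` and `chunksOf` are invented here), not anything from GHC itself:

```haskell
import GHC.Conc (par, pseq)

-- Sum a list by splitting it into coarse chunks and sparking
-- one chunk per spark, rather than one spark per element.
-- Coarse chunks amortise the cost of creating and scheduling sparks.
chunkedParSum :: Int -> [Int] -> Int
chunkedParSum chunkSize xs = go (chunksOf chunkSize xs)
  where
    go []     = 0
    go (c:cs) = let s    = sum c      -- work for this spark
                    rest = go cs      -- evaluated on the current thread
                in s `par` (rest `pseq` (s + rest))

    -- Split a list into chunks of at most n elements.
    chunksOf :: Int -> [a] -> [[a]]
    chunksOf _ [] = []
    chunksOf n ys = let (h, t) = splitAt n ys in h : chunksOf n t
```

With a tiny `chunkSize` each spark does almost no work and the overhead dominates; tuning the chunk size is exactly the "break even" trade-off being discussed.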

On 25/07/2008, at 12:42 PM, Duncan Coutts wrote:
Of course then it means we need to have enough work to do. Indeed we need quite a bit just to break even because each core is relatively stripped down without all the out-of-order execution etc.
I don't think that will hurt too much. The code that GHC emits is very regular and the basic blocks tend to be small. A good proportion of it is just for copying data between the stack and the heap. On the upside, it's all very clean and amenable to some simple peephole optimisation / compile-time reordering.

I remember someone telling me that one of the outcomes of the Itanium project was that they didn't get the (low-level) compile-time optimisations to perform as well as they had hoped. The reasoning was that a highly speculative/out-of-order processor with all the trimmings has a lot more dynamic information about the state of the program, and can make decisions on the fly that are better than anything you could ever get statically at compile time. -- does anyone have a reference for this?

Anyway, this problem is moot with GHC code. There's barely any instruction-level parallelism to exploit anyway, but adding an extra hardware thread is just a `par` away. To quote a talk on that paper from earlier: "GHC programs turn an Athlon into a 486 with a high clock speed!"

Ben.

Hi!
If we spend so long blocked on memory reads that we're only utilising 50% of a core's time then there's lots of room for improvements if we can fill in that wasted time by running another thread.
How can you see how much your program waits because of L2 misses? I have been playing lately with dual quad-core Intel Xeon Mac Pros with 12MB of L2 cache per CPU and a 1.6GHz bus speed, and it would be interesting to check these things there.

Mitar

http://valgrind.org/info/tools.html

On 26/07/2008, at 11:02 AM, Mitar wrote:
Hi!
If we spend so long blocked on memory reads that we're only utilising 50% of a core's time then there's lots of room for improvements if we can fill in that wasted time by running another thread.
How can you see how much your program waits because of L2 misses? I have been playing lately with dual quad-core Intel Xeon Mac Pros with 12MB of L2 cache per CPU and a 1.6GHz bus speed, and it would be interesting to check these things there.
Mitar

A tool originally developed to measure cache misses in GHC :)

Ben.Lippmeier:
http://valgrind.org/info/tools.html
On 26/07/2008, at 11:02 AM, Mitar wrote:
Hi!
If we spend so long blocked on memory reads that we're only utilising 50% of a core's time then there's lots of room for improvements if we can fill in that wasted time by running another thread.
How can you see how much your program waits because of L2 misses? I have been playing lately with dual quad-core Intel Xeon Mac Pros with 12MB of L2 cache per CPU and a 1.6GHz bus speed, and it would be interesting to check these things there.
Mitar

Hi!
On Sat, Jul 26, 2008 at 3:17 AM, Ben Lippmeier wrote:
No support for Mac OS X. :-( Mitar

On Sat, 2008-07-26 at 03:02 +0200, Mitar wrote:
Hi!
If we spend so long blocked on memory reads that we're only utilising 50% of a core's time then there's lots of room for improvements if we can fill in that wasted time by running another thread.
How can you see how much does your program wait because of L2 misses? I have been playing lately with dual Quad-Core Intel Xeon Mac Pros with 12 MB L2 cache per CPU and 1.6 GHz bus speed and it would be interesting to check this things there.
Take a look at the paper that Ben referred to: http://www.cl.cam.ac.uk/~am21/papers/msp02.ps.gz

They use hardware performance counters.

Duncan

On 25 Jul 2008, at 10:55 am, Duncan Coutts wrote:
The problem of course is recursion and deeply nested call stacks which don't make good use of register windows because they keep having to interrupt to spill them to the save area.
A fair bit of thought was put into SPARC V9 to make saving and restoring register windows a lot cheaper than it used to be. (And the Sun C compiler learned how to do TRO.)

It's nice to have 3 windows:
  <C world>        startup
  <Haskell world>  normal Haskell code
  <millicode>      special support code
so that normal code doesn't have to leave registers spare for millicode routines.
participants (7)

- Ben Lippmeier
- Brandon S. Allbery KF8NH
- Don Stewart
- Duncan Coutts
- John Meacham
- Mitar
- Richard A. O'Keefe