Re: [Haskell-cafe] optimising for vector units

Well, I have to say the dataflow style of lazy programming made me think Haskell would be ideal for multi-processor use (and now that HyperThreading is common, most PCs have more than one processor from the code's point of view)... I was disappointed to find GHC only uses one thread, and therefore will only use one CPU.

Vectorisation is much more limited than hyperthreading, as the instructions on each vector unit must be the same... The easiest implementation would be to add primitive vector types that map directly to the underlying SIMD operations. This is probably a good starting point - it would be nice to vectorise normal expressions, but that requires an optimisation search to find instructions that can be grouped together... this becomes a lot easier if the primitives are supported, as you just need to transform the code rather than dealing with non-portable assembly instructions.

The problem is that different platforms support different SIMD-style instructions... What would really be needed is an IEEE/ISO standard for vector instructions, much like the current one for floating point. In the absence of such a standard, the best that can be done is to abstract vectorisation by word size and number of words, and supply a software implementation of all vector ops to use if the hardware does not support certain primitives.

A further point is that for the compiler to know the vector size, you would have to resort to type-level naturals... something like:

    data Nil
    data Suc x = Suc x

    type Four = Suc (Suc (Suc (Suc Nil)))

    myFn :: Vector Four Word16

Of course this would need to be integrated with the compiler. As an interim measure a C library providing primitive operations on vectors could be written and used via the FFI.

Keean.
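Expanding that snippet into a minimal, self-contained sketch of the type-level-naturals idea (assuming a GHC that accepts empty data declarations; Vector, replicateV and zipWithV are made-up names for illustration, not an existing library):

    {-# LANGUAGE EmptyDataDecls #-}
    module VectorSketch where

    import Data.Word (Word16)

    -- Type-level naturals: the vector length lives in the type.
    data Nil
    data Suc x = Suc x

    type Four = Suc (Suc (Suc (Suc Nil)))

    -- Placeholder representation; a real implementation would map this
    -- onto hardware SIMD registers (or onto a C library via the FFI).
    newtype Vector n a = Vector [a]

    -- Build a 4-wide vector by repeating one element.
    replicateV :: a -> Vector Four a
    replicateV x = Vector (replicate 4 x)

    -- Element-wise combination; the phantom length guarantees both
    -- arguments have the same width.
    zipWithV :: (a -> b -> c) -> Vector n a -> Vector n b -> Vector n c
    zipWithV f (Vector xs) (Vector ys) = Vector (zipWith f xs ys)

    -- The myFn from above, made concrete: add 1 to every lane.
    myFn :: Vector Four Word16 -> Vector Four Word16
    myFn = zipWithV (+) (replicateV 1)

The point is only that the width is checked statically; the list representation is obviously not what a vectorising compiler would emit.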

On Mon, 2004-07-26 at 10:58, MR K P SCHUPKE wrote:
In the absence of such a standard, the best that can be done is to abstract vectorisation by word size and number of words, and supply a software implementation of all vector ops to use if the hardware does not support certain primitives.
Recent versions of gcc provide a semi-portable vector extension. You define some vector data types and gcc will make use of whatever vector capability exists on the machine you are compiling for; otherwise it uses narrower vector units if possible, or just scalar units if that is all that is available.

http://gcc.gnu.org/onlinedocs/gcc-3.4.1/gcc/Vector-Extensions.html

Correct me if I'm wrong, but I thought there was some overhead to making FFI calls (even safe/unsafe ones) which would probably kill any advantage from using small vector units. In which case it would need to be implemented as GHC primitive ops, which is rather more work.

Duncan
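To make the overhead concern concrete, here is a rough sketch of what the interim "C library via the FFI" route might look like (add_v4hi is a hypothetical C function, not part of any existing library):

    {-# LANGUAGE ForeignFunctionInterface #-}
    module FFIVector where

    import Data.Word (Word16)
    import Foreign.Marshal.Array (peekArray, withArray)
    import Foreign.Ptr (Ptr)

    -- Hypothetical C primitive: add two 4-element Word16 vectors,
    -- writing the result through the third pointer.
    foreign import ccall unsafe "add_v4hi"
      c_add_v4hi :: Ptr Word16 -> Ptr Word16 -> Ptr Word16 -> IO ()

    -- Each 4-element add costs an FFI call plus three array marshals,
    -- which is exactly the per-call overhead being worried about here.
    addV4 :: [Word16] -> [Word16] -> IO [Word16]
    addV4 xs ys =
      withArray xs $ \px ->
        withArray ys $ \py ->
          withArray (replicate 4 0) $ \pz -> do
            c_add_v4hi px py pz
            peekArray 4 pz

Even with an unsafe call, the marshalling alone can cost more than the four additions it saves, which is why primitive ops look necessary for such small vectors.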

MR K P SCHUPKE
Well I have to say the dataflow style of lazy programming made me think Haskell would be ideal for multi-processor use (and now HyperThreading is common most PCs have more than one processor from the code's point of view)...
I was disappointed to find GHC only uses one thread, and therefore will only use one CPU.
I'm sure somebody, somewhere, is working on speculative execution of Haskell code. Now that Sun is building Niagara with (IIRC) 8 cores on a chip, Intel is rumoured to be putting up to 16 cores on the post-Montecito Tukwila, Sony/IBM's Cell is said to be some kind of parallel design, and every self-respecting chip vendor is putting at least two cores on a chip and/or adding multithreaded cores. Even if there's a lot of rumour-mongering going on, it seems fairly clear that you can only go on adding more cache to your old designs for so long; the bottleneck then becomes the inherent parallelism in your code.

One would expect a lazy and pure language to be excellent for parallelization, since the programmer is generally removed from the actual flow of execution anyway. At some point (for some n), being able to spawn n threads will gain you more than a factor-c constant overhead, and Haskell programs, with a run-time system that can evaluate expressions in parallel, will outperform single-threaded C code. (But it probably isn't that simple, or we would have it already :-)

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde wrote:
I'm sure somebody, somewhere, is working on speculative execution of Haskell code.
I think we both graduated (myself from MIT, Robert Ennals from Cambridge; I'm building compilers for supercomputers at Sun). If anyone else is working seriously on speculative evaluation of Haskell at the moment, please drop me a line, I'd love to hear of your existence.
Now that Sun is building Niagara with IIRC 8 cores on a chip, Intel is rumoured to put up to 16 cores on the post-Montecito Tukwila, Sony/IBM's Cell is said to be some kind of parallel design, and every self-respecting chip vendor is putting at least two cores on a chip and/or adding multithreaded cores.
One would expect a lazy and pure language to be excellent for parallelization, since the programmer is generally removed from the actual flow of execution anyway.
As I recall, this was one of the original goals of the GRIP project (which built the original incarnation of GHC, if I have my history straight). Indeed, if you read any early paper on strictness analysis, you'd think that it was all about parallelizing lazy code. I'm now of the opinion that strictness analysis tells us where to *serialize* our code, because parallelism just won't be worth it.

There are, I believe, a couple of major challenges:

* It's easy to identify very small pieces of parallel work, but much harder to identify large, yet finite, pieces of work. Only the latter are really worth parallelizing.

* If you don't compute speculatively, you'll never find enough work to do.

* If you compute speculatively, you need some way to *stop* working on useless, yet infinite computations.

Rob and I had two rather different takes on the same basic idea: run with some sort of resource limits, and stop working when you exhaust them. Roughly speaking, Rob then made sure you never speculated that bit of code again, while I just kept relying on the bounds to kick in. On pretty eager code, I did all right, but on code that actually needed the laziness (this was after a good bit of compiler munging to wring out most of the unnecessary laziness) I got killed---Rob's approach fixed the mistake by patching the code, and did much better thereafter.

My ultimate goal was parallelization. However, I'm now convinced it would take about 2-3 programmer-years more of effort to realize that goal. Parallel garbage collection (which is *not* optional on a parallel Haskell implementation, as our experience with pH demonstrated) and very fine-grained thunk-update protocols are both tricky technical challenges. And that leaves aside the question of whether there's enough coarse-grained work buried among all that fine-grained work to make it all worthwhile.

Anyone want to pay me to do this? :-)
At some point (for some n), being able to spawn n threads will gain you more than a factor c constant overhead, and Haskell programs, with a run-time system that can evaluate expressions in paralllel, will outperform single threaded C code.
Only if you can keep the granularity of the work large. This hasn't been so easy.

-Jan-Willem Maessen
Eager Haskell Implementor
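For concreteness: even with the par/pseq combinators GHC already exposes (Control.Parallel, assuming that module is available in your setup), making this pay off is mostly about choosing a cutoff so that each spark carries a worthwhile chunk of work. A minimal sketch of that idea (the cutoff and arguments are arbitrary):

    import Control.Parallel (par, pseq)

    -- Sequential reference version.
    seqFib :: Int -> Integer
    seqFib n
      | n < 2     = fromIntegral n
      | otherwise = seqFib (n - 1) + seqFib (n - 2)

    -- Spark the right branch only while the argument is above the
    -- cutoff, so each spark pays for its own scheduling overhead;
    -- below the cutoff we fall back to the sequential version.
    parFib :: Int -> Int -> Integer
    parFib cutoff n
      | n < cutoff = seqFib n
      | otherwise  = right `par` (left `pseq` (left + right))
      where
        left  = parFib cutoff (n - 1)
        right = parFib cutoff (n - 2)

    main :: IO ()
    main = print (parFib 25 35)

Without the cutoff, every recursive call becomes a tiny spark and the bookkeeping swamps the useful work, which is exactly the granularity problem being discussed.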

Jan-Willem Maessen - Sun Labs East
There are, I believe, a couple of major challenges:
* It's easy to identify very small pieces of parallel work, but much harder to identify large, yet finite, pieces of work. Only the latter are really worth parallelizing.
By the former, are you thinking of so small grain that it is handled by out-of-order execution units in the CPU? And/or the C compiler?
* If you don't compute speculatively, you'll never find enough work to do.
Although I'm not familiar with the issues, my point is that the number of CPUs available, even in common household pee cees, is already more than one (P4 hyper-threading), and could be something like eight in the not-so-distant future. It no longer matters (much) if you waste cycles; cycles are cheap. (The next-next IA64, Montecito, is 1.7G transistors, including 24MB of on-chip cache. The P4 is big, but you could fit thirty of them in that space. No way Montecito is going to have anywhere near 30x the performance.)

So speculative execution, even if you end up throwing away 50% of the work you do, could in theory make your program faster anyway. This is a headache for C programs; my hope would be that a functional language would make it easier.
* If you compute speculatively, you need some way to *stop* working on useless, yet infinite computations.
And you need to choose which computations to start working on, I guess. Predicting the future never was easy :-) [perhaps getting off-topic, but hey, this is -cafe] -kzm -- If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde wrote:
Jan-Willem Maessen - Sun Labs East writes:

There are, I believe, a couple of major challenges:
* It's easy to identify very small pieces of parallel work, but much harder to identify large, yet finite, pieces of work. Only the latter are really worth parallelizing.
By the former, are you thinking of so small grain that it is handled by out-of-order execution units in the CPU? And/or the C compiler?
I'm thinking of grains so small that the cost of fine-grained thread creation (tens of instructions) is comparable to the cost of running the thread itself (hundreds of instructions). ILP is a bit of a sideshow here; usually there's plenty of integer ILP hanging about already.
* If you don't compute speculatively, you'll never find enough work to do.
Although I'm not familiar with the issues, my point is that the number of CPUs available, even in common household pee cees, is already more than one (P4 hyper-threading), and could be something like eight in the not-so-distant future. It no longer matters (much) if you waste cycles; cycles are cheap. (The next-next IA64, Montecito, is 1.7G transistors, including 24MB of on-chip cache. The P4 is big, but you could fit thirty of them in that space. No way Montecito is going to have anywhere near 30x the performance.)
So speculative execution, even if you end up throwing away 50% of the work you do, could in theory make your program faster anyway. This is a headache for C programs; my hope would be that a functional language would make it easier.
Be careful. A P4 will slow down if you get it hot enough. So "throwing away" that bit of extra performance may actually make things slower... And who gets to use the throwaway performance, anyhow? You want it, the OS wants it, other applications on your machine want it---you have to be able to adjust to varying amounts of "extra" compute power hanging around. [Actually, speaking from experience, half the compute power on my SMP ends up going to the GUI, the OS, or my browser---and it's great to have those things not slow down uniprocessor compute-bound tasks.]
* If you compute speculatively, you need some way to *stop* working on useless, yet infinite computations.
And you need to choose which computations to start working on, I guess. Predicting the future never was easy :-)
Indeed.
[perhaps getting off-topic, but hey, this is -cafe]
That's why I subscribe to -cafe at all! -Jan-Willem Maessen

Jan-Willem Maessen - Sun Labs East
I'm building compilers for supercomputers at Sun
<casual>So, any plans for compilers for functional languages making use of Niagara? -kzm -- If I haven't seen further, it is by standing in the footprints of giants

Jan-Willem Maessen writes (in the Haskell Cafe):
My ultimate goal was parallelization. However, I'm now convinced it would take about 2-3 programmer-years more of effort to realize that goal. Parallel garbage collection (which is *not* optional on a parallel Haskell implementation, as our experience with pH demonstrated) and very fine-grained thunk-update protocols are both tricky technical challenges. And that leaves aside the question of whether there's enough coarse-grained work buried among all that fine-grained work to make it all worthwhile.
I'm not sure how to interpret this. Will you also have solved the granularity problem in these two to three years? Cheers, Ronny Wichers Schreur

Ronny Wichers Schreur wrote:
Jan-Willem Maessen writes (in the Haskell Cafe):
My ultimate goal was parallelization. However, I'm now convinced it would take about 2-3 programmer-years more of effort to realize that goal. Parallel garbage collection (which is *not* optional on a parallel Haskell implementation, as our experience with pH demonstrated) and very fine-grained thunk-update protocols are both tricky technical challenges. And that leaves aside the question of whether there's enough coarse-grained work buried among all that fine-grained work to make it all worthwhile.
I'm not sure how to interpret this. Will you also have solved the granularity problem in these two to three years?
Probably not. One might hope to do well by choosing an appropriate schedule (favor leaves locally, roots when work is moved between processors). But there's no evidence that generating the work lazily would even leave enough work to be done. There's a difference between a system that would work (and probably work OK for 2-4 processors) and a system that would scale. The latter would also require some hard thought from the application programmer.

-Jan-Willem Maessen
participants (5)
- Duncan Coutts
- Jan-Willem Maessen - Sun Labs East
- Ketil Malde
- MR K P SCHUPKE
- Ronny Wichers Schreur