
OK, so... If you were going to forget everything we humans know about digital computer design - the von Neumann architecture, the fetch/decode/execute loop, the whole shooting match - and design a computer *explicitly* for the purpose of executing Haskell programs... what would it look like? What design decisions would you make? Why?

Andrew Coppin
OK, so... If you were going to forget everything we humans know about digital computer design - the von Neumann architecture, the fetch/decode/execute loop, the whole shooting match - and design a computer *explicitly* for the purpose of executing Haskell programs... what would it look like?
Back in the eighties (I don't remember exactly when), Thomas Clarke, Stuart Wray and I got together to think this through (we had the possibility of funding to make something). We had lots of ideas, but after much arguing back and forth the conclusion we reached was that anything we could do would either be slower than mainstream hardware or would be overtaken by it in a very short space of time.

I'm not sure that the conclusion still holds, because conventional architectures are approaching an impasse, but there's a lot of force left in the arguments: most of the improvements I can think of also benefit imperative languages, so if they're worthwhile they'll happen anyway. One of the things that is different now is the availability of parallelism, but the mainstream is working pretty hard on that. There's an opportunity there, in that functional languages have some nice properties when it comes to parallel execution, but to make an impact we'd have to get on with it pretty sharpish.

-- Jón Fairbairn Jon.Fairbairn@cl.cam.ac.uk
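[A minimal sketch of the "nice properties" Jón mentions, assuming GHC with the Control.Parallel module: because the two recursive calls below are pure, sparking one of them onto another core cannot change the result, only the schedule. The function names and the cutoff are illustrative, not taken from any real implementation.]

import Control.Parallel (par, pseq)

-- Naive doubly-recursive Fibonacci; one spark per call above a cutoff.
-- Purity guarantees the parallel and sequential versions agree.
pfib :: Int -> Integer
pfib n
  | n < 25    = sfib n                        -- too small to be worth a spark
  | otherwise = left `par` (right `pseq` (left + right))
  where
    left  = pfib (n - 1)
    right = pfib (n - 2)

sfib :: Int -> Integer
sfib n = if n < 2 then fromIntegral n else sfib (n - 1) + sfib (n - 2)

main :: IO ()
main = print (pfib 35)   -- build with ghc -threaded, run with +RTS -N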

either be slower than mainstream hardware or would be overtaken by it in a very short space of time.
i'd like to underline the last of these two points, and i'm impressed that you came to that conclusion as early as the eighties. i'm not into hardware research myself, but while i was working on my MSc, a couple of PhD students in the same group were developing an abstract machine and a hardware realisation (that must have been very early nineties?) for a reduction language (purely functional). unlike earlier designs, the hardware only leaned toward functional, rather than being specific to it (mostly RISC, with large register files organised as very fast stack windows for a small number of stacks), and numbers from the hand-configured prototype suggested that it would be about twice as fast as contemporary standard hardware.

which was great, until it became clear that, in the time it would have taken to go from that prototype to production, the next generation of that standard hardware would have been on the market, also twice as fast (with the next next generation already on its way).. for a non-fp example, see the one-laptop-per-child project: now that they're actually looking for firm orders, they have the first mainstream competitors, and anything those do, they tend to do with a lot of backup, and twice as well a short time later..

the suggestion that the mainstream might be running out of steam along one particular dimension is interesting, but in my naive view, there is still the difference between any one-shot research project and a snapshot in a development pipeline of great momentum (are they still running several overlapping teams to double the flow through that pipeline?). i wouldn't want to dishearten anyone, though. just try not to aim for a single piece of hardware, but for a suitable change to one of those mainstream development pipelines, perhaps?

claus

"Claus Reinke"
either be slower than mainstream hardware or would be overtaken by it in a very short space of time.
i'd like to underline the last of these two points, and i'm impressed that you came to that conclusion as early as the eighties.
Well, Stuart and I had just been working on TIM, and Thomas Clarke was one of the folk who built SKIM, so we had a fair bit of relevant experience between us. It must have been between 1986 (when we'd finished TIM) and 1989 (when I got ill).
unlike earlier designs, the hardware only leaned toward functional, rather than being specific to it (mostly RISC, with large register files organised as very fast stack windows for a small number of stacks),
We had that sort of thing in mind -- my first implementation of TIM was on an ARM, so I knew something of what RISC had to offer.
and numbers from the hand-configured prototype suggested that it would be about twice as fast as contemporary standard hardware. which was great, until it became clear that, in the time it would have taken to go from that prototype to production, the next generation of that standard hardware would have been on the market, also twice as fast (with the next next generation already on its way)..
Yup, that's what we figured (we knew the ARM guys too, and knowing the rate at which they worked probably helped us see things quite clearly :-).
the suggestion that the mainstream might be running out of steam along one particular dimension is interesting, but in my naive view, there is still the difference between any one-shot research project and a snapshot in a development pipeline of great momentum
Yes. I think the best bet is to get hold of prototypes and research fp implementations for them; my TIM implementation of Ponder must have been the first fp language on ARM, and the process of doing that probably informed much of the detailed design of TIM. Doing an fp implementation for a huge number of cores strikes me as an exciting avenue.

-- Jón Fairbairn Jon.Fairbairn@cl.cam.ac.uk

Hello Jon, Friday, June 1, 2007, 11:17:07 PM, you wrote:
(we had the possibility of funding to make something). We had lots of ideas, but after much arguing back and forth the conclusion we reached was that anything we could do would either be slower than mainstream hardware or would be
this looks rather strange to me. it's well known that the von Neumann architecture is a bottleneck, with only one operation executed at a time. it was developed in 1946 because in those days all programming was done in binary code and simplicity of programming was favoured, but more efficient computational models exist. if a cpu consists of a huge number of execution engines which synchronise their operations only when one unit uses the results produced by another, then we get a processor with a huge amount of natural parallelism, friendlier to FP programs. it seems that we are now moving right in this direction with GPUs

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
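[To make the dataflow picture concrete, a toy sketch assuming only Control.Concurrent from GHC: MVars stand in for single-assignment wires, and each "execution engine" is a thread that blocks only at the moment it needs a result another unit has not yet produced. The wiring and names are purely illustrative.]

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar

-- A "wire" carrying one value from a producer unit to its consumers.
type Wire = MVar Int

-- An execution unit: read both inputs (blocking only if they are not
-- ready yet), compute, write the single output.
unit :: (Int -> Int -> Int) -> Wire -> Wire -> Wire -> IO ()
unit f inA inB out = do
  a <- readMVar inA
  b <- readMVar inB
  putMVar out (f a b)

main :: IO ()
main = do
  [x, y, s, p, r] <- mapM (const newEmptyMVar) [1 .. 5 :: Int]
  -- Three units wired up as (x + y) * (x * y); they synchronise only
  -- through the wires, not through any shared program counter.
  mapM_ forkIO [ unit (+) x y s
               , unit (*) x y p
               , unit (*) s p r ]
  putMVar x 3
  putMVar y 4
  print =<< readMVar r   -- 84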

Bulat
That was done to death as well in the '80s - data flow architectures where execution was data-availability driven. The issue becomes one of getting the most out of the available silicon area. Unfortunately, with very small amounts of computation per work unit you:

a) spend a lot of time/area making the matching decision - the "what to do next";
b) find the basic sequential blocks of code are too small - you can't pipeline efficiently.

Locality is essential for performance. It is needed to hide all the (relatively large) latencies in fetching things. If anyone wants to build the new style of functional programming execution hardware, it is those issues that need to be sorted.

That's not to say that Haskell is the wrong beast to think about these things in. Its demand-driven execution framework (sketched briefly below) and its ability to perform multiple concurrent evaluations safely are the unique points.

Neil

PS if any of you really want to attack this seriously - do get in touch. I've got notes and stuff from the '80s when we (at Bristol) looked into this. I've also got evaluation frameworks for modeling communication behaviour and performance (which you'll need) - naturally all written in Haskell!
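[For anyone newer to the terminology: "demand driven" here is just lazy graph reduction - nothing is computed until its value is needed, and a shared thunk is reduced at most once. A tiny self-contained illustration, using Debug.Trace only to make the reductions visible; the numbers are arbitrary.]

import Debug.Trace (trace)

-- Each element announces when it is actually reduced.
costly :: Int -> Int
costly n = trace ("reducing " ++ show n) (n * n)

main :: IO ()
main = do
  let xs = map costly [1 ..]   -- an infinite list of unevaluated thunks
      y  = xs !! 3             -- a shared thunk, not yet forced
  print (y + y)                -- "reducing 4" is printed once, not twice
  print (take 2 xs)            -- only elements 1 and 2 get reduced here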

"Neil Davies"
Bulat
That was done to death as well in the '80s - data flow architectures where the execution was data-availability driven. The issue becomes one of getting the most of the available silicon area. Unfortunately with very small amounts of computation per work unit you: a) spend a lot of time/area making the matching decision - the what to do next b) the basic sequential blocks of code are too small - can't efficiently pipeline
Locality is essential for performance. It is needed to hide all the (relatively large) latencies in fetching things.
If anyone wants to build the new style of functional programming execution hardware, it is those issues that need to be sorted.
Yes indeed, though a lot has changed (and a lot stayed the same) in hardware since then. There may be greater possibilities for integrating garbage collection in the memory, for example, and there's always the possibility that someone will come up with a new and radically different way of spreading a functional programme across multiple CPU cores.

-- Jón Fairbairn Jon.Fairbairn@cl.cam.ac.uk
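[Jón's aside about integrating garbage collection in the memory is easier to weigh with the job written down: what such hardware would take off the CPU is essentially reachability over the heap graph. A toy mark-and-sweep model over an immutable heap follows; it is purely illustrative and makes no claim about how an in-memory collector would actually be built.]

import qualified Data.Map as M
import qualified Data.Set as S

-- A heap of cells; each cell lists the cells it points to.
type Addr = Int
type Heap = M.Map Addr [Addr]

-- The "mark" phase: everything reachable from the roots survives.
reachable :: Heap -> [Addr] -> S.Set Addr
reachable heap = go S.empty
  where
    go seen []            = seen
    go seen (a : as)
      | a `S.member` seen = go seen as
      | otherwise         = go (S.insert a seen)
                               (M.findWithDefault [] a heap ++ as)

-- The "sweep": drop every cell that is not reachable.
gc :: [Addr] -> Heap -> Heap
gc roots heap = M.filterWithKey (\a _ -> a `S.member` live) heap
  where live = reachable heap roots

main :: IO ()
main = do
  let heap = M.fromList [(1, [2]), (2, [3]), (3, []), (4, [5]), (5, [4])]
  print (M.keys (gc [1] heap))   -- [1,2,3]; the unreachable 4<->5 cycle goes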

Bulat Ziganshin wrote:
it seems that now we move right into this direction with GPUs
I was just thinking that GPUs might make a good target for a reduction language like Haskell. They are hugely parallel, and they have the commercial momentum to keep them current. It also occurred to me that the Cell processor (in the PlayStation 3) might be a good target, considering its explicitly parallel architecture.

it seems that now we move right into this direction with GPUs
I was just thinking that GPUs might make a good target for a reduction language like Haskell. They are hugely parallel, and they have the commercial momentum to keep them current. It also occurred to me that the cell processor (in the Playstation 3) might also be a good target considering its explicitly parallel architecture.
They are no good. GPUs have no synchronisation between units, which is needed for graph reduction. A delayed computation goes through several steps between being created and being actually computed/forced, and we'd like to save that effort. Also, they are slow when dealing with memory accesses, and some are slow at conditional execution. Take a look at BrookGPU: http://graphics.stanford.edu/projects/brookgpu/ They have a raytracer on the GPU and it is SLOW because of the high cost of tree traversal.

On 6/5/07, szefirov@ot.ru
it seems that now we move right into this direction with GPUs
They are no good.
GPU's have no synchronisation between them which is needed for graph reduction.
GPUs are intrinsically parallel devices and might work very well for parallel graph reduction.
Also, they are slow when dealing with memory accesses. Some are slow on conditional execution.
Ummm...they read and write memory blazingly fast. Modern games happily run at 60Hz at hidef resolutions using multipass renders and combining many layers of texture to render each pixel. And when you say 'some' you must be referring to older devices.
Take a look at BrookGPU: http://graphics.stanford.edu/projects/brookgpu/ They have raytracer on GPU and it is SLOW because of high cost of tree traversal.
And these guys have a fast ray tracer: http://www.nvidia.com/page/gz_home.html so you have demonstrated that some people can write SLOW ray tracers in one particular language.

I'm not convinced that GPUs are terribly suitable for graph reduction, but not for the reasons you give. The main problem I see is that you can't immediately read back memory you've just written, because GPUs use a streaming model, and they are limited in how many instructions they can execute at a time. And those problems may already have gone away by now, as I'm slightly out of date in my knowledge of GPUs...

-- Dan

but more efficient computational model exists. if cpu consists from huge amount of execution engines which synchronizes their operations only when one unit uses results produces by another then we got processor with huge level of natural parallelism and friendlier for FP programs. it seems that now we move right into this direction with GPUs
GPUs are pretty normal processors. They are different in that they usually write into a single predetermined memory location and fetch from texture memory with floating-point indexes, the so-called "gather" mode of operation. The dataflow approach (the "scatter" approach) splits into static dataflow, with limited parallelism, and dynamic dataflow, with huge potential parallelism. The dynamic dataflow approach (currently investigated by the research team I am a proud member of) requires substantial hardware support in the form of associative memory for matching computation contexts. Also, that "huge amount of execution engines" has to be connected somehow; the interconnection network is also a non-trivial task. Problems, problems, problems, problems and no easy solution. ;) BTW, the second of our modeling engines was written in Haskell. ;)
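[A hedged sketch of what the "associative memory for matching computation contexts" does, modelled with an ordinary Map rather than hardware: a token arrives tagged with its destination instruction and dynamic context; if its partner operand is already waiting the pair fires, otherwise the token is parked. All names here are made up for illustration and are not taken from the actual design being described.]

import qualified Data.Map as M

-- A tag names the destination instruction and the dynamic context
-- (e.g. loop iteration or call instance) the token belongs to.
type Tag   = (String, Int)
data Token = Token Tag Int

-- The matching store: tokens parked while waiting for a partner.
type MatchStore = M.Map Tag Int

-- Deliver one token: either it fires (both operands returned) or it
-- is parked until its partner arrives.
deliver :: Token -> MatchStore -> (Maybe (Tag, Int, Int), MatchStore)
deliver (Token t v) store =
  case M.lookup t store of
    Just w  -> (Just (t, w, v), M.delete t store)   -- partner present: fire
    Nothing -> (Nothing, M.insert t v store)        -- park and wait

main :: IO ()
main = do
  let tokens = [ Token ("add", 0) 3
               , Token ("mul", 1) 5
               , Token ("add", 0) 4 ]  -- partner of the first token
      step (fired, store) tok =
        let (f, store') = deliver tok store
        in (maybe fired (: fired) f, store')
      (firings, leftover) = foldl step ([], M.empty) tokens
  print firings              -- [(("add",0),3,4)]
  print (M.toList leftover)  -- [(("mul",1),5)] still waiting for a partner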

Hello,

Did you see the recently announced Reduceron project? (Or perhaps you are already involved in that project?) http://www-users.cs.york.ac.uk/~mfn/reduceron/index.html

If you search scholar.google.com for "graph reduction machine" you should turn up a bunch of papers about attempts to build a processor for Haskell and Haskell-like languages. (A small sketch of what such a machine reduces is given after the quoted question below.)

j.

At Fri, 01 Jun 2007 19:47:28 +0100, Andrew Coppin wrote:
OK, so... If you were going to forget everything we humans know about digital computer design - the von Neumann architecture, the fetch/decode/execute loop, the whole shooting match - and design a computer *explicitly* for the purpose of executing Haskell programs... what would it look like?
What design decisions would you make?
Why?
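[The Reduceron instantiates templates for whole supercombinators in hardware; the sketch below is far more modest - a plain normal-order SK reducer - but it is the same computational model that those "graph reduction machine" papers implement in silicon. There is no sharing or in-place updating here, so this illustrates only the semantics, not the machine.]

-- A toy combinator expression language and a normal-order reducer.
data Expr = S | K | I | Lit Int | App Expr Expr
  deriving Show

-- One outermost (normal-order) reduction step, if a redex exists.
step :: Expr -> Maybe Expr
step (App I x)                 = Just x
step (App (App K x) _)         = Just x
step (App (App (App S f) g) x) = Just (App (App f x) (App g x))
step (App f x)                 = case step f of
                                   Just f' -> Just (App f' x)
                                   Nothing -> fmap (App f) (step x)
step _                         = Nothing

-- Keep stepping until no redex remains (may diverge, as reduction can).
normalise :: Expr -> Expr
normalise e = maybe e normalise (step e)

main :: IO ()
main = print (normalise (App (App (App S K) K) (Lit 42)))  -- S K K x ==> Lit 42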
participants (9): Al Falloon, Andrew Coppin, Bulat Ziganshin, Claus Reinke, Dan Piponi, Jeremy Shaw, Jon Fairbairn, Neil Davies, szefirov@ot.ru