Dynamic thread management?

Haskell/FP seems to have solved the hardest bit of threading, which is making it obvious which bits of a program are parallelizable and which are not. What remains is to actually parallelize the programs. Am I being naive, or is this trivial?

There has been a lot of talk about parallelizing a program statically, at compile time, but this would seem to run into the halting problem: the only way to know how long a function, or one of its sub-functions, will take to run is to actually run it. Is there some reason why we can't just start a function running in a single thread, with profiling switched on, then after a while check which bits of the function are taking time to run and are parallelizable, and parallelize those out? This sounds reasonably trivial and workable? (A sketch of what the inserted parallelism would look like appears at the end of this message.)

Basically, "parallelizable" in a Haskell context means, as a first approximation, any map, foldr, foldl or derivatives, and any independent let assignments; we can always add in extra parallelizable cases later. Profiling is already built into Haskell AFAIK, so the profiling information is already available?

Thoughts? Is there some reason why such an approach has been disregarded, or is it harder than it sounds?

(By the way, can someone provide me a non-Google link to the SPJ video talking about nested data parallelism? Google Video is unavailable from my current location, and although I did watch this video once, it was on my second day of doing Haskell, a few weeks ago, so much of it was incomprehensible to me ;-) )
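By "parallelize those out" I mean, roughly, that the runtime would insert by itself the kind of annotation we currently write by hand. A minimal sketch of the hand-written version, using today's Control.Parallel.Strategies (expensive is just a made-up stand-in for real per-element work):

    import Control.Parallel.Strategies (parMap, rseq)

    -- made-up stand-in for some real per-element work
    expensive :: Int -> Int
    expensive n = sum [1 .. n * 100000]

    -- one spark per list element; rseq suffices to force an Int result
    main :: IO ()
    main = print (sum (parMap rseq expensive [1 .. 64]))

(Compile with ghc -O2 -threaded and run with +RTS -N to actually use multiple cores.)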

hughperkins:
Haskell/FP seems to have solved the hardest bit of threading, which is making it obvious which bits of a program are parallelizable and which are not. What remains is to actually parallelize the programs. Am I being naive, or is this trivial?
Is there some reason why we can't just start a function running in a single thread, with profiling switched on, then after a while check which bits of the function are taking time to run and are parallelizable, and parallelize those out?
Perhaps have a look at this new paper: "Feedback directed implicit parallelism in Haskell" http://research.microsoft.com/~tharris/papers/2007-fdip.pdf -- Don

On 8/10/07, Donald Bruce Stewart
Perhaps have a look at this new paper:
"Feedback directed implicit parallelism in Haskell" http://research.microsoft.com/~tharris/papers/2007-fdip.pdf
-- Don
Ok, interesting. So: it's a viable strategy, and it's sort of already being done. A key difference between the approach in the Harris/Singh paper and the proposal above is that the Harris/Singh paper proposes recompiling the program multiple times. The proposal above takes more of a VM approach, something like Java or C#, where the VM adapts continuously to the program's behavior during execution.

I suppose this does partly answer my question about "how hard can this be?", in that a prerequisite is to make Haskell run in a VM? To what extent is making Haskell run in a VM possible? To what extent does GHCi attempt to do this / meet the requirements for an efficient VM?

(off to read the rest of the paper, though everything after the abstract looks scary :-D )

Not many replies on this thread? Am I so wrong that no-one's even telling me? I find it hard to believe that if there were obvious errors in the proposition, anyone would resist pointing them out to me ;-) So, that leaves a couple of possibilities: some people are agreeing, but see no point in saying so; or no-one cares, because we all only have 1- or 2-core machines.

I'm going to kind of run with the second possibility for now. However, I do believe it's the right time to solve this, what with 64-core Niagaras around the corner and so on.

What would be neat would be a way to test solutions on simulated 1024-core machines, using a single-core machine. Are there any utilities or virtual environments around that might make this kind of testing feasible?

On Aug 10, 2007, at 9:31 AM, Hugh Perkins wrote:
Not many replies on this thread? Am I so wrong that no-one's even telling me? I find it hard to believe that if there were obvious errors in the proposition, anyone would resist pointing them out to me ;-)
So, that leaves a couple of possibilities: some people are agreeing, but see no point in saying so; or no-one cares, because we all only have 1- or 2-core machines.
I'm going to kind of run with the second possibility for now. However, I do believe it's the right time to solve this, what with 64-core Niagaras around the corner and so on.
What would be neat would be a way to test solutions on simulated 1024-core machines, using a single-core machine. Are there any utilities or virtual environments around that might make this kind of testing feasible?
It's actually difficult to do realistic simulations of large machines like this; most of the performance effects you'll see depend on the behavior of the cache and memory subsystems, and it's difficult and expensive to simulate those well. So, for example, you could use something like Simics to simulate a 1024-core machine, but it'd be expensive (Simics costs money), slow (100x? slower than ordinary execution) and non-parallel (so you wouldn't be able to run it on a currently-extant multiprocessor box in the hopes of speeding up the simulation). Between the simulator slowdown and the fact that you're simulating 1024 cores using only 1 thread, you can expect to wait a long time for simulation results. Also, these things tend to require an awful lot of care and feeding. [Full disclosure: I don't personally work with Simics or its ilk, but my colleagues do.] -Jan-Willem Maessen

From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Hugh Perkins
Not many replies on this thread? Am I so wrong that no-one's even telling me? I find it hard to believe that if there were obvious errors in the proposition, anyone would resist pointing them out to me ;-)
Well, I vaguely recall a study which analysed a number of programs and determined that there was not a lot of inherent parallelism in them, so the effort to automate parallelism wouldn't have much payoff. Wait a sec... This isn't it, but seems relevant: http://www.detreville.org/papers/Limits.pdf Here you go: http://citeseer.ist.psu.edu/tremblay95impact.html
So, that leaves a couple of possibilities: some people are agreeing, but see no point in saying so; or no-one cares, because we all only have 1- or 2-core machines.
Well, the Harris/Singh paper summarises the common problems:
- not many programs are inherently parallel
- parallelism must be quite coarse to offset overheads (which I think is the problem with expecting things like map and fold to be parallelised automagically; they're just too small-grained for it to be worthwhile)
- lazy evaluation makes it harder

Basically, it's harder than it sounds. This is where NDP looks good. Although you have to explicitly label parallel operations, you get to use your functional tools like map and fold, etc.

Alistair

On 8/10/07, Bayley, Alistair
Well, the Harris/Singh paper summarises the common problems: - not many programs are inherently parallel
If that's the case, then multicores are not going to be very useful. Where there's a will there's a way. What I think is: if maps etc. will be parallelized out where necessary by the runtime, then people will adapt their programs to use them. Right now, many people use imperative loops in Haskell to get the highest performance (see the shootout examples, for instance). In the future, this will be shunned in preference for standard map and fold operations.

- parallelism must be quite coarse to offset overheads (which I think is the problem with expecting things like map and fold to be parallelised automagically; they're just too small-grained for it to be worthwhile)
Someone else said that. I don't understand what you mean. Imagine I have a map like this:

map f [1..100000]

... and I have 64 cores. So I divide the 100000 elements in my list by 64, and give each chunk of 1500-ish elements to each thread. Easy, I think? (A hand-written sketch of this split appears at the end of this message.)

Note that NDP is doing this too, *but NDP is trying to do this for asymmetric elements at compile-time, which is generally impossible*. I propose doing it at runtime, inside the VM, which is trivial and optimal. More details/examples: it's trivial to improve on NDP at runtime. For example:
- we split our map into 64 pieces, and give each one to a thread
- one piece finishes earlier than the others -> we split one of the other pieces in two, and give one half to the finished thread
- and so on... -> reasonably optimal

Another example:
- we have a map of 64 elements, so we give one element to each thread
- all of the pieces but one finish really quickly (say, within 1 second)
- so, we start to look at what is happening within the remaining piece -> turns out it contains another map, so we split that map amongst our idle threads -> and so on

Easy, flexible.

- lazy evaluation makes it harder

Not at runtime ;-)

Basically, it's harder than it sounds.

No ;-) The more I talk about this, the more confident I feel that this is an easy, optimal solution. The bits that stand out as being hard:
- people making their programs use map, fold etc. -> but I don't have to do that, it will happen automatically. It's sort of like distributing bread automatically to people in London, via the free market. Do you know that the Russians once asked the Western world who was responsible for distributing bread to people in London? Answer: no-one is, it happens automatically ;-)
- migrating Haskell/GHC/Erlang/some-other-FP to have a bytecode/VM layer
- creating a 1024-ish-core simulator. In fact this is the only bit that I don't have a clear vision for.
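By hand, the static version of that 64-way split looks something like the sketch below, using today's Strategies library (chunksOf and parMapChunked are names I've made up here):

    import Control.DeepSeq (NFData)
    import Control.Parallel.Strategies (parList, rdeepseq, using)

    -- split a list into chunks of the given size
    chunksOf :: Int -> [a] -> [[a]]
    chunksOf _ [] = []
    chunksOf n xs = take n xs : chunksOf n (drop n xs)

    -- one spark per chunk rather than per element,
    -- e.g. parMapChunked 64 f [1..100000] for 64 cores
    parMapChunked :: NFData b => Int -> (a -> b) -> [a] -> [b]
    parMapChunked nChunks f xs =
        concat (map (map f) chunks `using` parList rdeepseq)
      where
        chunkSize = (length xs + nChunks - 1) `div` nChunks
        chunks    = chunksOf chunkSize xs

The re-splitting tricks described above (stealing work from a chunk that is running long) are exactly what this static split lacks.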

Hugh Perkins wrote:
- parallelism must be quite coarse to offset overheads (which I think is the problem with expecting things like map and fold to be parallelised automagically; they're just too small-grained for it to be worthwhile)
Someone else said that. I don't understand what you mean.
If you do 1+2+3+4, it's not very efficient to spawn a new thread, have that compute 1+2, compute 3+4 in the current thread, synchronise, and do the final addition. You waste more time starting and stopping threads than doing useful work. If you're going to spawn a new thread, it needs to do "lots" of work for it to be worth it. And this is kinda what makes this stuff so tricky: between laziness and the Halting Problem, it's hard to work out what is "worth" sparking a thread for...
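To put that in code (with GHC's par and pseq it would be a spark rather than an OS thread, but the overhead argument is the same; this is just a sketch):

    import Control.Parallel (par, pseq)

    -- far too fine-grained: the spark costs more than the addition it saves
    tiny :: Int
    tiny = a `par` (b `pseq` (a + b))
      where
        a = 1 + 2
        b = 3 + 4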
Imagine I have a map like this:
map f [1..100000]
... and I have 64 cores..
So I divide the 100000 elements in my list by 64, and give each chunk of 1500-ish elements to each thread. Easy, I think?
...you realise that that's a linked-list you have there, right? ;-) Now, if you had an *array*, and you know that "f" does a constant amount of work for all elements, and that all array elements will eventually be "needed"...
Note that NDP is doing this too, *but NDP is trying to do this for asymmetric elements at compile-time, which is generally impossible*. I propose doing it at runtime, inside the VM, which is trivial and optimal.
Parallel and trivial in the same sentence? Hmm, I wonder... how did this problem not get solved 20 years ago? Oh, wait, maybe because it's harder than it looks. ;-) It certainly looks *easier* to partition the work at run-time than at compile-time. But trivial? I doubt it. (For a start, any autosensing magic you add mustn't slow the program down more than the parallelism it's supposed to be generating!)
It's trivial to improve on NDP at runtime. For example:
- we have a map of 64 elements, so we give one element to each thread - all of the pieces but one finish really quickly (say, within 1 second) - so, we start to look at what is happening within the remaining piece -> turns out it contains another map, so we split that map amongst our idle threads -> and so on
Easy, flexible.
Seems to require running some sort of "monitor" process which can somehow "see" what operation(s) a thread is trying to do, so it can split up the work and share it around. That's going to add overhead. Now that's OK so long as the gains outweigh the overhead...
- lazy evaluation makes it harder
Not at runtime ;-)
O RLY? The problem with things being lazy is that you don't know whether some data is "needed" until the moment you actually come to need it. OTOH, ideally you'd want to know way in advance so that you can compute it in parallel and it will already be "ready" when you come to use it. But with laziness, you can't [easily] do that. Doing the testing at run-time isn't helping you here...
Basically, it's harder than it sounds.
No ;-) The more I talk about this the more confident I feel that this is an easy, optimal solution.
How does the saying go? "Anything is easy for the man who does not have to implement it..."

On 8/11/07, Hugh Perkins
- parallelism must be quite coarse to offset overheads (which I think is the problem with expecting things like map and fold to be parallelised automagically; they're just too small-grained for it to be worthwhile)
Someone else said that. I don't understand what you mean.
There are many papers about this in the Parallel Logic Programming area. It is commonly called "Embarrassing Parallelism". Creating a thread, or even just scheduling a chunk of work for evaluation, has packaging-up costs, synchronization costs, and so on. It is all too easy for these costs to outweigh the work to be done, so by parallelizing your code, you make it run slower. So, if you want to parallelize "map f xs", unless f is *really* expensive, you'll only get a benefit if you can break xs into chunks of e.g. 10^3 elements or more, and more like 10^5 or more for more usual 'f's. Some tasks, like Monte Carlo integration, are very amenable to such treatment, but most tasks are not. cheers, T. -- Dr Thomas Conway

On 8/11/07, Thomas Conway
There are many papers about this in the Parallel Logic Programming area. It is commonly called "Embarrassing Parallelism".
Ah, I wasn't very precise ;-) I didn't mean I don't understand the problem; I meant I don't understand why people think it is difficult to solve ;-) (And then I tried to explain by examples why it is easy, but it is true that my communication sucks ;-) )
you'll only get a benefit if you can break xs into chunks of e.g. 10^3 elements or more, and more like 10^5 or more for more usual 'f's.
Actually, the number of elements is irrelevant. The only measure that is important is how long the function is taking to execute, and whether it is parallelizable.

Example: the following only has 3 elements, but will take a while to run:

print $ sum $ map getNumberPrimes [10240000, 20480000, 40960000]

(A hand-written parallel version of this example appears at the end of this message.) The following has 10 million elements, but runs quickly:

print $ sum $ map (+1) [1..10000000]

In the second, we start the function running in a single thread, and after a second the function has already finished, so great! We're done!

In the first, we start the function running in a single thread. After a second the function is still running, so we look at what is taking the time and whether it is parallelizable. Turns out the VM has been chugging away on the map for the last second, and that maps are parallelizable, so we split the map into min <numberElements> <numberCores> pieces, which on a 64-core machine, given we have 3 elements, is min 3 64, which is 3. So, we assign each of the 3 elements of the map to a thread; now we're using 3 of our 64 cores.

A second later, each thread is still chugging away at its element, so we think, ok, maybe we can parallelize more, because we still have 61 cores available; so now we look at the getNumberPrimes function itself, and continue the above type of analysis. This continues until all 64 cores have been assigned some work to do.

Whenever a thread finishes, we look around for some work for it to do. If there is some, we assign it. If there isn't, we look around for an existing thread that is doing something parallelizable, split its work into two pieces, and give one piece to the available thread.

Not sure why this is perceived as difficult (added the word "perceived" this time ;-) ). I think the main barrier to understanding why it is easy is understanding that this needs to be done from a VM, at runtime. It is not possible to do it statically at compilation, but I really need to finish watching SPJ's video before declaring that SPJ's proposal is doomed to fail ;-) Not least, it's probably not good to make an enemy of SPJ ;-)
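Returning to the first example above: written out by hand with today's explicit tools, it might look like the following sketch. countPrimes is a hypothetical stand-in for getNumberPrimes, and I've shrunk the bounds so it finishes in sensible time:

    import Control.Parallel.Strategies (parMap, rdeepseq)

    -- hypothetical stand-in for getNumberPrimes: count primes up to n
    -- by trial division (deliberately expensive)
    countPrimes :: Int -> Int
    countPrimes n = length [ k | k <- [2 .. n], isPrime k ]
      where
        isPrime k = all (\d -> k `mod` d /= 0)
                        [2 .. floor (sqrt (fromIntegral k :: Double))]

    -- one spark per element: exactly the 3-way split described above
    main :: IO ()
    main = print . sum $ parMap rdeepseq countPrimes [102400, 204800, 409600]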

Hugh,
I certainly think it would be wrong to declare that NDP is doomed to
failure... not because you would be making an enemy of SPJ (I'm pretty
sure you wouldn't!) but because it actually aims to solves a less
ambitious problem: the problem of parallelising the SAME task applied
to different data, rather than a collection of arbitrary tasks.
Because the task does not change, we know that e.g. taking half the
data cuts the amount of work in half. Therefore an up-front scheduling
approach can work.
If you fast forward to about 42 minutes into the London HUG video, you
see that Simon talks about the problem of parallelizing (f x) + (g y),
and says that he spent "quite a lot of time in the eighties trying to
make this game go fast [..] and the result was pretty much abject
failure".
You're absolutely right that a dynamic/adaptive approach is the only
one that will work when the tasks are of unknown size. Whether this
approach is as easy as you think is open for you to prove. I look
forward to testing your VM implementation, or at the very least
reading your paper on the subject ;-)
Regards,
Neil

You guys might also want to take a look at the Cilk programming language, and how it managed threads. If you know C, learning Cilk is about 2 hours of work, as it's C with half a dozen extra keywords and a few new concepts. I'd love to see Cilk - C + Haskell as a programming language.

The key idea of Cilk is that it's easier to deparallelize than it is to parallelize, especially automatically. So the idea is that the program is written incredibly parallel, with huge numbers of microthreads, which are (on average) very cheap to spawn. The runtime then deparallelizes the microthreads, running multiple microthreads sequentially within a single real thread (a worker thread). Microthreads that live their entire life within a single real thread are cheap to spawn (as in "not much more expensive than a normal function call" cheap).

The problem that Cilk runs into is that it's, well, C. It doesn't deal with contention well at all: a single microthread blocking blocks the whole worker thread, meaning, among other things, that you can have "false deadlocks", where one microthread blocks on another microthread in the same real thread, and thus acts like it's deadlocked even though it really isn't. You have a greatly increased likelihood of race conditions as well (mutable data and multithreading just don't mix). Plus you have all the normal fun you have with C bugs: wild pointers, buffer overruns, etc.

All of which you avoid if you replace the C aspects of Cilk with Haskell. With STM you avoid the deadlock problem entirely, and with immutable data (except for carefully coded monads) you avoid the whole race condition problem. Plus all the normal advantages Haskell has over C.

Just my $0.02.

Brian

Brian Hurt wrote:
The key idea of Cilk is that it's easier to deparallelize than it is to parallelize, especially automatically. So the idea is that the program is written incredibly parallel, with huge numbers of microthreads, which are (on average) very cheap to spawn. The runtime then deparallelizes the microthreads, running multiple microthreads sequentially within a single real thread (a worker thread). Microthreads that live their entire life within a single real thread are cheap to spawn (as in "not much more expensive than a normal function call" cheap).
That's so absurd, it might even work! I quite like the concept though - rather than asking "what can we make parallel?", you say "what should we do serially?" That sounds like an easier question to answer...

On 11/08/07, Brian Hurt
You guys might also want to take a look at the Cilk programming language, and how it managed threads. If you know C, learning Cilk is about 2 hours of work, as it's C with half a dozen extra keywords and a few new concepts. I'd love to see Cilk - C + Haskell as a programming language.
The key idea of Cilk is that it's easier to deparallelize than it is to parallelize, especially automatically. So the idea is that the program is written incredibly parallel, with huge numbers of microthreads, which are (on average) very cheap to spawn. The runtime then deparallelizes the microthreads, running multiple microthreads sequentially within a single real thread (a worker thread). Microthreads that live their entire life within a single real thread are cheap to spawn (as in "not much more expensive than a normal function call" cheap).
The problem that Cilk runs into is that it's, well, C. It doesn't deal with contention well at all: a single microthread blocking blocks the whole worker thread, meaning, among other things, that you can have "false deadlocks", where one microthread blocks on another microthread in the same real thread, and thus acts like it's deadlocked even though it really isn't. You have a greatly increased likelihood of race conditions as well (mutable data and multithreading just don't mix). Plus you have all the normal fun you have with C bugs: wild pointers, buffer overruns, etc.
All of which you avoid if you replace the C aspects of Cilk with Haskell. With STM you avoid the deadlock problem entirely, and with immutable data (except for carefully coded monads) you avoid the whole race condition problem. Plus all the normal advantages Haskell has over C.
How is this any better than using "par" in Haskell? -- Sebastian Sylvan

On Sat, 11 Aug 2007, Sebastian Sylvan wrote:
How is this any better than using "par" in Haskell?
Mainly how the threads are actually scheduled. Mind you, I'm an *incredible* Haskell newbie, so take all of my comments with a 5-pound salt block, but as I understand how the current implementation of parallel Haskell works, all threads spawned go into a communal queue. When a thread blocks, it's put back into the communal queue, and a new thread is selected to run by the worker thread.

In Cilk, the task structure is more tree-like. When thread A spawns thread B, the worker thread stops executing thread A and starts executing thread B. When thread B exits, the worker thread then resumes thread A. So at any given point in time, the worker thread has a stack of processes waiting for subthreads to complete so they can be resumed, not unlike function calls in other languages, or nested lazy blocks in Haskell.

When a worker thread runs out of work to do, when it has no more threads to execute, it picks another worker thread at random and asks it for work to do. The other worker thread (assuming it has work to do) then picks the microthread at the bottom of its resume stack, the one farthest away from being resumed, and passes that microthread's state to the original worker thread.
From the user program's perspective, this is no different from the current implementation. Threads get spawned, get executed in some unspecified order, and then complete.
What's interesting (to me, at least) are the optimization opportunities this model provides that the shared-queue model doesn't. Consider the following structural model: we have a two-tiered heap. Each worker thread has its own local heap, which only the microthreads it is managing can see. Plus there is a global heap, with objects that all microthreads in all worker threads can see. Objects are originally allocated in the local heap. When a microthread starts being executed on another worker thread, all objects it can see (reach, in the conservative GC sense) are promoted to the global heap.

The advantage of all of this is that now we can easily distinguish between objects that can be seen from other real threads, and those that can't. All of the mutable data structures (tvars, lazy thunks, everything) can be much more cheaply implemented if you know that no other thread can access them.

Take the example of a parallel map, where I'm using a tvar in my map function. The likelihood is that all of the (short-lived) microthreads I'm spawning will execute on the same worker thread, and thus the tvar will never leave the local heap, and so can be optimized down to something close to a single-threaded mvar. I have, via optimization, turned a parallel map and a synchronized tvar back into something approaching a serial map and an mvar.

Brian

Hello Brian, Saturday, August 11, 2007, 8:35:49 PM, you wrote:
The key idea of Cilk is that it's easier to deparallelize than it is to parallelize, especially automatically. So the idea is that the program is written incredibly parallel, with huge numbers of microthreads, which are (on average) very cheap to spawn. The runtime then deparallelizes the microthreads, running multiple microthreads sequentially within a single real thread (a worker thread).
this idea is in wide use now: it's how GHC, Erlang, Ruby and virtually every other interpreted language works :)) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Aug 11, 2007, at 12:35 PM, Brian Hurt wrote:
You guys might also want to take a look at the Cilk programming language, and how it managed threads. If you know C, learning Cilk is about 2 hours of work, as it's C with half a dozen extra keywords and a few new concepts. I'd love to see Cilk - C + Haskell as a programming language.
It was called pH, and we (meaning Alejandro Caro and myself) implemented it back in the mid/late 90's using Lennart Augustsson's hbcc front end (which he hacked to add a bunch of pH-specific syntax). Arvind and Nikhil wrote a textbook on pH programming.

There are two problems, still: one is that laziness means you can't actually prove you need something until very close to the time you actually want it. By the time I know that I'm adding f x to g y, it's probably too late to usefully run them in parallel (unless they're both *very* large). We used eager evaluation in pH---to the point that we actually gave up the ability to manipulate infinite lazy data structures. In NDP they've done much the same thing, first instrumenting the program to see that the eagerness they introduce won't disrupt execution. Even the "par" annotation has this feel: we are telling the implementation that it's OK to do some computation even if it isn't yet obvious that we'll need the results.

The second problem is controlling the overhead. More on this below.
The key idea of Cilk is that it's easier to deparallelize than it is to parallelize, especially automatically. So the idea is that the program is written incredibly parallel, with huge numbers of microthreads, which are (on average) very cheap to spawn. The runtime then deparallelizes the microthreads, running multiple microthreads sequentially within a single real thread (a worker thread). Microthreads that live their entire life within a single real thread are cheap to spawn (as in "not much more expensive than a normal function call" cheap).
The problem here is that while Cilk spawns are incredibly cheap, they're still more than a simple procedure call (2-10x as expensive if my fading memory serves me rightly). Let's imagine we have a nice, parallelizable computation that we've expressed using recursive subdivision (the Cilk folks like to use matrix multiplication as an example here). Near the leaves of that computation we still spend the majority of our time paying the overhead of spawning. So we end up actually imposing a depth bound, and writing two versions of our computation---the one that spawns, which we use for coarse-grained computations, and the one that doesn't, which we use when computation gets fine-grained. It makes a really big difference in practice.

The programmer is free to use this trick in any programming language. But we haven't yet figured out a way to *avoid* the need to do so. This continues to worry me to this day, because making the right choices is black magic and specific to a particular combination of algorithm and machine. That said, there is some reason for optimism: the overhead of creating work in Cilk is comparable to the overhead of creating a thunk in Haskell.
The problem that Cilk runs into is that it's, well, C. It doesn't deal with contention at well at all- a single microthread blocking blocks the whole worker thread- meaning, among other things, that you can have "false deadlocks", where one microthread blocks on another microthread in the same real thread, and thus acts like it's deadlocked even though it really isn't.
This is actually a fundamental problem with the threading model: there is no guarantee of fairness using work stealing, so if you do something that requires fair scheduling you get into a lot of trouble fast. It's not fair to blame C for this. You have to be very careful to define the interaction between fair IO-style threads and unfair get-my-work-done threads.
You have greatly increased the likelyhood of raceconditions as well (mutable data and multithreading just don't mix). Plus you have all the normal fun you have with C bugs- wild pointers, buffer over runs, etc.
This, however, *is* C's fault. :-) More on pH: we got our programs to scale, but had troubles going past 8 threads. We found ourselves limited by a non-parallel GC (my fault, but labor-intensive to get right) and the absence of parallelism in the underlying algorithms. For the latter problem there simply is no magic bullet. -Jan-Willem Maessen

Hi!
I am thinking about a model where you would have only n threads on a
n-core (or processor) machine. They would be your worker threads and
you would spawn them only once (at the beginning of the program) and
then just delegate work between them.
On 8/13/07, Jan-Willem Maessen
The problem here is that while Cilk spawns are incredibly cheap, they're still more than a simple procedure call (2-10x as expensive if my fading memory serves me rightly). Let's imagine we have a nice, parallelizable computation that we've expressed using recursive subdivision (the Cilk folks like to use matrix multiplication as an example here). Near the leaves of that computation we still spend the majority of our time paying the overhead of spawning. So we end up actually imposing a depth bound, and writing two versions of our computation---the one that spawns, which we use for coarse-grained computations, and the one that doesn't, which we use when computation gets fine-grained. It makes a really big difference in practice.
But this could be done at the runtime too. If the lazy-not-yet-evaluated chunk is "big", then divide it into a few parts and run each part in its own thread. But if the chunk is small (you are at the end of the evaluation and you have already evaluated the necessary subexpressions), you do it in the thread which encounters this situation (which is probably near the end of the program or the end of the monadic IO action). And this line, where you choose whether or not to delegate, can be determined at runtime too. In combination with some transactional memory or some other trick behind this delegation, this could probably be made possible.

We could also hint to the runtime that a function will probably take a long time to compute (even if it is lazy) by making a type for such functions which would signal this. Of course, this could also make things worse if used improperly. But sometimes you know that you will be running a map of a time-consuming function.

Yes, you have parMap, but the problem I saw with it (and please correct me if I am wrong) is that it spawns a new thread for every application of the function to an element? But what if the functions are small? Then this is quite an overhead. And you have to know this in advance if you want to use something other than the default parMap, which is not always possible (if we are passing a function as an argument to the function which calls map). For example:

calculate f xs = foldl (+) 0 $ map f xs -- or foldr, I am not sure

(A sketch of this granularity choice, made by hand, appears at the end of this message.)

And I would like to see something like this: it gets to the point where we need to evaluate this function call, for some "big" f and some "big" list xs, so the thread which gets to it starts evaluating the first value, and when it starts on another (it is a recursive call, so it is a "similar" evaluation) it sees that the other thread is idle and that the function will probably take a long time (it speculates), so it delegates it there and continues with the third element, that is, a dummy recursive call to the function, in this case foldl (dummy because it did not really evaluate everything at the previous level). Now, if the second thread finishes first, it goes to the next element (recursive call) but sees that it is already (still) being evaluated, so it gets to the fourth. Otherwise, if the first thread finishes first, it just goes to the next element.

This would be some kind of speculative evaluation. If the CPUs know how to do that, why wouldn't we at a higher level? It would also be an interesting lazy evaluation of, in this example, the foldl function. The system would make a recursive call but would just go another call deeper if it finds that it is impossible (because it is still being evaluated somewhere else) to evaluate it. And at every level it finishes, it will check previous levels to see if it can collapse them (and maybe finish the whole thing prematurely).

It would be as if I unwind the loop (in this case recursion), evaluate everything (on many threads) and then (try to) calculate the value. If it finishes prematurely... sad, but for "big" lists and "big" functions it would be a win. And the size of this window (unwinding) would be a heuristic, speculative function of the number of possible threads (cores, processors) and of information about how long previous evaluations of the same function have taken.

I really messed this explanation up. Or maybe it is completely wrong and this is why it looks like a mess to me. :-)

Mitar
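To make the parMap worry concrete (as far as I know, parMap creates sparks rather than OS threads, and sparks are multiplexed over a few worker threads, but the per-element overhead concern is still real for cheap functions): with today's Strategies library the granularity choice is made by hand, as in this sketch, where parListChunk amortises the spark overhead for small functions:

    import Control.Parallel.Strategies (parList, parListChunk, rseq, using)

    -- per-element sparks, like parMap: fine when f is expensive
    calculate :: (Int -> Int) -> [Int] -> Int
    calculate f xs = foldl (+) 0 (map f xs `using` parList rseq)

    -- chunked sparks: amortises the overhead when f is cheap
    calculateChunked :: (Int -> Int) -> [Int] -> Int
    calculateChunked f xs = foldl (+) 0 (map f xs `using` parListChunk 1000 rseq)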

On Aug 13, 2007, at 2:53 PM, Mitar wrote:
Hi!
I am thinking about a model where you would have only n threads on a n-core (or processor) machine. They would be your worker threads and you would spawn them only once (at the beginning of the program) and then just delegate work between them.
On 8/13/07, Jan-Willem Maessen
wrote: The problem here is that while Cilk spawns are incredibly cheap, they're still more than a simple procedure call (2-10x as expensive if my fading memory serves me rightly). Let's imagine we have a nice, parallelizable computation that we've expressed using recursive subdivision (the Cilk folks like to use matrix multiplication as an example here). Near the leaves of that computation we still spend the majority of our time paying the overhead of spawning. So we end up actually imposing a depth bound, and writing two versions of our computation---the one that spawns, which we use for coarse-grained computations, and the one that doesn't, which we use when computation gets fine-grained. It makes a really big difference in practice.
But this could be done at the runtime too. If the lazy-non-evaluated-yet chunk is "big" then divide it into a few parts and run each part in its thread. But if the chunk is small (you are at the end of the evaluation and you already evaluated necessary subexpressions) you do it in the thread which encounters this situation (which is probably near the end of the program or the end of the monadic IO action).
I didn't make my point very well. The hard part is determining exactly when a chunk is "big" or "small" without actually computing its value. Consider recursive fib (which I use only because it's easy to understand, and has been used as an example of this problem by the Cilk implementors):

fib n = if n <= 1 then n else fib (n-1) + fib (n-2)

Imagine we're computing (fib 30). We can divide and conquer; plenty of parallelism there! But do most calls to fib represent enough work to justify running them in parallel? No, because most of the calls are to (fib 0) or (fib 1)! We should only pay the spawn cost up to a certain bound --- say n >= 5 --- and then run serially for smaller n. This has a dramatic effect on how fast fib runs, but of course the best possible choice of bound is going to be machine-dependent.

We can instrument our program and have some chance of doing OK for fib :: Int -> Int; it's not at all obvious what to do for:

myFunction :: [Int] -> (Int -> Bool) -> [Frob]

In effect, I end up needing to write myFunction with a built-in bound on computation, and I need to do it in such a way that the underlying system knows that one branch should be serial and the other branch parallel. This is the problem I was attempting to allude to above. Yes, we can decide on a function-by-function or callsite-by-callsite basis that "enough work" is being done to justify parallelism. But the answer to this question is often "maybe", or even "no" (as in fib).
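Written out, the two-version trick for fib looks like the sketch below (the cutoff of 5 is exactly the machine-dependent constant I mean):

    import Control.Parallel (par, pseq)

    fib :: Int -> Int
    fib n = if n <= 1 then n else fib (n - 1) + fib (n - 2)

    -- spark only above the cutoff; below it, fall back to the serial version
    pfib :: Int -> Int
    pfib n
      | n <= 1    = n
      | n < 5     = fib n
      | otherwise = x `par` (y `pseq` (x + y))
      where
        x = pfib (n - 1)
        y = pfib (n - 2)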
Yes, you have parMap but the problem I saw with it (and please correct me if I am wrong) is that it spawns a new thread for every application of the function to the element?
But what if functions are small? Then this is quite an overhead. And you have to know this in advance if you want to use something else than the default parMap which is not always possible (if we are passing a function as an argument to the function which calls map).
For example:
calculate f xs = foldl (+) 0 $ map f xs -- or foldr, I am not sure
You seem to be arguing that we can pick the right implementation of "map" and "fold" (serial or parallel) here if we only knew that xs was "big enough" and f "expensive enough". I agree. But that begs the question: let's assume "calculate" is a function that's called from numerous places, with a mixture of "big" and "small" arguments. Now I need two versions of calculate, and I need to decide at each call site whether to call "big calculate" or "small calculate".

We also need to make sure any eagerness we introduce is semantically sound, too, but I think we've got a pretty good handle on that part in practice between my work on resource-bounded evaluation in Eager Haskell, Rob Ennals' work on eager evaluation in GHC, and the Singh & Harris paper.

That isn't to say that any of this is impossible---but it's going to take a while to figure out how to get it right [it'd be a great Ph.D. project to even knock off the low-hanging fruit like fib and recursive matrix multiply]. Meanwhile, we also need to work hard educating programmers how to write code that'll run in parallel in the first place, and giving them the libraries that'll make it easy.

-Jan-Willem Maessen

On 8/11/07, Neil Bartlett
You're absolutely right that a dynamic/adaptive approach is the only one that will work when the tasks are of unknown size. Whether this approach is as easy as you think is open for you to prove. I look forward to testing your VM implementation,
Well... obviously migrating Haskell to use a VM is itself non-trivial ;-) There are two obstacles:
- technical
- political

The technical obstacle means implementing it. Given that high-performance VMs exist, this is largely pure software engineering, rather than research?

The political obstacle means persuading people to use it if it were written. If no-one uses it, it won't be maintained, and is basically pretty useless. The main reasons why it might not be used are:
- it breaks the status quo / de facto standard
- it provides no advantage in a single-core environment

Breaking the status quo is not an inconsiderable obstacle, but it would be broken if there were a real advantage to using automatic threading, which there is not right now because most machines are single-cored. Whilst it's the right time to think about how to implement things, it's maybe a year or two early to actually implement it and expect people to use it.

What I think is:
- automatic threading is not really that hard. Once you've got a pure FP running in a VM, the actual automatic threading bit is pretty easy (largely software engineering, not research)
- when machines become multicored, Microsoft will just take F# (which already runs in a VM I guess? but not sure if it's an FP-dedicated VM; they might need to build one), and just slot in the automatic threading bit.
or at the very least reading your paper on the subject ;-)
Writing a paper would be fun. I think I'm a little out of my depth to be writing a paper ;-) but just on the off-chance, how does one go about writing a paper and getting it published? Does one have to be a member of an accredited institution, or can one write one as a "freelancer"? If one has to be a member of an accredited institution, what are the options?

On 8/21/07, Hugh Perkins
On 8/11/07, Neil Bartlett
wrote: You're absolutely right that a dynamic/adaptive approach is the only one that will work when the tasks are of unknown size. Whether this approach is as easy as you think is open for you to prove. I look forward to testing your VM implementation,
Well... obviously migrating Haskell to use a VM is itself non-trivial ;-) There are two obstacles: - technical - political
The technical obstacle means implementing it. Given that high-performance VMs exist, this is largely pure software engineering, rather than research?
GHCi, of course, is a bytecode interpreter, so that's sort of like a VM. You might start by looking at how GHCi works and see what you would need to change if performance rather than interactivity was your goal.
The political obstacle means persuading people to use it if it were written. If no-one uses it, it won't be maintained, and is basically pretty useless. The main reasons why it might not be used are: - it breaks the status quo / de facto standard - it provides no advantage in a single-core environment
Breaking the status quo is not an inconsiderable obstacle, but it would be broken if there were a real advantage to using automatic threading, which there is not right now because most machines are single-cored. Whilst it's the right time to think about how to implement things, it's maybe a year or two early to actually implement it and expect people to use it.
I don't think you have to worry too much about the political obstacles. People want automatic multithreading, and in a year or two we'll all have multicore boxen. In any case, if you don't solve the technical problems, the political ones will never surface; if you build it, people will come, or if they don't, you at least know something that you wouldn't have known if you didn't build it :-)
Writing a paper would be fun. I think I'm a little out of my depth to be writing a paper ;-) but just on the off-chance, how does one go about writing a paper and getting it published? Does one have to be a member of an accredited institution, or can one write one as a "freelancer"? If one has to be a member of an accredited institution, what are the options?
Anyone can submit a paper to a CS journal or conference. While most people who do so are affiliated with universities, research labs, or (more rarely) non-research companies, there are independent researchers out there, and sometimes you'll notice a paper where someone is listed by just their name with no affiliation. Conferences issue calls for papers (you might see some posted on this mailing list) that give you an idea for the rough format of the paper and submission guidelines. But really, you'll want to find a mentor who can give you advice on how to write a paper that will fit the mold. First come up with a technical result that you believe is paper-worthy, then find other people to talk to who can confirm that opinion and help you get your paper submitted :-) Cheers, Tim -- Tim Chevalier * catamorphism.org

On 8/21/07, Tim Chevalier
I don't think you have to worry too much about the political obstacles. People want automatic multithreading, and in a year or two we'll all have multicore boxen. In any case, if you don't solve the technical problems, the political ones will never surface; if you build it, people will come, or if they don't, you at least know something that you wouldn't have known if you didn't build it :-)
Ok, that's good encouragement. Practical question: I don't have a multicore box ;-) It's my main show-stopper right now. Any clues on how to get access to one, e.g. via ssh? 32-core or higher would be favorite ;-) but I guess even just a 4-core or so is enough for proof-of-concept?
GHCi, of course, is a bytecode interpreter, so that's sort of like a VM. You might start by looking at how GHCi works and see what you would need to change if performance rather than interactivity was your goal.
Yes, I guess two approaches are to take GHCi and make it handle automatic threading, but keeping the interactivity, i.e. not seeking to rival GHC in real performance, but merely providing a PoC... or to build a minimal VM, enough to get 3 or 4 somewhat interesting algorithms/programs to run, and get automatic threading working on a couple of targets, e.g. on maps, and whatever [ x | x <- somelist ] these things are called (list comprehensions). (Folds are a little harder from an implementation point of view, so they can be a future upgrade.)
Anyone can submit a paper to a CS journal or conference. While most people who do so are affiliated with universities, research labs, or (more rarely) non-research companies, there are independent researchers out there, and sometimes you'll notice a paper where someone is listed by just their name with no affiliation.
Again, that's quite encouraging :-) I'm far too lazy to sign my life away for 7 years of PhD :-D (unless someone knows anyone looking for PhD students somewhere exotic like Paris, China or Singapore???), but working on it in my own time sounds fun.
First come up with a technical result that you believe is paper-worthy
I guess a "paperworthy technical result" doesnt have to be a fully fledged Haskell VM with in-depth automatic threading?, but either GHCi with some simple automatic threading, or a minimal vm implementation, again with some simple automatic threading? Basically, we give it one or three algorithms with automatic threading switched off, time execution, then run it on (ideally 32 or 64 cores but 4 is ok) a multicore machine, and show that the execution elapsed time is less?
But really, you'll want to find a mentor who can give you advice on how to write a paper that will fit the mold [...] then find other people to talk to who can confirm that opinion and help you get your paper submitted :-)
Would you or Neil fancy being a mentor for this, if I can start to get somewhere on it?

I don't have a multicore box ;-) It's my main show-stopper right now. Any clues on how to get access to one, e.g. via ssh? 32-core or higher would be favorite ;-) but I guess even just a 4-core or so is enough for proof-of-concept?
I think you'll have plenty of work to do before you get to the stage of needing a box with more than 4 cores. Even a dual-core machine is likely to be enough for initial experimentation, I would say.
... or to build a minimal VM, enough to get 3 or 4 somewhat interesting algorithms/programs to run, and get automatic threading working on a couple of targets, e.g. on maps, and whatever [ x | x <- somelist ] these things are called (list comprehensions). (Folds are a little harder from an implementation point of view, so they can be a future upgrade.)
The other thing to consider is that there are several other VMs out there, including many under open source licenses, that can be used as a testbed. Examples include the Java VM, Mono, Parrot, LLVM, etc.
Would you or Neil fancy being a mentor for this, if I can start to get somewhere on it?
Not me! I'm not an academic... I've never had a paper published and I'm not likely to either. Regards Neil

Tim Chevalier wrote:
Anyone can submit a paper to a CS journal or conference. While most people who do so are affiliated with universities, research labs, or (more rarely) non-research companies, there are independent researchers out there, and sometimes you'll notice a paper where someone is listed by just their name with no affiliation. Conferences issue calls for papers (you might see some posted on this mailing list) that give you an idea for the rough format of the paper and submission guidelines. But really, you'll want to find a mentor who can give you advice on how to write a paper that will fit the mold. First come up with a technical result that you believe is paper-worthy, then find other people to talk to who can confirm that opinion and help you get your paper submitted :-)
I highly doubt that automatic threading will happen any time this decade - but I just learned something worth while from reading this email. ;-)

On 8/21/07, Andrew Coppin wrote:
I highly doubt that automatic threading will happen any time this decade - but I just learned something worthwhile from reading this email. ;-)
That's an interesting observation. I can't say I don't believe it, but I'm interested to know why (it could be just a feeling, or an observation about time-to-market lead times?). Are you saying this because multicores aren't sufficiently widespread or powerful yet (4 cores doesn't really even make up for the overhead of using automatic threading, at least in initial implementations)? Or are you saying it because you think the technical implementations are not sufficiently advanced? I kind of think automatic threading is like 3D graphics: as soon as the hardware became sufficiently powerful, 3D graphics became trivial. Enough money was thrown at the problem in a very short time by a few powerful companies that it was a non-issue. Nevertheless, if I could get a paper out of it before the big companies notice, that could be fun :-D

On Wed, Aug 22, 2007 at 04:07:22AM +0100, Hugh Perkins wrote:
On 8/21/07, Andrew Coppin wrote:
I highly doubt that automatic threading will happen any time this decade - but I just learned something worthwhile from reading this email. ;-)
That's an interesting observation. I can't say I don't believe it, but I'm interested to know why (it could be just a feeling, or an observation about time-to-market lead times?). Are you saying this because multicores aren't sufficiently widespread or powerful yet (4 cores doesn't really even make up for the overhead of using automatic threading, at least in initial implementations)? Or are you saying it because you think the technical implementations are not sufficiently advanced?
Automatic threading is inherently limited by data dependencies. You can't run a function that branches on an argument in parallel with the computation producing that argument. Even with arbitrarily many processors and no communication overhead there is a limit to how much parallelism you can squeeze from any given program.

You should read "Feedback Directed Implicit Parallelism" http://research.microsoft.com/~tharris/papers/2007-fdip.pdf and perhaps "Limits to Implicit Parallelism in Functional Applications" http://www.detreville.org/papers/Limits.pdf

In short, with zero overhead and an oracle for scheduling you can get a factor of at most 2x to 32x by implicitly parallelizing existing Haskell code. In practice, with execution overhead it's a gain of perhaps 10% to 80% on a 4-core system. The experiments in the first paper are based on a fairly sophisticated implementation that reduces overhead by using profiling results at compile time to decide which thunks might be worth evaluating in parallel. For a fixed benchmark there's probably not much lost by using canned profiling results instead of adapting at runtime, and in any case the hard bounds from data dependencies still apply.

You can do a lot better if you expect people to rewrite code, but "automatic threading" suggests something completely effortless. I think you can get much better results if you work on the programming style in connection with a better runtime. You can think of Data Parallel Haskell as a new programming style with more implicit parallelism, and the runtime support to exploit it.
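A tiny example of the dependency limit, in terms of GHC's par/pseq primitives (g and h are hypothetical placeholder workloads):

    import Control.Parallel (par, pseq)

    g, h :: Int -> Int
    g x = sum [1 .. x]                -- placeholder work
    h x = product [1 .. x `mod` 10]   -- placeholder work

    -- g x and h x are independent, so one can be sparked off
    -- while the other runs:
    independent :: Int -> Int
    independent x = gx `par` (hx `pseq` (gx + hx))
      where gx = g x
            hx = h x

    -- Here the branch cannot be chosen until g x is fully
    -- evaluated, so extra cores buy nothing:
    dependent :: Int -> Int
    dependent x = case g x of
                    0 -> 1
                    n -> n * 2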
I kind of think automatic threading is like 3D graphics: as soon as the hardware became sufficiently powerful, 3D graphics became trivial. Enough money was thrown at the problem in a very short time by a few powerful companies that it was a non-issue.
If you have cores to waste, you might try rewrites like

    f x => case x of
             C1 a1 a2 -> f (C1 a1 a2)
             C2 b     -> f (C2 b)
             C3       -> f C3

and then speculatively execute several of the case branches. If you don't throw away too much type information you should even be able to do it at runtime. Brandon
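The payload-free special case of this can already be written with par today: spark both possible results while the scrutinee is still being computed, then commit to one (a sketch; speculativeIf is an invented name, and the losing spark's work is simply wasted):

    import Control.Parallel (par)

    -- Spark both branches while 'cond' is still being evaluated;
    -- whichever branch is selected is (hopefully) already done.
    speculativeIf :: Bool -> Int -> Int -> Int
    speculativeIf cond t e = t `par` (e `par` (if cond then t else e))

Speculating over constructors that carry payloads, as in the rewrite above, is harder, since the fields a1, a2, b aren't available until the scrutinee has been evaluated.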

On 8/22/07, Brandon Michael Moore wrote:
Automatic threading is inherently limited by data dependencies. You can't run a function that branches on an argument in parallel with the computation producing that argument. Even with arbitrarily many processors and no communication overhead there is a limit to how much parallelism you can squeeze from any given program.
Yes. It's worse than that in fact, because many real-world problems will involve functions that have side-effects where we know the side-effects don't matter. The parallelisation algorithm of course doesn't know they don't matter (unless we tell it). Example: imagine we want to copy files from one machine to five others. Copying a file has a clear side-effect, but since we're copying to 5 independent machines, we can copy to each machine in parallel. The algorithm doesn't know that this is OK.
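In today's Haskell you'd state that independence explicitly, e.g. by forking one thread per destination with forkIO and waiting on MVars (a sketch; copyTo is a hypothetical stand-in for the real copying action):

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
    import Control.Monad (forM, forM_)

    -- Hypothetical action copying a file to one machine.
    copyTo :: FilePath -> String -> IO ()
    copyTo file host = putStrLn ("copying " ++ file ++ " to " ++ host)

    -- The programmer, not the compiler, asserts that the five
    -- copies are independent by running them concurrently.
    copyToAll :: FilePath -> [String] -> IO ()
    copyToAll file hosts = do
      dones <- forM hosts $ \h -> do
        done <- newEmptyMVar
        _ <- forkIO (copyTo file h >> putMVar done ())
        return done
      forM_ dones takeMVar   -- block until every copy has finished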
You should read "Feedback Directed Implicit Parallelism" http://research.microsoft.com/~tharris/papers/2007-fdip.pdf and perhaps "Limits to Implicit Parallelism in Functional Applications" http://www.detreville.org/papers/Limits.pdf
Ok
In short, with zero overhead and an oracle for scheduling you can get a factor of at most 2x to 32x by implicitly parallelizing existing Haskell code. In practice, with execution overhead it's a gain of perhaps 10% to 80% on a 4-core system.
This is a good argument that it's not enough to prototype on a 4-core system; we really need some way to simulate a 1024-core system to carry out meaningful benchmarks.
You can do a lot better if you expect people to rewrite code, but "automatic threading" suggests something completely effortless.
Yes, I tend to overuse the word "automatic" ;-)
I think you can get much better results if you work on the programming style in connection with a better runtime. You can think of Data Parallel Haskell as a new programming style with more implicit parallelism, and the runtime support to exploit it.
Yes, you're right.
If you have cores to waste, you might try rewrites like
f x => case x of
         C1 a1 a2 -> f (C1 a1 a2)
         C2 b     -> f (C2 b)
         C3       -> f C3
and then speculatively execute several of the case branches. If you don't throw away too much type information you should even be able to do it at runtime.
That's a good observation. Sort of anti-laziness :-D

On Thu, Aug 23, 2007 at 06:27:43AM +0100, Hugh Perkins wrote:
On 8/22/07, Brandon Michael Moore wrote:
Automatic threading is inherently limited by data dependencies. You can't run a function that branches on an argument in parallel with the computation producing that argument. Even with arbitrarily many processors and no communication overhead there is a limit to how much parallelism you can squeeze from any given program.
Yes. It's worse than that in fact, because many real-world problems will involve functions that have side-effects where we know the side-effects don't matter. The parallelisation algorithm of course doesn't know they don't matter (unless we tell it).
Example: imagine we want to copy files from one machine to five others. Copying a file has a clear side-effect, but since we're copying to 5 independent machines, we can copy to each machine in parallel. The algorithm doesn't know that this is OK.
Actually, the algorithm was right, and the intuitive argument is wrong. Suppose all five computers are actually re-exporting a network filesystem hosted on a sixth server, and the file names alias. By overruling the algorithm, you've introduced a corner-case bug that will go unnoticed for years.
If you have cores to waste, you might try rewrites like
f x => case x of
         C1 a1 a2 -> f (C1 a1 a2)
         C2 b     -> f (C2 b)
         C3       -> f C3
and then speculatively execute several of the case branches. If you don't throw away too much type information you should even be able to do it at runtime.
That's a good observation. Sort of anti-laziness :-D
It's called speculative execution, and microprocessors have been doing it for many years as part of their built-in automatic parallelization circuitry. (If you think emulator-based autoparallelization is going to be easy, consider that it takes a couple of million transistors to do a reasonable job - and a chunk of '90s silicon can already do a million things at once, so you're at a huge starting disadvantage. Just my two cents.) Stefan

Couple of thoughts/observations:
- Erlang has a VM, so that would avoid building one. On the downside, Erlang is not pure: the message-passing and the io: commands imply the possibility of side-effects. Still, it could be good enough for a proof-of-concept?
- Implementation as a library function? Instead of building/modifying a VM to re-implement how map works, perhaps an easy way of doing this would be to create a new map function (dpmap? (= dynamic parallel map)) which chats to a scheduler process, which decides whether to split the map or not; see the sketch after this list. Downside: constant overhead for all maps, even really small ones that would otherwise run really fast and will ultimately be assigned only a single process anyway. (Note: using Erlang terminology here, where "process" essentially means "thread".) Of course, even modifying the VM to parallelize maps etc. would have a constant overhead, but possibly the overhead could be much smaller if implemented directly in a VM. Possibly an approach could be:
  - implement as a new dpmap library function
  - someone could optimize this in a VM later
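Here's a first cut at the library-function route in Haskell rather than Erlang (a sketch: dpmap, the threshold and the chunk size are all invented here, with a crude length test standing in for the chat with a scheduler process):

    import Control.Parallel.Strategies (parListChunk, rdeepseq, using)
    import Control.DeepSeq (NFData)

    -- "Dynamic parallel map": plain map for small inputs, chunked
    -- parallel evaluation for larger ones. A real version would ask
    -- a scheduler instead of using a hard-coded threshold.
    dpmap :: NFData b => (a -> b) -> [a] -> [b]
    dpmap f xs
      | length xs < threshold = map f xs
      | otherwise             = map f xs `using` parListChunk chunk rdeepseq
      where
        threshold = 64   -- invented cut-off
        chunk     = 16   -- invented chunk size

Note the constant overhead mentioned above shows up here as the length test (which also forces the spine of the list) on every call.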

On 8/21/07, Andrew Coppin wrote:
I highly doubt that automatic threading will happen any time this decade
Hm. I happen to have an old Sun C compiler on my old UltraSPARC box. cc -V => Sun Workshop 6 update 2 C 5.3 2001/05/15. One of its options is '-xautopar'. I'll let you guess what that does... I note that a company called Tilera claims to be shipping a 64-CPU chip (some flavour of MIPS, I gather).
participants (15):
- Andrew Coppin
- Bayley, Alistair
- Brandon Michael Moore
- Brian Hurt
- Bulat Ziganshin
- dons@cse.unsw.edu.au
- Hugh Perkins
- Jan-Willem Maessen
- Mitar
- Neil Bartlett
- ok
- Sebastian Sylvan
- Stefan O'Rear
- Thomas Conway
- Tim Chevalier