
Hi!
I am thinking about a model where you would have only n threads on an
n-core (or n-processor) machine. They would be your worker threads,
you would spawn them only once (at the beginning of the program), and
then just delegate work between them.
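To make that concrete, here is a rough sketch of such a worker pool in today's Haskell, using only forkIO and channels from base. The name runPool and the Int-to-Int job type are made up for illustration; this is just the shape of the idea, not a real implementation:

```haskell
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.Chan (newChan, readChan, writeChan)
import Control.Monad (forever, replicateM_)

-- Spawn one worker per capability once, then delegate work via a channel.
-- (runPool is a hypothetical name; results are summed, so order is irrelevant.)
runPool :: (Int -> Int) -> [Int] -> IO Int
runPool f jobsIn = do
  n    <- getNumCapabilities
  jobs <- newChan
  done <- newChan
  replicateM_ (max 1 n) $ forkIO $ forever $ do
    job <- readChan jobs          -- worker loop: take a job,
    writeChan done (f job)        -- push back the result
  mapM_ (writeChan jobs) jobsIn
  results <- mapM (const (readChan done)) jobsIn
  return (sum results)

main :: IO ()
main = runPool (* 2) [1 .. 10] >>= print
```

The workers outlive any single job; delegating more work is just another writeChan, with no further thread creation.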
On 8/13/07, Jan-Willem Maessen wrote:
> The problem here is that while Cilk spawns are incredibly cheap,
> they're still more than a simple procedure call (2-10x as expensive
> if my fading memory serves me rightly). Let's imagine we have a nice,
> parallelizable computation that we've expressed using recursive
> subdivision (the Cilk folks like to use matrix multiplication as an
> example here). Near the leaves of that computation we still spend the
> majority of our time paying the overhead of spawning. So we end up
> actually imposing a depth bound, and writing two versions of our
> computation---the one that spawns, which we use for coarse-grained
> computations, and the one that doesn't, which we use when computation
> gets fine-grained. It makes a really big difference in practice.
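In Haskell terms, the two-version, depth-bounded pattern Jan-Willem describes might look roughly like this, using par and pseq from GHC.Conc on the classic Fibonacci example (the threshold value is just a tuning knob, not anything prescribed):

```haskell
import GHC.Conc (par, pseq)

-- Plain serial version, used near the leaves where spawn
-- overhead would dominate the actual work.
sfib :: Int -> Integer
sfib n | n < 2     = fromIntegral n
       | otherwise = sfib (n - 1) + sfib (n - 2)

-- Spawning version, used while the subproblems are still coarse.
-- Below the depth bound it falls back to the serial version.
pfib :: Int -> Int -> Integer
pfib threshold n
  | n < threshold = sfib n
  | otherwise     = x `par` (y `pseq` (x + y))
  where
    x = pfib threshold (n - 1)
    y = pfib threshold (n - 2)
```

Having to write (and tune) both versions by hand is exactly the annoyance being discussed.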
But this could be done by the runtime too. If an unevaluated chunk is "big", divide it into a few parts and run each part in its own thread. But if the chunk is small (you are near the end of the evaluation and you have already evaluated the necessary subexpressions), you do it in the thread which encounters this situation (which is probably near the end of the program, or the end of the monadic IO action). The threshold at which you choose to delegate or not could be determined at runtime too. In combination with some transactional memory, or some other trick hidden behind this delegation, this could probably be made possible.

We could also hint to the runtime that a function will probably take a long time to compute (even if it is lazy) by making a type for such functions which would signal this. Of course, this could also make things worse if used improperly. But sometimes you know that you will be mapping a time-consuming function.

Yes, you have parMap, but the problem I saw with it (and please correct me if I am wrong) is that it spawns a new thread for every application of the function to an element. What if the functions are small? Then this is quite an overhead. And you have to know this in advance if you want to use something other than the default parMap, which is not always possible (e.g. if we are passing a function as an argument to a function which calls map).
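One way to blunt the per-element overhead is to spark per chunk instead of per element. Here is a minimal sketch using only par and pseq from base (the name parMapChunked and the chunk size are my own inventions; the parallel package's Control.Parallel.Strategies offers similar combinators):

```haskell
import GHC.Conc (par, pseq)

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- One spark per chunk of n elements, instead of one per element.
parMapChunked :: Int -> (a -> b) -> [a] -> [b]
parMapChunked n f xs = concat (go (map (map f) (chunksOf n xs)))
  where
    forceList = foldr pseq ()          -- force each element to WHNF
    go []       = []
    go (c : cs) = forceList c `par` (c : go cs)
```

But note that this just moves the problem: now the chunk size n is the magic number that has to be known in advance, which is the complaint above.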
For example:

  calculate f xs = foldl (+) 0 $ map f xs  -- or foldr, I am not sure

And I would like to see something like this: evaluation gets to the point where we need this function call, for some "big" f and some "big" list xs. The thread which gets to it starts evaluating the first value. When it starts on the second (it is a recursive call, so it is a "similar" evaluation), it sees that another thread is idle and speculates that the function will probably take a long time, so it delegates the second element there and continues with the third via a dummy recursive call to the function, in this case foldl (dummy because it did not really evaluate everything at the previous level). Now, if the second thread finishes first, it goes on to the next element (recursive call), but sees that it is already (still) being evaluated, so it moves on to the fourth. Otherwise, if the first thread finishes first, it just goes on to the next element.

This would be a kind of speculative evaluation. If CPUs know how to do that, why would we not do it at a higher level? It would also be an interesting lazy evaluation of, in this example, the foldl function: the system would make a recursive call, but would just go one call deeper whenever it finds it impossible (because it is still being evaluated somewhere else) to evaluate something. And at every level it finishes, it would check previous levels to see if it can collapse them (and maybe finish the whole thing prematurely). It would be as if I unwound the loop (in this case, the recursion), evaluated everything (on many threads), and then (tried to) calculate the value. If it finishes prematurely... sad, but for "big" lists and "big" functions it would be a win. And the size of this window (of unwinding) would be a heuristic, a speculative function of the number of available threads (cores, processors) and of how long previous evaluations of the same function have taken.

I really messed this explanation up.
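With today's primitives, the closest simple approximation I can write by hand is a chunked variant of that very calculate: sum each chunk serially, and spark the chunk sums so other cores can race ahead of the fold. This is only a static stand-in for the adaptive, speculative window described above (the chunk size n is fixed by the caller, not chosen by the runtime):

```haskell
import GHC.Conc (par)

-- The serial version from the example above.
calculate :: Num b => (a -> b) -> [a] -> b
calculate f xs = foldl (+) 0 (map f xs)

-- Chunked variant: spark each chunk's partial sum, combine at the end.
calculatePar :: Num b => Int -> (a -> b) -> [a] -> b
calculatePar n f xs = foldl (+) 0 (go (chunksOf n xs))
  where
    go []       = []
    go (c : cs) = let s = foldl (+) 0 (map f c)  -- serial sum of one chunk
                  in s `par` (s : go cs)         -- spark it, sum later
    chunksOf _ [] = []
    chunksOf k ys = let (h, t) = splitAt k ys in h : chunksOf k t
```

The speculative scheme above would, in effect, pick n (and grow or shrink it) at runtime from past timings of f, instead of forcing the caller to guess.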
Or maybe it is completely wrong, and this is why it looks like a mess to me. :-)

Mitar