
On Aug 13, 2007, at 2:53 PM, Mitar wrote:
Hi!
I am thinking about a model where you would have only n threads on a n-core (or processor) machine. They would be your worker threads and you would spawn them only once (at the beginning of the program) and then just delegate work between them.
On 8/13/07, Jan-Willem Maessen
wrote: The problem here is that while Cilk spawns are incredibly cheap, they're still more than a simple procedure call (2-10x as expensive if my fading memory serves me rightly). Let's imagine we have a nice, parallelizable computation that we've expressed using recursive subdivision (the Cilk folks like to use matrix multiplication as an example here). Near the leaves of that computation we still spend the majority of our time paying the overhead of spawning. So we end up actually imposing a depth bound, and writing two versions of our computation---the one that spawns, which we use for coarse-grained computations, and the one that doesn't, which we use when computation gets fine-grained. It makes a really big difference in practice.
But this could be done at the runtime too. If the lazy-non-evaluated-yet chunk is "big" then divide it into a few parts and run each part in its thread. But if the chunk is small (you are at the end of the evaluation and you already evaluated necessary subexpressions) you do it in the thread which encounters this situation (which is probably near the end of the program or the end of the monadic IO action).
I didn't make my point very well. The hard part is determining exactly when a chunk is "big" or "small" without actually computing its value. Consider recursive fib (which I use only because it's easy to understand, and has been used as an example of this problem by the Cilk implementors): fib n = if n <= 1 then n else fib (n-1) + fib (n-2) Imagine we're computing (fib 30). We can divide and conquer; plenty of parallelism there! But do most calls to fib represent enough work to justify running them in parallel? No, because most of the calls are to (fib 0) or (fib 1)! We should only pay the spawn cost up to a certain bound --- say n >= 5 --- and then run serially for smaller n. This has a dramatic effect on how fast fib runs, but of course the best possible choice of bound is going to be machine-dependent. We can instrument our program and have some chance of doing OK for fib :: Int -> Int; it's not at all obvious what to do for: myFunction :: [Int] -> (Int -> Bool) -> [Frob] In effect, I end up needing to write myFunction with a built-in bound on computation, and I need to do it in such a way that the underlying systems knows that one branch should be serial and the other branch parallel. This is the problem I was attempting to allude to above. Yes, we can decide on a function-by-function or callsite-by-callsite basis that "enough work" is being done to justify parallelism. But the answer to this question is often "maybe", or even "no" (as in fib).
Yes, you have parMap but the problem I saw with it (and please correct me if I am wrong) is that it spawns a new thread for every application of the function to the element?
But what if functions are small? Then this is quite an overhead. And you have to know this in advance if you want to use something else than the default parMap which is not always possible (if we are passing a function as an argument to the function which calls map).
For example:
calculate f xs = foldl (+) 0 $ map f xs -- or foldr, I am not sure
You seem to be arguing that we can pick the right implementation of "map" and "fold" (serial or parallel) here if we only knew that xs was "big enough" and f "expensive enough". I agree. But that begs the question: let's assume "calculate" is a function that's called from numerous places, with a mixture of "big" and "small" arguments. Now I need two versions of calculate, and I need to decide at each call site whether to call "big calculate" or "small calculate". We also need to make sure any eagerness we introduce is semantically sound, too, but I think we've got a pretty good handle on that part in practice between my work on resource- bounded evaluation in Eager Haskell, Rob Ennals' work on eager evaluation in GHC, and the Singh & Harris paper. That isn't to say that any of this is impossible---but it's going to take a while to figure out how to get it right [it'd be a great Ph.D. project to even knock off the low-hanging fruit like fib and recursive matrix multiply]. Meanwhile, we also need to work hard educating programmers how to write code that'll run in parallel in the first place, and giving them the libraries that'll make it easy. -Jan-Willem Maessen