Re: [Haskell-cafe] Dynamic thread management?

13 Aug 2007

      On Aug 13, 2007, at 2:53 PM, Mitar wrote:
...
Hi!
I am thinking about a model where you would have only n threads on a
n-core (or processor) machine. They would be your worker threads and
you would spawn them only once (at the beginning of the program) and
then just delegate work between them.
On 8/13/07, Jan-Willem Maessen  wrote:
...
The problem here is that while Cilk spawns are incredibly cheap,
they're still more than a simple procedure call (2-10x as expensive
if my fading memory serves me rightly).  Let's imagine we have a
nice, parallelizable computation that we've expressed using recursive
subdivision (the Cilk folks like to use matrix multiplication as an
example here).  Near the leaves of that computation we still spend
the majority of our time paying the overhead of spawning.  So we end
up actually imposing a depth bound, and writing two versions of our
computation---the one that spawns, which we use for coarse-grained
computations, and the one that doesn't, which we use when computation
gets fine-grained.  It makes a really big difference in practice.
But this could be done at the runtime too. If the
lazy-non-evaluated-yet chunk is "big" then divide it into a few parts
and run each part in its thread. But if the chunk is small (you are at
the end of the evaluation and you already evaluated necessary
subexpressions) you do it in the thread which encounters this
situation (which is probably near the end of the program or the end of
the monadic IO action).
I didn't make my point very well.  The hard part is determining  
exactly when a chunk is "big" or "small" without actually computing  
its value.  Consider recursive fib (which I use only because it's  
easy to understand, and has been used as an example of this problem  
by the Cilk implementors):

fib n = if n <= 1 then n else fib (n-1) + fib (n-2)

Imagine we're computing (fib 30).  We can divide and conquer; plenty  
of parallelism there!  But do most calls to fib represent enough work  
to justify running them in parallel?  No, because most of the calls  
are to (fib 0) or (fib 1)!  We should only pay the spawn cost up to a  
certain bound --- say n >= 5 --- and then run serially for smaller  
n.  This has a dramatic effect on how fast fib runs, but of course  
the best possible choice of bound is going to be machine-dependent.

We can instrument our program and have some chance of doing OK for  
fib :: Int -> Int; it's not at all obvious what to do for:
    myFunction :: [Int] -> (Int -> Bool) -> [Frob]

In effect, I end up needing to write myFunction with a built-in bound  
on computation, and I need to do it in such a way that the underlying  
systems knows that one branch should be serial and the other branch  
parallel.  This is the problem I was attempting to allude to above.   
Yes, we can decide on a function-by-function or callsite-by-callsite  
basis that "enough work" is being done to justify parallelism.  But  
the answer to this question is often "maybe", or even "no" (as in fib).
...
Yes, you have parMap but the problem I saw with it (and please correct
me if I am wrong) is that it spawns a new thread for every application
of the function to the element?
...
But what if functions are small? Then
this is quite an overhead. And you have to know this in advance if you
want to use something else than the default parMap which is not always
possible (if we are passing a function as an argument to the function
which calls map).
For example:
calculate f xs = foldl (+) 0 $ map f xs -- or foldr, I am not sure
You seem to be arguing that we can pick the right implementation of  
"map" and "fold" (serial or parallel) here if we only knew that xs  
was "big enough" and f "expensive enough".

I agree.  But that begs the question: let's assume "calculate" is a  
function that's called from numerous places, with a mixture of "big"  
and "small" arguments.  Now I need two versions of calculate, and I  
need to decide at each call site whether to call "big calculate" or  
"small calculate".  We also need to make sure any eagerness we  
introduce is semantically sound, too, but I think we've got a pretty  
good handle on that part in practice between my work on resource- 
bounded evaluation in Eager Haskell, Rob Ennals' work on eager  
evaluation in GHC, and the Singh & Harris paper.

That isn't to say that any of this is impossible---but it's going to  
take a while to figure out how to get it right [it'd be a great Ph.D.  
project to even knock off the low-hanging fruit like fib and  
recursive matrix multiply].

Meanwhile, we also need to work hard educating programmers how to  
write code that'll run in parallel in the first place, and giving  
them the libraries that'll make it easy.

-Jan-Willem Maessen