
On Aug 11, 2007, at 12:35 PM, Brian Hurt wrote:
> You guys might also want to take a look at the Cilk programming language, and how it manages threads. If you know C, learning Cilk is about 2 hours of work, as it's C with half a dozen extra keywords and a few new concepts. I'd love to see Cilk - C + Haskell as a programming language.
It was called pH, and we (meaning Alejandro Caro and myself) implemented it back in the mid/late 90's using Lennart Augustsson's hbcc front end (which he hacked to add a bunch of pH-specific syntax). Arvind and Nikhil wrote a textbook on pH programming.

Still, there are two problems. One is that laziness means you can't actually prove you need something until very close to the time you actually want it. By the time I know that I'm adding f x to g y, it's probably too late to usefully run them in parallel (unless they're both *very* large). We used eager evaluation in pH---to the point that we actually gave up the ability to manipulate infinite lazy data structures. In NDP they've done much the same thing, first instrumenting the program to check that the eagerness they introduce won't disrupt execution. Even the "par" annotation has this feel: we are telling the implementation that it's OK to do some computation even if it isn't yet obvious that we'll need the results.

The second problem is controlling the overhead. More on this below.
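For concreteness, here is the standard GHC idiom behind that "par" annotation (a minimal sketch, not anyone's production code; fib is just a stand-in for an expensive f x and g y, and you'd build it with ghc -threaded plus the parallel package):

    import Control.Parallel (par, pseq)

    -- fib: a deliberately expensive placeholder for the "f x" and
    -- "g y" in the paragraph above.
    fib :: Integer -> Integer
    fib n | n < 2     = n
          | otherwise = fib (n - 1) + fib (n - 2)

    -- Spark the left summand, evaluate the right one ourselves, then
    -- add: both are computed speculatively, before we can "prove" we
    -- need either result.
    parAdd :: Integer -> Integer -> Integer
    parAdd x y = fx `par` (gy `pseq` (fx + gy))
      where fx = fib x
            gy = fib y

    main :: IO ()
    main = print (parAdd 30 31)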
> The key idea of Cilk is that it's easier to deparallelize than it is to parallelize, especially automatically. So the idea is that the program is written to be incredibly parallel, with huge numbers of microthreads, which are (on average) very cheap to spawn. The runtime then deparallelizes the microthreads, running multiple microthreads sequentially within a single real thread (a worker thread). Microthreads that live their entire life within a single real thread are cheap to spawn (as in "not much more expensive than a normal function call" cheap).
The problem here is that while Cilk spawns are incredibly cheap, they're still more expensive than a simple procedure call (2-10x as expensive, if my fading memory serves me right). Let's imagine we have a nice, parallelizable computation that we've expressed using recursive subdivision (the Cilk folks like to use matrix multiplication as an example here). Near the leaves of that computation we still spend the majority of our time paying the overhead of spawning. So we end up actually imposing a depth bound and writing two versions of our computation---the one that spawns, which we use for coarse-grained computations, and the one that doesn't, which we use when the computation gets fine-grained. It makes a really big difference in practice; a sketch of the trick follows below. The programmer is free to use it in any programming language, but we haven't yet figured out a way to *avoid* the need to do so. This continues to worry me to this day, because making the right choices is black magic, specific to a particular combination of algorithm and machine. That said, there is some reason for optimism: the overhead of creating work in Cilk is comparable to the overhead of creating a thunk in Haskell.
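Here is a hedged Haskell sketch of that two-version trick, with GHC sparks standing in for Cilk spawns (sparks play the microthread role: the runtime multiplexes them over a fixed pool of worker threads). The names pfib/sfib and the cutoff of 12 are illustrative choices, not tuned values:

    import Control.Parallel (par, pseq)

    -- Coarse-grained version: sparks parallel work near the root of
    -- the recursion tree, then falls back to the sequential version
    -- once the remaining work is fine-grained.
    pfib :: Int -> Int -> Integer
    pfib depth n
      | n < 2      = fromIntegral n
      | depth <= 0 = sfib n                      -- fine-grained: stop spawning
      | otherwise  = l `par` (r `pseq` (l + r))  -- coarse-grained: spawn
      where l = pfib (depth - 1) (n - 1)
            r = pfib (depth - 1) (n - 2)

    -- Fine-grained version: no spawning, just a plain function.
    sfib :: Int -> Integer
    sfib n | n < 2     = fromIntegral n
           | otherwise = sfib (n - 1) + sfib (n - 2)

    main :: IO ()
    main = print (pfib 12 35)

Picking that depth bound is exactly the black-magic, per-machine tuning mentioned above.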
> The problem that Cilk runs into is that it's, well, C. It doesn't deal with contention well at all: a single microthread blocking blocks the whole worker thread, meaning, among other things, that you can have "false deadlocks", where one microthread blocks on another microthread in the same real thread, and thus appears deadlocked even though it really isn't.
This is actually a fundamental problem with the threading model: work stealing makes no guarantee of fairness, so if you do anything that requires fair scheduling you get into a lot of trouble fast. It's not fair to blame C for this. You have to be very careful to define the interaction between fair IO-style threads and unfair get-my-work-done threads; a small GHC-flavored sketch of that split follows.
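In GHC the split is explicit (offered as an analogy in GHC's terms, not a claim about Cilk): forkIO threads are scheduled roughly fairly and may block, while sparks created by par carry no fairness guarantee at all.

    import Control.Concurrent (forkIO, threadDelay)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
    import Control.Parallel (par)

    main :: IO ()
    main = do
      done <- newEmptyMVar
      _ <- forkIO $ do           -- a fair, IO-style thread...
             threadDelay 1000    -- ...which may safely block
             putMVar done ()
      let x = sum [1 .. 1000000] :: Integer
      x `par` takeMVar done      -- spark x (unfair: it may run on another
                                 -- worker, or never), then wait on the
                                 -- fair thread
      print x                    -- forces x here if the spark didn't run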
> You have greatly increased the likelihood of race conditions as well (mutable data and multithreading just don't mix). Plus you have all the normal fun you have with C bugs: wild pointers, buffer overruns, etc.
This, however, *is* C's fault. :-)

More on pH: we got our programs to scale, but had trouble going past 8 threads. We found ourselves limited by a non-parallel GC (my fault, but labor-intensive to get right) and by the absence of parallelism in the underlying algorithms. For the latter problem there simply is no magic bullet.

-Jan-Willem Maessen