
On Aug 11, 2007, at 12:35 PM, Brian Hurt wrote:
> You guys might also want to take a look at the Cilk programming language, and how it manages threads. If you know C, learning Cilk is about 2 hours of work, as it's C with half a dozen extra keywords and a few new concepts. I'd love to see Cilk - C + Haskell as a programming language.
It was called pH, and we (meaning Alejandro Caro and myself) implemented it back in the mid/late 90's using Lennart Augustsson's hbcc front end (which he hacked to add a bunch of pH-specific syntax). Arvind and Nikhil wrote a textbook on pH programming.

Still, there are two problems. One is that laziness means you can't actually prove you need something until very close to the time you actually want it. By the time I know that I'm adding f x to g y, it's probably too late to usefully run them in parallel (unless they're both *very* large). We used eager evaluation in pH---to the point that we actually gave up the ability to manipulate infinite lazy data structures. In NDP they've done much the same thing, first instrumenting the program to check that the eagerness they introduce won't disrupt execution. Even the "par" annotation has this feel: we are telling the implementation that it's OK to do some computation even if it isn't yet obvious that we'll need the results.

The second problem is controlling the overhead. More on this below.
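For concreteness, here is the standard GHC idiom behind that "par" annotation (a minimal sketch, not anyone's production code; fib is just a stand-in for an expensive f x and g y, and you'd build it with ghc -threaded plus the parallel package):

    import Control.Parallel (par, pseq)

    -- fib: a deliberately expensive placeholder for the "f x" and
    -- "g y" in the paragraph above.
    fib :: Integer -> Integer
    fib n | n < 2     = n
          | otherwise = fib (n - 1) + fib (n - 2)

    -- Spark the left summand, evaluate the right one ourselves, then
    -- add: both are computed speculatively, before we can "prove" we
    -- need either result.
    parAdd :: Integer -> Integer -> Integer
    parAdd x y = fx `par` (gy `pseq` (fx + gy))
      where fx = fib x
            gy = fib y

    main :: IO ()
    main = print (parAdd 30 31)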
> The key idea of Cilk is that it's easier to deparallelize than it is to parallelize, especially automatically. So the idea is that the program is written to be incredibly parallel, with huge numbers of microthreads, which are (on average) very cheap to spawn. The runtime then deparallelizes the microthreads, running multiple microthreads sequentially within a single real thread (a worker thread). Microthreads that live their entire life within a single real thread are cheap to spawn (as in "not much more expensive than a normal function call" cheap).
The problem here is that while Cilk spawns are incredibly cheap, they're still more expensive than a simple procedure call (2-10x as expensive, if my fading memory serves me right). Let's imagine we have a nice, parallelizable computation that we've expressed using recursive subdivision (the Cilk folks like to use matrix multiplication as an example here). Near the leaves of that computation we still spend the majority of our time paying the overhead of spawning. So we end up actually imposing a depth bound and writing two versions of our computation---the one that spawns, which we use for coarse-grained computations, and the one that doesn't, which we use when the computation gets fine-grained. It makes a really big difference in practice; a sketch of the trick follows below. The programmer is free to use it in any programming language, but we haven't yet figured out a way to *avoid* the need to do so. This continues to worry me to this day, because making the right choices is black magic, specific to a particular combination of algorithm and machine. That said, there is some reason for optimism: the overhead of creating work in Cilk is comparable to the overhead of creating a thunk in Haskell.
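Here is a hedged Haskell sketch of that two-version trick, with GHC sparks standing in for Cilk spawns (sparks play the microthread role: the runtime multiplexes them over a fixed pool of worker threads). The names pfib/sfib and the cutoff of 12 are illustrative choices, not tuned values:

    import Control.Parallel (par, pseq)

    -- Coarse-grained version: sparks parallel work near the root of
    -- the recursion tree, then falls back to the sequential version
    -- once the remaining work is fine-grained.
    pfib :: Int -> Int -> Integer
    pfib depth n
      | n < 2      = fromIntegral n
      | depth <= 0 = sfib n                      -- fine-grained: stop spawning
      | otherwise  = l `par` (r `pseq` (l + r))  -- coarse-grained: spawn
      where l = pfib (depth - 1) (n - 1)
            r = pfib (depth - 1) (n - 2)

    -- Fine-grained version: no spawning, just a plain function.
    sfib :: Int -> Integer
    sfib n | n < 2     = fromIntegral n
           | otherwise = sfib (n - 1) + sfib (n - 2)

    main :: IO ()
    main = print (pfib 12 35)

Picking that depth bound is exactly the black-magic, per-machine tuning mentioned above.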
> The problem that Cilk runs into is that it's, well, C. It doesn't deal with contention well at all: a single microthread blocking blocks the whole worker thread, meaning, among other things, that you can have "false deadlocks", where one microthread blocks on another microthread in the same real thread, and thus appears deadlocked even though it really isn't.
This is actually a fundamental problem with the threading model: work stealing makes no guarantee of fairness, so if you do anything that requires fair scheduling you get into a lot of trouble fast. It's not fair to blame C for this. You have to be very careful to define the interaction between fair IO-style threads and unfair get-my-work-done threads; a small GHC-flavored sketch of that split follows.
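In GHC the split is explicit (offered as an analogy in GHC's terms, not a claim about Cilk): forkIO threads are scheduled roughly fairly and may block, while sparks created by par carry no fairness guarantee at all.

    import Control.Concurrent (forkIO, threadDelay)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
    import Control.Parallel (par)

    main :: IO ()
    main = do
      done <- newEmptyMVar
      _ <- forkIO $ do           -- a fair, IO-style thread...
             threadDelay 1000    -- ...which may safely block
             putMVar done ()
      let x = sum [1 .. 1000000] :: Integer
      x `par` takeMVar done      -- spark x (unfair: it may run on another
                                 -- worker, or never), then wait on the
                                 -- fair thread
      print x                    -- forces x here if the spark didn't run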
> You have greatly increased the likelihood of race conditions as well (mutable data and multithreading just don't mix). Plus you have all the normal fun you have with C bugs: wild pointers, buffer overruns, etc.
This, however, *is* C's fault. :-)

More on pH: we got our programs to scale, but had trouble going past 8 threads. We found ourselves limited by a non-parallel GC (my fault, but labor-intensive to get right) and by the absence of parallelism in the underlying algorithms. For the latter problem there simply is no magic bullet.

-Jan-Willem Maessen