
On Sat, 11 Aug 2007, Sebastian Sylvan wrote:
> How is this any better than using "par" in Haskell?
Mainly in how the threads are actually scheduled. Mind you, I'm an *incredible* Haskell newbie, so take all of my comments with a 5-pound salt block, but as I understand the current implementation of parallel Haskell, all spawned threads go into a communal queue. When a thread blocks, it's put back into the communal queue, and the worker thread selects a new thread to run.

In Cilk, the task structure is more tree-like. When thread A spawns thread B, the worker thread stops executing thread A and starts executing thread B; when thread B exits, the worker thread resumes thread A. So at any given point in time, a worker thread holds a stack of microthreads waiting for their subthreads to complete so they can be resumed- not unlike function calls in other languages, or nested lazy blocks in Haskell.

When a worker thread runs out of work to do- when it has no more threads to execute- it picks another worker thread at random and asks it for work. The other worker thread (assuming it has work to do) then picks the microthread at the bottom of its resume stack- the one farthest from being resumed- and hands that microthread's state over to the asking worker thread.
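To make the stealing discipline concrete, here's a minimal pure sketch of a worker's resume stack modeled as a deque. Everything here (the Task type, the function names) is my own invention for illustration, not anything from Cilk or GHC: the owning worker pushes and pops at one end (newest work), while a thief steals from the other end (the microthread farthest from being resumed).

```haskell
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (|>), ViewL(..), ViewR(..), viewl, viewr)

-- Hypothetical stand-in for a suspended microthread's state.
type Task = String

-- The owning worker pushes new work on the right (newest end).
push :: Seq Task -> Task -> Seq Task
push dq t = dq |> t

-- The owning worker resumes from the right: most recently suspended first.
popLocal :: Seq Task -> Maybe (Task, Seq Task)
popLocal dq = case viewr dq of
  EmptyR    -> Nothing
  rest :> t -> Just (t, rest)

-- A thief steals from the left: the task farthest from being resumed.
steal :: Seq Task -> Maybe (Task, Seq Task)
steal dq = case viewl dq of
  EmptyL    -> Nothing
  t :< rest -> Just (t, rest)
```

The point of taking from opposite ends is that the owner and the thief rarely contend, and the stolen task is the largest remaining chunk of the tree.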
From the user program's perspective, this is no different from the current implementation. Threads get spawned, get executed in some unspecified order, and then complete.
What's interesting (to me, at least) are the optimization opportunities this model provides that the shared-queue model doesn't. Consider the following structural model: we have a two-tiered heap. Each worker thread has its own local heap, which only the microthreads it is managing can see, and there is a global heap holding objects that all microthreads in all worker threads can see. Objects are originally allocated in the local heap. When a microthread starts being executed on another worker thread- i.e. when it is stolen- all objects it can see (reach, in the conservative-GC sense) are promoted to the global heap.

The advantage of all of this is that we can now easily distinguish between objects that can be seen from other real threads and those that can't. All of the mutable data structures- tvars, lazy thunks, everything- can be implemented much more cheaply if you know that no other thread can access them.

Take the example of a parallel map where I'm using a tvar in my map function. The likelihood is that all of the (short-lived) microthreads I'm spawning will execute on the same worker thread- and thus that the tvar will never leave the local heap, and can be optimized down to something close to a single-threaded mvar. I have, via optimization, turned a parallel map and a synchronized tvar back into something approaching a serial map and an mvar.

Brian
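The promotion-on-steal step can be sketched as a pure model. To be clear, the representation below (integer object ids, an explicit tier tag, reference lists) is entirely my own invention for illustration, not any actual runtime structure: on a steal, everything reachable from the stolen microthread's roots moves from Local to Global.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Hypothetical heap model: each object has a tier and outgoing references.
data Tier = Local | Global deriving (Eq, Show)

type ObjId = Int
type Heap  = Map.Map ObjId (Tier, [ObjId])

-- When a microthread is stolen, promote every object reachable from its
-- roots to the global heap; unreachable objects stay local.
promote :: [ObjId] -> Heap -> Heap
promote roots heap = foldr promoteOne heap (Set.toList reachable)
  where
    reachable = go Set.empty roots
    go seen []     = seen
    go seen (o:os)
      | o `Set.member` seen = go seen os
      | otherwise =
          case Map.lookup o heap of
            Nothing        -> go seen os
            Just (_, refs) -> go (Set.insert o seen) (refs ++ os)
    promoteOne o = Map.adjust (\(_, refs) -> (Global, refs)) o
```

Objects the stolen microthread can't reach keep their cheap, single-worker representation, which is exactly the property the optimization relies on.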
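For reference, the kind of parallel map under discussion can be written with the par and pseq combinators that ship in GHC's base library (GHC.Conc); each list element becomes a spark, i.e. one of the short-lived microthreads described above:

```haskell
import GHC.Conc (par, pseq)

-- A sketch of a parallel map: spark the head element while
-- the rest of the list is mapped.
parMap' :: (a -> b) -> [a] -> [b]
parMap' _ []     = []
parMap' f (x:xs) =
  let y  = f x
      ys = parMap' f xs
  in y `par` (ys `pseq` (y : ys))
```

Under the two-tiered heap scheme, if none of these sparks are stolen, any tvar the mapped function touches stays in the worker's local heap and pays no synchronization cost.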