transparent parallelization

Hi there. Being rather new to the realm of Haskell and functional programming, I've been reading about "how much easier it is to parallelize code in a purely functional language" (am I right in saying that?). My knowledge of parallelization is also very weak, but I've been thinking about this and I have a question. Let's say we take a small piece of code, like the quicksort algorithm:

    qsort [] = []
    qsort (x:xs) = qsort lesser ++ [x] ++ qsort greater
      where lesser  = filter (<= x) xs
            greater = filter (> x) xs

(Actually I didn't test this code; this is just for the example.) Here, computing "lesser" and "greater" are two disjoint actions - we don't need the result from one to compute the other, and Haskell does not allow one to alter data in a way that would change the behaviour of the other. So I guess those could be run in parallel.

Would it be fairly simple to automatically determine parts of code that can be run in parallel, and parts that cannot (at least in non-monadic code)?

So the final question is: if it is possible to automatically identify segments of code that can be run in parallel, does [insert compiler name here] compile this code as a single-threaded program, or as a multithreaded version? I guess on single-processor machines it is more efficient to do it as a single thread - but with those "many-core" processors that are coming, that probably won't be true anymore.

Sorry if this post is blowing off open doors (I guess this doesn't mean anything in English but it sounds cool) ...

Regards,
Thomas Girod

On Tue, 2007-09-18 at 18:13 +0200, Thomas Girod wrote:
Would it be fairly simple to automatically determine parts of code that can be run in parallel, and parts that cannot (at least in non-monadic code)?
Detecting parallelism is possible, but generally rather fragile. Coarse-grained parallelism in the form of threads (or processes) is only efficient if enough data can be processed in parallel. This in turn is determined by the data dependencies, which are hard to detect automatically. To preserve program semantics the analysis has to be conservative, i.e., assume that two parts of a program depend on each other unless it can prove otherwise. (OpenMP relies on the programmer to explicitly declare what to parallelize.)

A better way is to let the user specify the algorithm at a higher level. One very promising technique for doing this is explored in Data Parallel Haskell (DPH) [1]. In DPH you specify your algorithm as functional programs operating on vectors; it even allows nested parallelism, i.e., you can call parallel functions inside parallel functions. If implemented naïvely, this could easily lead to inefficiencies due to too little workload per thread. This is where GHC's rewriting capabilities kick in and transform nested parallel programs into flat parallel programs.

I really recommend reading the paper(s) (see [1]).

/ Thomas

[1] http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell
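[Editorial aside: the flavour of the DPH interface can be sketched with ordinary lists standing in for parallel arrays. This is a purely sequential model for illustration only - `mapP`, `zipWithP`, and `sumP` here are plain list functions named after the DPH bulk operations, and `dotp`/`smvm` are the standard DPH motivating examples; in real DPH each bulk operation runs across all cores.]

```haskell
-- Sequential model of the DPH bulk operations; in real DPH these work on
-- parallel arrays [:a:] and every bulk operation runs across all cores.
mapP :: (a -> b) -> [a] -> [b]
mapP = map

zipWithP :: (a -> b -> c) -> [a] -> [b] -> [c]
zipWithP = zipWith

sumP :: Num a => [a] -> a
sumP = sum

-- The classic DPH example: a dot product written as bulk operations.
dotp :: Num a => [a] -> [a] -> a
dotp xs ys = sumP (zipWithP (*) xs ys)

-- Nested parallelism: sparse matrix (rows of (index, value) pairs) times a
-- dense vector -- a parallel computation called inside a parallel map.
smvm :: Num a => [[(Int, a)]] -> [a] -> [a]
smvm rows vec = mapP (\row -> sumP (mapP (\(i, v) -> v * vec !! i) row)) rows

main :: IO ()
main = do
  print (dotp [1,2,3] [4,5,6])                    -- 32
  print (smvm [[(0,2),(2,3)],[(1,1)]] [1,2,3])    -- [11,2]
```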

Wow, this is cool stuff! It would be nice to have something like this for the Playstation 3 :-)

Regarding parallelism, I wonder how this extension will compare to Sun's Fortress language, if/when it finally gets released.

Peter

On Sep 18, 2007, at 4:24 PM,
Regarding parallelism, I wonder how this extension will compare to Sun's Fortress language, if/when it finally gets released.
The scope of the two is very different. DPH proposes a single rather flexible data structure---nested Data Parallel Arrays (really as much list-like as array-like). The underlying data structure is manipulated using bulk operations like zip, sum, and comprehensions.

By contrast, Fortress defines the notion of a "Generator", which you can think of as being akin to a parallel version of Data.Traversable or ListLike, where the fundamental primitive is a generalization of foldP and mapP. This is strictly more general---we can define many of the operations in Data.PArr on arbitrary data types, permitting us to talk about the sum of the elements of a set, or mapping a function across a distributed array. We can define nested data parallel arrays in Fortress. There isn't (yet) an equivalent of the RULES pragma that permits Fortress to optimize combinations of function calls. However, clever uses of type information and function overloading let Fortress do some interesting program transformations of its own (e.g. early exit for reductions with left zeros). Finally, Fortress actually has a mechanism for defining new types of comprehensions (though this isn't in the language specification yet).

The other nice thing about Generators is that we can support consumption of large or infinite things, if we're very careful about how we do the consumption. We're planning to write the equivalent of hGetContents that works over blocks of file data in parallel where possible, but processes streams as chunks of data become available. It remains to be seen how this will work out in practice, though. Our goal is something LazyByteString- or rope-like.

So:

DPH: available today (-ish), one (very flexible) data structure. Bulk operations on a data structure run in parallel. Relies on RULES + special compiler support (am I getting that right? You can fuse multiple traversals of a common argument, which isn't doable using RULES, right?) to make it all run nicely. A well-established performance model, cribbed from NESL, for the PArr bits.

Fortress: arrays and lists currently built in. Bulk operations on a data structure can run in parallel. Ability to define new parallel types with carefully-tailored traversal (e.g. we have a PureList based on Ralf Hinze and Ross Paterson's finger tree, where traversal walks the tree structure in parallel). No optimization RULES yet (an interpreter doesn't optimize), but a good deal of type-based code selection. Implementation less complete than DPH in general (even the Generator API is in flux, though the fundamental use of a foldP-like operation hasn't changed over time).

-Jan-Willem Maessen
Longtime Haskell Hacker
Fortress Library Developer

PS - By the way, working on Generators has increased my suspicion that comprehensions do NOT have a naturally monadic structure (which bugged me when I worked on parallel list traversal optimization in pH back in 1994). It just happens that for cons-lists the two structures happen to coincide. If anyone else has had similarly subversive thoughts I'd love to chat.
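[Editorial aside: the Generator idea described above can be roughly sketched in ordinary Haskell. The class and method names below are invented for illustration and are not Fortress's actual API; the single primitive is an associative reduction, here evaluated sequentially where a real implementation would reduce the two halves of a structure in parallel.]

```haskell
-- A toy rendering of the "Generator" idea: one primitive, an associative
-- reduction, from which the other bulk operations are derived.
class Generator g where
  -- reduce with an associative operator, its identity, and a per-element map
  foldP :: (b -> b -> b) -> b -> (a -> b) -> g a -> b

-- Derived operations work for any Generator, not just lists:
sumG :: (Generator g, Num a) => g a -> a
sumG = foldP (+) 0 id

toListG :: Generator g => g a -> [a]
toListG = foldP (++) [] (\x -> [x])

-- Cons-lists as a Generator.
instance Generator [] where
  foldP op z f = foldr (\x acc -> f x `op` acc) z

-- A binary tree, whose two subtrees could be reduced in parallel.
data Tree a = Leaf | Node (Tree a) a (Tree a)

instance Generator Tree where
  foldP _  z _ Leaf         = z
  foldP op z f (Node l x r) = foldP op z f l `op` (f x `op` foldP op z f r)

main :: IO ()
main = do
  print (sumG [1,2,3,4 :: Int])                                       -- 10
  print (toListG (Node (Node Leaf 1 Leaf) 2 (Node Leaf 3 Leaf) :: Tree Int))
```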

If I recall correctly, a rather neat way of exploiting this property of qsort with Nested Data Parallelism is covered in this talk:

http://www.londonhug.net/2007/05/25/video-of-spjs-talk-is-now-online/

Good food for thought :)

Dave
On 18/09/2007, Thomas Girod wrote:
Here, computing "lesser" and "greater" are two disjoint actions - we don't need the result from one to compute the other, and Haskell does not allow one to alter data in a way that would change the behaviour of the other. So I guess those could be run in parallel.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Tue, 2007-09-18 at 18:13 +0200, Thomas Girod wrote:
Would it be fairly simple to automatically determine parts of code that can be run in parallel, and parts that cannot (at least in non-monadic code)?
The problem with Haskell is not finding opportunities to parallelize; they are legion. Actually, quite the opposite: there is so much opportunity that your code ends up slower than a sequential realization. The hard part is making a good cost model and a good way to create coarser chunks of work. It's not worthwhile to spawn a thread (even a very lightweight one) for virtually every subexpression. Automatic parallelization is easy; efficient parallelization is hard.
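[Editorial aside: the granularity point can be illustrated with a hand-rolled sketch using only forkIO and MVar from base - fork for the two sublists only down to a cutoff depth, so each thread gets a coarse chunk of work instead of one per subexpression. This is a sketch, not an efficient implementation: a real version would force the forked result (e.g. with deepseq) and run on GHC's -threaded runtime.]

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Run two actions, the first on a fresh lightweight thread, and wait for both.
-- NB: with lazy evaluation the forked thread should force its result before
-- putMVar, or the work may migrate back to the consumer; omitted for brevity.
parPair :: IO a -> IO b -> IO (a, b)
parPair left right = do
  box <- newEmptyMVar
  _ <- forkIO (left >>= putMVar box)
  r <- right
  l <- takeMVar box
  return (l, r)

-- Sequential quicksort, as in the original post.
seqSort :: Ord a => [a] -> [a]
seqSort []     = []
seqSort (x:xs) = seqSort (filter (<= x) xs) ++ [x] ++ seqSort (filter (> x) xs)

-- Fork only down to a cutoff depth: at most 2^depth threads, each with a
-- coarse chunk of work, instead of a thread per subexpression.
pqsort :: Ord a => Int -> [a] -> IO [a]
pqsort _     []     = return []
pqsort depth (x:xs)
  | depth <= 0 = return (seqSort (x:xs))   -- too little work left: stay sequential
  | otherwise  = do
      (ls, gs) <- parPair (pqsort (depth - 1) (filter (<= x) xs))
                          (pqsort (depth - 1) (filter (>  x) xs))
      return (ls ++ [x] ++ gs)

main :: IO ()
main = pqsort 3 [5,3,8,1,9,2,7] >>= print   -- [1,2,3,5,7,8,9]
```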

Automatic parallelization is easy, efficient parallelization is hard.
Absolutely. We had another crack at this recently in an upcoming ICFP paper (draft online at http://research.microsoft.com/~tharris/papers/2007-fdip.pdf).

In that paper we start out by collecting profiles of the thunk creation, entry, and update operations for a set of benchmarks, and conduct a limit study of how fast these dependencies would allow them to be evaluated (e.g. ignoring interactions through the GC, cache effects, the costs of sparking thunks, etc.). We then make the system increasingly more practical: (i) setting a lower bound on the compute time of the thunks that we spark, (ii) making predictions of a thunk's compute time when it is allocated, and then (iii) building a real implementation based on GHC 6.6.

The parallel speedup dwindles at each stage: this kind of automated approach looks plausible for using "2n" cores instead of "n", but I'm sceptical about it being able to efficiently exploit "n" cores instead of "1".

Tim
participants (7)

- bf3@telenet.be
- Dave Tapley
- Derek Elkins
- Jan-Willem Maessen
- Thomas Girod
- Thomas Schilling
- Tim Harris (RESEARCH)