Having trouble with parallel Haskell

I intended to write a half chapter or so about parallel programming in Haskell for the book I'm working on, but I've been coming to the conclusion that the time is not yet ripe for this. I hope it will be helpful if I share my experiences here.

Specifically, I was planning to write about parallel programming in pure code: use of "par" and programming with strategies.

The most substantial problem is that the threaded RTS in GHC 6.8.2 is very crashy in the face of "par": about 90% of my runs fail with a segfault or an assertion failure. Simon Marlow mentioned that this bug is fixed, but I've been unsuccessful in building a GHC 6.8.3 release candidate snapshot so far, so I can't verify this.

When a run does go through, I have found it surprisingly difficult to actually get both cores of a dual-core system to show activity. As there are no tools I can use to see what is happening, I am stumped. It is of course quite likely that I am doing something wrong that results in unexpected dependencies, but I cannot find a means of gaining insight into the problem.

As an example, I have a simple parallel quicksort that is very conservative in its sparking, and I get no joy out of it on a dual-core machine: it mostly sticks to a single core. This wouldn't be a problem if I had some way to spot the presumed data dependency that is serialising the code, but no joy.

Bryan O'Sullivan wrote:
| otherwise = rnf lesser `par` (rnf greater `pseq` lesser ++ x:greater)
It seems that `pseq` doesn't have a declared fixity, so it defaults to infixl 9, unlike `seq` and `par`, which are infixr 0. Therefore the above is equivalent to

| otherwise = rnf lesser `par` ((rnf greater `pseq` lesser) ++ x:greater)

(no idea if that's relevant, though) -Isaac
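[A small self-contained illustration of the fixity point Isaac makes, not code from the thread. `sub1` has no fixity declaration, so it defaults to infixl 9 and binds tighter than (+); `sub2` is declared infixr 0, as `seq` and `par` are, so it binds looser. The operator names are made up for the example.]

```haskell
-- sub1 has no fixity declaration: Haskell defaults it to infixl 9,
-- so it binds tighter than (+), which is infixl 6.
sub1 :: Int -> Int -> Int
sub1 = (-)

-- sub2 is declared infixr 0, like `seq` and `par`, so it binds
-- looser than (+).
infixr 0 `sub2`
sub2 :: Int -> Int -> Int
sub2 = (-)

main :: IO ()
main = do
  print (10 `sub1` 2 + 3)   -- parses as (10 `sub1` 2) + 3 = 11
  print (10 `sub2` 2 + 3)   -- parses as 10 `sub2` (2 + 3) = 5
```

The same grouping difference is why the `pseq` in Bryan's guard attaches only to `lesser` rather than to the whole concatenation.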

Isaac Dupree wrote:
(no idea if that's relevant, though)
I can't tell either. It doesn't seem to make a difference, at any rate. By the way, I notice that the threaded RTS blocks signals for long periods of time when I run this program. If I hit control-C, it takes several seconds for the program to notice.

Bryan O'Sullivan wrote:
Isaac Dupree wrote:
(no idea if that's relevant, though)
I can't tell either. It doesn't seem to make a difference, at any rate.
By the way, I notice that the threaded RTS blocks signals for long periods of time when I run this program. If I hit control-C, it takes several seconds for the program to notice.
This is usually a symptom of a program that does a lot of allocation-less computation (or long GCs). When the program isn't allocating, the scheduler never runs, and ^C doesn't get serviced. Allocation-less computation will also hurt parallelism, because load-balancing only happens when the scheduler runs. This often crops up when people try to parallelise small benchmarks (e.g. fib). It's certainly a problem, and we don't have a good solution yet. Cheers, Simon
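[An illustration of the kind of small benchmark Simon describes, written for this note rather than taken from the thread; it needs the parallel package for Control.Parallel, and the cut-off value of 25 is an assumed tuning choice. The naive fib is almost pure computation, so without a cut-off each spark carries trivial work and the scheduler runs rarely.]

```haskell
import Control.Parallel (par, pseq)

-- Naive Fibonacci: a tight, nearly allocation-free computation.
fib :: Int -> Int
fib n | n < 2     = n
      | otherwise = fib (n - 1) + fib (n - 2)

-- Parallel version with a depth cut-off: below the threshold we stay
-- sequential so that each spark represents a non-trivial chunk of work.
parFib :: Int -> Int
parFib n
  | n < 25    = fib n
  | otherwise = x `par` (y `pseq` (x + y))
  where
    x = parFib (n - 1)
    y = parFib (n - 2)

main :: IO ()
main = print (parFib 30)
```

Even with the cut-off, if the sparked work allocates little, load-balancing opportunities are scarce for the reason Simon gives.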

I do not have a platform to try it on; the following is pure speculation. In main, the builder of input is highly lazy, since randoms is. To be sure, the spine of the list is presently forced by printing its length, but the numbers inside the list... It is very bleak because randoms ensures that input!!(n+1) is patently dependent on input!!n, ad infinitum; this should defy all attempts at parallelism... Forcing the numbers themselves before sorting will give a much more conclusive result, success or failure.

trebla:
I do not have a platform to try it on; the following is pure speculation.
In main, the builder of input is highly lazy, since randoms is. To be sure, the spine of the list is presently forced by printing its length, but the numbers inside the list... It is very bleak because randoms ensures that input!!(n+1) is patently dependent on input!!n, ad infinitum; this should defy all attempts at parallelism...
Forcing the numbers themselves before sorting will give a much more conclusive result, success or failure.
I wonder if a fast, strict randoms generator (like mersenne-pure64) would help then. -- Don
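[A sketch of the forcing Albert suggests; the main here is a minimal stand-in of our own, not the code from the book draft. It fully evaluates the random list before the sort, so the element-by-element dependency in randoms is paid off up front. In the GHC 6.8-era API this forcing was spelled with rnf from Control.Parallel.Strategies; deepseq is the later, equivalent spelling and needs the deepseq package.]

```haskell
import Control.DeepSeq (deepseq)
import Data.List (sort)
import System.Random (mkStdGen, randoms)

main :: IO ()
main = do
  let input = take 100000 (randoms (mkStdGen 42)) :: [Int]
  -- Force the spine and every element before the sort starts, so the
  -- sort is not serialised by the generator's chain of dependencies.
  input `deepseq` return ()
  print (length (sort input))
```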

Albert Y. C. Lai wrote:
In main, the builder of input is highly lazy, since randoms is. To be sure, the spine of the list is presently forced by printing its length, but the numbers inside the list... It is very bleak because randoms ensures that input!!(n+1) is patently dependent on input!!n, ad infinitum; this should defy all attempts at parallelism...
Hmm. I forced the elements of the list with rnf, and it didn't really make a difference. I now get perhaps 115% CPU usage during a sort. That is a small improvement in parallelism, but it also causes the program to take about 15% longer to run :-)

Bryan O'Sullivan wrote:
I intended to write a half chapter or so about parallel programming in Haskell for the book I'm working on, but I've been coming to the conclusion that the time is not yet ripe for this. I hope it will be helpful if I share my experiences here.
Specifically, I was planning to write about parallel programming in pure code: use of "par" and programming with strategies.
The most substantial problem is that the threaded RTS in GHC 6.8.2 is very crashy in the face of "par": about 90% of my runs fail with a segfault or an assertion failure. Simon Marlow mentioned that this bug is fixed, but I've been unsuccessful in building a GHC 6.8.3 release candidate snapshot so far, so I can't verify this.
When a run does go through, I have found it surprisingly difficult to actually get both cores of a dual-core system to show activity. As there are no tools I can use to see what is happening, I am stumped. It is of course quite likely that I am doing something wrong that results in unexpected dependencies, but I cannot find a means to gaining insight into the problem.
As an example, I have a simple parallel quicksort that is very conservative in its sparking, and I get no joy out of it on a dual-core machine: it mostly sticks to a single core. This wouldn't be a problem if I had some way to spot the presumed data dependency that is serialising the code, but no joy.
I don't have a great deal of time to look at this right now, but I had a bit more luck with this version:

parSort :: (NFData a, Ord a) => Int -> [a] -> [a]
parSort d list@(x:xs)
  | d <= 0    = sort list
  | otherwise = rnf below `pseq` (
                rnf above `pseq` (
                rnf lesser `par` (
                rnf greater `pseq` (
                lesser ++ x:greater))))
  where
    below   = [y | y <- xs, y < x]
    above   = [y | y <- xs, y >= x]
    lesser  = parSort d' below
    greater = parSort d' above
    d'      = d - 1
parSort _ _ = []

I imagine the problem is mainly to do with nested pars: in the original code, the left-hand par evaluates immediately into another par, and so on: there's no actual *work* being done in each spark. Which is why I tried to do some evaluation before each par above. It still doesn't achieve much speedup, though.

Incidentally, this kind of recursive parallelism is typically much harder to get any benefit from than, for example, a simple top-level parMap. If you're just trying to introduce parallelism, it might be worthwhile sticking to the easy cases: simple strategies, and speeding up concurrent programs.

Also, you might notice that the GC is doing a lot of work in this example (first port of call is always +RTS -sstderr). I tried the parallel GC, but it didn't help at all, I guess because the heap is full of lists and the parallel GC likes trees.

The biggest problem, as you note, is the lack of tools for investigating parallel performance. This is high on our priority list; we have a couple of projects planned over the summer to improve the situation. Cheers, Simon
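[A sketch of the "simple top-level parMap" Simon recommends, written for this note and not taken from the thread. It uses parMap from Control.Parallel.Strategies (in the parallel package); in the GHC 6.8-era API the element strategy was spelled rnf, and rdeepseq is the later name. `expensive` is a made-up stand-in workload.]

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)

-- A stand-in for a per-element workload worth parallelising.
expensive :: Int -> Int
expensive n = sum [1 .. n]

main :: IO ()
main =
  -- Each element of the list is evaluated to normal form in parallel;
  -- the parallel structure is flat, so every spark carries real work.
  print (sum (parMap rdeepseq expensive [1000, 2000 .. 20000]))
```

This flat structure avoids the nested-par problem: there is no chain of sparks that each immediately delegate to another spark.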

Just a "metoo": a nicely working "Parallel Haskell" would be a very strong selling point, and I'm especially thinking of education here. If I could show to my students that just by inserting a few 'par' (or similar) annotations in the source they can easily and fully exploit multi-core hardware, then they sure would take that course on Declarative Programming... Best regards, Johannes Waldmann.

The most substantial problem is that the threaded RTS in GHC 6.8.2 is very crashy in the face of "par": about 90% of my runs fail with a segfault or an assertion failure. Simon Marlow mentioned that this bug is fixed, but I've been unsuccessful in building a GHC 6.8.3 release candidate snapshot so far, so I can't verify this.
There's a 6.8.3 release candidate you might be able to try. GHC shouldn't crash, ever. So everyone, do please file bug reports with test cases for parallel programs that crash. (But probably not for 6.8.2!) Simon

Bryan O'Sullivan wrote:
Simon Peyton-Jones wrote:
There's a 6.8.3 release candidate you might be able to try.
Yes, I reported a bug and Simon mentioned that it ought to be fixed already. I've been trying to build snapshots for about a week :-)
And we appreciate the bug reports. Is it just the Haddock 2 problem you're having now? Cheers, Simon

Simon Marlow wrote:
Is it just the Haddock 2 problem you're having now?
I switched back to Haddock 0.8, and found a new problem at installation time: http://hackage.haskell.org/trac/ghc/ticket/2343 This is a change in behaviour from 6.8.2, and I'm not really sure what to do about it.
participants (7)
- Albert Y. C. Lai
- Bryan O'Sullivan
- Don Stewart
- Isaac Dupree
- Johannes Waldmann
- Simon Marlow
- Simon Peyton-Jones