
On Thursday 24 December 2009 09:09:41 you wrote:
The mergesort example given here has a rather nasty problem – it creates sparks for merging the two lists at the same time as it creates the lists, so there's always at least one task blocked waiting for its producers.
I see. So the sparks either contend for locks on the thunks to force that list or they repeat the same work.
This variation on a theme goes roughly twice as fast for me with -N1, and *does* get a speedup from -N<somethinghigher>:
mergesort []    = []
mergesort [x]   = [x]
mergesort [x,y] = if x < y then [x,y] else [y,x]
mergesort xs    = (forceList sleft) `par` (forceList sright) `pseq` merge sleft sright
  where
    (left, right) = split (length xs `div` 2) xs
    sleft  = mergesort left
    sright = mergesort right
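(The snippet relies on `forceList`, `split` and `merge`, which weren't included in the message. Plausible base-only definitions – my assumption, not from the original post – would be:)

```haskell
-- Force the spine and elements of a list to WHNF, so the spark
-- does the real sorting work instead of just building a thunk.
forceList :: [a] -> ()
forceList []     = ()
forceList (x:xs) = x `seq` forceList xs

-- Presumably just splitAt under another name.
split :: Int -> [a] -> ([a], [a])
split = splitAt

-- The standard merge of two sorted lists.
merge :: Ord a => [a] -> [a] -> [a]
merge [] ys = ys
merge xs [] = xs
merge (x:xs) (y:ys)
  | x <= y    = x : merge xs (y:ys)
  | otherwise = y : merge (x:xs) ys
```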
Note the `pseq` not `par` before the merge.
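(For reference, a minimal illustration of the operational difference – not from the original post: ``a `par` b`` creates a spark for a and returns b immediately, while ``a `pseq` b`` evaluates a to WHNF before returning b, so the current thread does half the work itself instead of racing the sparks.)

```haskell
import GHC.Conc (par, pseq)  -- both exported from base

-- Spark the evaluation of sx for another capability, evaluate sy
-- on the current thread, and only then combine the two results.
parallelSum :: [Int] -> [Int] -> Int
parallelSum xs ys =
  let sx = sum xs
      sy = sum ys
  in sx `par` (sy `pseq` (sx + sy))
```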
Can you explain why that helps?
I also added one more short-circuit case – we want to avoid spawning threads for tiny tasks like sorting 1-element lists.
This is something that concerns me. Lots of discussions of parallelism, including the Haskell literature I have read, neglect this critical problem of making sure that more time is spent doing useful work than spawning tasks (sparks). How is this solved in Haskell? Do you just put magic numbers in that work on the machine you're currently using? Also, how much work does a Haskell thread usually have to do to make it worth sparking?
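(One common idiom – a sketch of my own, not an answer from the thread – is exactly a magic-number cutoff: below some input size you fall back to sequential code and create no sparks at all, and yes, the threshold is typically tuned empirically per machine. The value below is an illustrative guess.)

```haskell
import GHC.Conc (par, pseq)
import Data.List (sort)

-- Granularity control: only spark when the sublist is big enough
-- to be worth the scheduling overhead. 100 is a placeholder value
-- that would need tuning for a given machine.
threshold :: Int
threshold = 100

psort :: Ord a => [a] -> [a]
psort xs
  | n < threshold = sort xs  -- too small: plain sequential sort
  | otherwise     = forceList l' `par` (forceList r' `pseq` merge l' r')
  where
    n      = length xs
    (l, r) = splitAt (n `div` 2) xs
    l'     = psort l
    r'     = psort r

forceList :: [a] -> ()
forceList []     = ()
forceList (x:xs) = x `seq` forceList xs

merge :: Ord a => [a] -> [a] -> [a]
merge [] ys = ys
merge xs [] = xs
merge (x:xs) (y:ys)
  | x <= y    = x : merge xs (y:ys)
  | otherwise = y : merge (x:xs) ys
```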
Here's my results on a dual core running OS X:
LappyBob:~ tatd2$ time ./test +RTS -N1
-0.117109518058233

real	0m4.608s
user	0m4.400s
sys	0m0.189s

LappyBob:~ tatd2$ time ./test +RTS -N2
-0.117109518058233

real	0m3.648s
user	0m6.360s
sys	0m0.220s

LappyBob:~ tatd2$ time ./test +RTS -N3
-0.117109518058233

real	0m50.679s
user	1m24.235s
sys	0m0.620s
The last result is still off the wall, but it's a lot better.
That's another question: why did you get such awful performance at N=3, while I get similar results at N=8 on this 8-core machine?

--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e