
I took the parallel merge sort described here:

    http://martin.sulzmann.googlepages.com/AMP-Spring2008-intro.pdf

and added a simple test:

    main :: IO ()
    main = putStrLn $ show $ sum $ mergesort [sin x | x <- [1.0 .. 1000000.0]]

I compiled it with GHC 6.8.2 from Debian testing and ran it with various +RTS -N<n> arguments to measure its performance, obtaining the following results on an 8-core machine:

    $ ghc --make -O2 -threaded mergesort.hs -o mergesort
    [1 of 1] Compiling Main             ( mergesort.hs, mergesort.o )
    Linking mergesort ...

    $ time ./mergesort +RTS -N1
    -0.117109518058233
    real    0m9.723s
    user    0m9.461s
    sys     0m0.232s

    $ time ./mergesort +RTS -N2
    -0.117109518058233
    real    0m13.574s
    user    0m15.225s
    sys     0m0.140s

    $ time ./mergesort +RTS -N3
    -0.117109518058233
    real    0m13.185s
    user    0m15.529s
    sys     0m0.184s

    $ time ./mergesort +RTS -N4
    -0.117109518058233
    real    0m13.251s
    user    0m15.829s
    sys     0m0.336s

    $ time ./mergesort +RTS -N5
    -0.117109518058233
    real    0m13.093s
    user    0m16.929s
    sys     0m0.164s

    $ time ./mergesort +RTS -N6
    Segmentation fault
    real    0m5.711s
    user    0m7.408s
    sys     0m0.136s

That segfault must be due to a bug in GHC, so I thought a newer version of GHC might fix it. I installed GHC 6.10.4 from Sid, but now I get:

    $ ghc --make -O2 -threaded mergesort.hs -o mergesort
    Binary: Int64 truncated to fit in 32 bit Int
    ghc: panic! (the 'impossible' happened)
      (GHC version 6.10.4 for i386-unknown-linux):
            Prelude.chr: bad argument

    Please report this as a GHC bug:
    http://www.haskell.org/ghc/reportabug

I'll try the haskell-platform package next. Are these known problems?

--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e
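For reference, the style of parallel mergesort under discussion looks roughly like the sketch below. This is a reconstruction from the descriptions later in this thread (a length-based split and a forcelist pass sparked with par), not the actual code from the slides or the paste:

    import Control.Parallel (par, pseq)

    -- Force every element of a list to weak head normal form.
    forceList :: [a] -> ()
    forceList []     = ()
    forceList (x:xs) = x `pseq` forceList xs

    -- Merge two sorted lists.
    merge :: Ord a => [a] -> [a] -> [a]
    merge []     ys = ys
    merge xs     [] = xs
    merge (x:xs) (y:ys)
      | x <= y    = x : merge xs (y:ys)
      | otherwise = y : merge (x:xs) ys

    -- Sort the two halves, sparking the forcing of one half, then merge.
    mergesort :: Ord a => [a] -> [a]
    mergesort []  = []
    mergesort [x] = [x]
    mergesort xs  =
        forceList left `par` (forceList right `pseq` merge left right)
      where
        (as, bs) = splitAt (length xs `div` 2) xs   -- forces the whole spine
        left     = mergesort as
        right    = mergesort bs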

Hi Jon,
You're more likely to get a useful response over on the glasgow-haskell-users mailing list, as it looks like this is a GHC-specific bug: http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

Running this with -N6 doesn't want to complete for me, so I'm not sure whether it would segfault. I'm on a 64-bit dual-core Intel Mac running OS X with GHC 6.12.
With such a small example as this, your best bet would be to file a bug report:
http://hackage.haskell.org/trac/ghc/
You'll need to use the login guest/guest to file a new ticket.
Antoine
On Wed, Dec 23, 2009 at 9:14 PM, Jon Harrop wrote:
> [original message quoted above; snipped]

On Thu, Dec 24, 2009 at 12:16 AM, Antoine Latter wrote:
> [previous reply quoted above; snipped]
Here's a paste-bin link for the code in question: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=14871#a14871

On Thu, Dec 24, 2009 at 12:22 AM, Antoine Latter wrote:
> [previous message quoted above; snipped]
-N6 finally completed. I'm using GHC 6.12 on a dual-core Intel MacBook, so I'm not sure how comparable my setup is to what you tried.

Antoine

Antoine Latter wrote:
> Here's a paste-bin link for the code in question: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=14871#a14871
I am not familiar with Control.Parallel at all, so someone please correct me if I'm wrong...

...but that parallel mergesort looks very wrong to me.

In the first place, the split is using the length of the list, which is going to force the spine of the list, leaving a bazillion thunks unless the runtime evaluates the numerical value at each element.

More importantly, it seems to be running the forcelist in parallel with the merge; forcelist is not going to do anything useful after the first pass over each element, and the merge has a data dependency on the split sub-lists, which is going to limit parallelism quite a bit.

Am I wrong here?

(Yep, Jedaï, I'm familiar with Jon and the reason he posted this here rather than the cafe.)

--
Tommy M. McGuire
mcguire@crsr.net
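If the length-based split is indeed part of the problem, one spine-friendly alternative is to deal elements out alternately instead of computing the length first. A hypothetical replacement, not code from the paste:

    -- Split by dealing elements into two piles, so neither half
    -- requires traversing the entire spine before any work can start.
    splitAlt :: [a] -> ([a], [a])
    splitAlt []       = ([], [])
    splitAlt [x]      = ([x], [])
    splitAlt (x:y:zs) = (x : xs, y : ys)
      where (xs, ys) = splitAlt zs

For mergesort, losing the original order of the two halves is harmless, since both halves are sorted before merging anyway.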

On Saturday 26 December 2009 18:00:49 Tommy M. McGuire wrote:
> [Tommy's analysis quoted above; snipped]
The parallelism is obviously not obtaining any real speedup, so something is wrong. But can anyone fix it?

--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e

Jon,

Jon Harrop wrote:
> This is something that concerns me. Lots of discussions of parallelism, including the Haskell literature I have read, neglect this critical problem of making sure that more time is spent doing useful work than spawning tasks (sparks). How is this solved in Haskell? Do you just put magic numbers in that work on the machine you're currently using?

It is simply not true that Haskell literature neglects the question of spark granularity - this is very basic and is always covered. Read "Real World Haskell" (available free online). There's no 'magic number': you must explicitly write your code to give the right granule size. I don't know the exact cost of sparking, but in my experience it is quite small, so as long as your granule is doing *some* real work, it should speed up.

Jon Harrop wrote:
> The parallelism is obviously not obtaining any real speedup, so something is wrong. But can anyone fix it?
I've spent a bit of time measuring parallel speedup on real commercial projects, and this is what I've found:

1. ghc-6.12 performs significantly better than ghc-6.10 and has now been released, so don't use ghc-6.10.

2. The defaults generally work better than giving huge heap sizes. Your -K100000000 - the maximum stack size per thread - will either do nothing or cause an artificial slowdown (I have observed this with the minimum heap size option). Don't use it unless experimentation proves it makes things better.

3. +RTS -s is well worth using. It breaks the time down into MUT (mutator) and GC (garbage collector).

4. MUT parallelization is excellent, but the parallel GC is not so good. If your code is GC-heavy it can spend around half of its time garbage collecting, which doesn't parallelize so well, and this eats into the speedup.

5. There seems to be a scalability problem with the parallel GC for larger numbers of cores (namely 8). I am guessing somewhat, but my experiments tend to confirm the issue raised in a recent paper by Simon Marlow (the implementor of GHC's parallelism): it has to do with "stopping the world" for GC.

If GHC's lack of perfection at this point in time makes Haskell "look bad", I don't mind. I am not selling anything, so the reader at least knows they're getting the truth. I see this as one of the great advantages of open source.

Progress on GHC has been very rapid in the last couple of years, so I know we'll continue to see the speed of GHC's parallelism improving in leaps and bounds. It's actually still quite a new area, considering the difficulty of some of the technical issues and how recently multicores became widely available on consumer hardware. I know you OCaml/F# guys are making great progress too.

Steve
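To make the granularity advice concrete: the usual pattern is a depth threshold, sparking only near the top of the recursion so that every spark does a substantial chunk of sequential work. A minimal self-contained sketch of the technique (an illustration, not code from this thread):

    import Control.Parallel (par, pseq)
    import Data.List (sort)

    forceList :: [a] -> ()
    forceList []     = ()
    forceList (x:xs) = x `pseq` forceList xs

    merge :: Ord a => [a] -> [a] -> [a]
    merge []     ys = ys
    merge xs     [] = xs
    merge (x:xs) (y:ys)
      | x <= y    = x : merge xs (y:ys)
      | otherwise = y : merge (x:xs) ys

    -- Spark down to the given depth, then fall back to a sequential
    -- sort, so each granule below the threshold is one whole sort.
    parMergesort :: Ord a => Int -> [a] -> [a]
    parMergesort _ []  = []
    parMergesort _ [x] = [x]
    parMergesort depth xs
      | depth <= 0 = sort xs
      | otherwise  =
          forceList left `par` (forceList right `pseq` merge left right)
      where
        (as, bs) = splitAt (length xs `div` 2) xs
        left     = parMergesort (depth - 1) as
        right    = parMergesort (depth - 1) bs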

On Sunday 27 December 2009 20:56:51 you wrote:
> Jon Harrop wrote:
>> This is something that concerns me. Lots of discussions of parallelism, including the Haskell literature I have read, neglect this critical problem of making sure that more time is spent doing useful work than spawning tasks (sparks). How is this solved in Haskell? Do you just put magic numbers in that work on the machine you're currently using?

> It is simply not true that Haskell literature neglects the question of spark granularity - this is very basic and is always covered. Read "Real World Haskell" (available free online). There's no 'magic number': you must explicitly write your code to give the right granule size.
There is no "right granule" size. That's the whole point: the optimum is a function of the machine. If you hardcode the granularity then your code isn't future proof and isn't portable. From chapter 24 of Real World Haskell on sorting: "At this fine granularity, the cost of using par outweighs any possible usefulness. To reduce this effect, we switch to our non-parallel sort after passing some threshold." From the Sorting.hs file, parSort2 accepts a threshold "d": parSort2 :: (Ord a) => Int -> [a] -> [a] parSort2 d list@(x:xs) | d <= 0 = sort list | otherwise = force greater `par` (force lesser `pseq` (lesser ++ x:greater)) where lesser = parSort2 d' [y | y <- xs, y < x] greater = parSort2 d' [y | y <- xs, y >= x] d' = d - 1 parSort2 _ _ = [] From the SortMain.hs file, it is always invoked with the magic number "2": testFunction = parSort2 2 Moreover, their approach of subdividing a fixed number of times is suboptimal because it inhibits load balancing. Later, about parallelized IO, they give the code: chunkedReadWith :: (NFData a) => ([LB.ByteString] -> a) -> FilePath -> IO a chunkedReadWith func path = withChunks (lineChunks (numCapabilities * 4)) func path where "4" is one magic number that gets multiplied by the magic number the user supplied via the +RTS -N<n> command-line option. They make no attempt to adapt the granularity to the machine at all and rely entirely upon magic numbers. Consequently, their parallel sort that got a 25% speedup on two cores achieves a 30% slowdown on my 8 core.
> I don't know the exact cost of sparking, but in my experience it is quite small, so as long as your granule is doing *some* real work, it should speed up.
Can you quantify it, e.g. how many FLOPS?
> Jon Harrop wrote:
>> The parallelism is obviously not obtaining any real speedup, so something is wrong. But can anyone fix it?

> I've spent a bit of time measuring parallel speedup on real commercial projects, and this is what I've found:

> 1. ghc-6.12 performs significantly better than ghc-6.10 and has now been released, so don't use ghc-6.10.
Ok.
> 2. The defaults generally work better than giving huge heap sizes. Your -K100000000 - the maximum stack size per thread - will either do nothing or cause an artificial slowdown (I have observed this with the minimum heap size option). Don't use it unless experimentation proves it makes things better.
On the contrary, the Haskell program dies without it:

    $ time ./mergesort +RTS -N8
    Stack space overflow: current size 8388608 bytes.
    Use `+RTS -Ksize' to increase it.
    real    0m33.320s
    user    3m29.397s
    sys     0m0.592s

I had to add that -K command line option just to get the program to run to completion.
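As an aside, a stack overflow in this program does not necessarily come from the sort itself: summing a million elements with a lazy left fold builds a long chain of thunks whose eventual evaluation can blow the stack. If that is the cause here - which would need checking against the actual paste - the usual fix is a strict fold rather than a bigger -K:

    import Data.List (foldl')

    -- Strict left fold: forces the accumulator at each step, so the
    -- sum runs in constant stack space.
    sumStrict :: Num a => [a] -> a
    sumStrict = foldl' (+) 0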
> 3. +RTS -s is well worth using. It breaks the time down into MUT (mutator) and GC (garbage collector).
It says 78.5% GC time (with GHC 6.10).
> 4. MUT parallelization is excellent, but the parallel GC is not so good. If your code is GC-heavy it can spend around half of its time garbage collecting, which doesn't parallelize so well, and this eats into the speedup.
Ok.
> 5. There seems to be a scalability problem with the parallel GC for larger numbers of cores (namely 8). I am guessing somewhat, but my experiments tend to confirm the issue raised in a recent paper by Simon Marlow (the implementor of GHC's parallelism): it has to do with "stopping the world" for GC.
Do you mean this bug: http://hackage.haskell.org/trac/ghc/ticket/3553
> If GHC's lack of perfection at this point in time makes Haskell "look bad", I don't mind. I am not selling anything, so the reader at least knows they're getting the truth. I see this as one of the great advantages of open source.
I'm sure we'd all rather see speedups. :-)
> Progress on GHC has been very rapid in the last couple of years, so I know we'll continue to see the speed of GHC's parallelism improving in leaps and bounds. It's actually still quite a new area, considering the difficulty of some of the technical issues and how recently multicores became widely available on consumer hardware.
My original intent was to test a claim someone made: that mergesort in Haskell is competitively performant and trivially parallelized.
> I know you OCaml/F# guys are making great progress too.
F# is production-ready, but OCaml is dead in the water when it comes to multicore. I'm working on an OCaml replacement called HLVM, but it is early days yet.

--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e