New subject: [Haskell-beginners] Performance of parallel mergesort

28 Dec 2009

      This discussion definitely does not belong on the
Haskell-Beginners list. Besides not being a topic
for beginners, being there is keeping it under the
radar of many or most of the people who work on
these things in Haskell.

I am moving the discussion to the GHC users list,
as suggested by Antoine. For those joining us in
progress, the first part of the thread starts here:

http://www.haskell.org/pipermail/beginners/2009-December/003045.html

On Mon, Dec 28, 2009 at 5:19 AM, Jon Harrop wrote:
...
On Sunday 27 December 2009 20:56:51 Stephen Blackheath wrote:
...
Jon Harrop wrote:
...
This is something that concerns me. Lots of discussions of parallelism,
including the Haskell literature I have read, neglect this critical
problem
of making sure that more time is spent doing useful work than spawning
tasks
(sparks). How is this solved in Haskell? Do you just put magic numbers in
that work on the machine you're currently using?
It is simply not true that Haskell literature neglects the question of
spark granularity - this is very basic and is always covered. Read
"Real World Haskell" (available free online).  There's no 'magic
number'. You must explicitly write your code to give the right granule
size.
There is no "right granule" size. That's the whole point: the optimum is a
function of the machine. If you hardcode the granularity then your code isn't
future proof and isn't portable.
From chapter 24 of Real World Haskell on sorting:
 "At this fine granularity, the cost of using par outweighs any possible
usefulness. To reduce this effect, we switch to our non-parallel sort after
passing some threshold."
From the Sorting.hs file, parSort2 accepts a threshold "d":
 parSort2 :: (Ord a) => Int -> [a] -> [a]
 parSort2 d list@(x:xs)
   | d <= 0     = sort list
   | otherwise = force greater `par` (force lesser `pseq`
                                     (lesser ++ x:greater))
       where lesser      = parSort2 d' [y | y <- xs, y <  x]
             greater     = parSort2 d' [y | y <- xs, y >= x]
             d' = d - 1
 parSort2 _ _              = []
From the SortMain.hs file, it is always invoked with the magic number "2":
 testFunction = parSort2 2
Moreover, their approach of subdividing a fixed number of times is suboptimal
because it inhibits load balancing.
Later, about parallelized IO, they give the code:
 chunkedReadWith :: (NFData a) =>
                   ([LB.ByteString] -> a) -> FilePath -> IO a
 chunkedReadWith func path =
     withChunks (lineChunks (numCapabilities * 4)) func path
where "4" is one magic number that gets multiplied by the magic number the
user supplied via the +RTS -N<n> command-line option.
They make no attempt to adapt the granularity to the machine at all and rely
entirely upon magic numbers. Consequently, their parallel sort that got a 25%
speedup on two cores achieves a 30% slowdown on my 8 core.
...
I don't know the exact cost of sparking, but in my experience it
is quite small - so - as long as your granule is doing *some* real work,
it should speed up.
Can you quantify it, e.g. How many FLOPS?
...
Jon Harrop wrote:
...
The parallelism is obviously not obtaining any real speedup so something
is wrong. But can anyone fix it?
I've spent a bit of time measuring parallel speedup on real commercial
projects, and this is what I've found:
1. ghc-6.12 performs significantly better than ghc-6.10, and has now
been released, therefore don't use ghc-6.10.
Ok.
...
2. The defaults generally work better than giving huge heap sizes.  Your
-K100000000 - maximum heap size per thread - will either do nothing or
cause an artificial slowdown (I have observed this with the minimum heap
size option).  Don't use it, unless experimentation proves it makes
things better.
On the contrary, the Haskell program dies without it:
$ time ./mergesort +RTS -N8
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize' to increase it.
[1]+  Done                    kwrite mergesort.hs
real    0m33.320s
user    3m29.397s
sys     0m0.592s
I had to add that -K command line option just to get the program to run to
completion.
...
3. +RTS -s is well worth using. It breaks the time down into MUT
(mutator) and GC (garbage collector).
Its says 78.5% GC time (with GHC 6.10).
...
4. MUT parallelization is excellent, but the parallel GC is not so good.
 If your code is GC-heavy it can spend around half of its time garbage
collecting, which doesn't parallelize so well, and this eats into the
speedup.
Ok.
...
5. There seems to be a scalability problem with the parallel gc for
larger numbers of cores (namely 8).  I am guessing somewhat, but my
experiments tend to confirm the issue raised in Simon Marlow's (the
implementor of GHC parallelization) recent paper that it's to do with
"stopping the world" for gc.
Do you mean this bug:
 http://hackage.haskell.org/trac/ghc/ticket/3553
...
If GHC's lack of perfection at this point in time makes Haskell "look
bad" I don't mind.  I am not selling anything, so the reader at least
knows they're getting the truth.  I see this as one of the great
advantages of open source.
I'm sure we'd all rather see speedups. :-)
...
Progress on GHC has been very rapid in the last couple of years, and so
I know we'll continue to see the speed of GHC's parallelism improving in
leaps and bounds.  It's actually still quite a new area, considering the
difficulty of some of the technical issues and how recent it is that
multicores are widely available on consumer hardware.
My original intent was to test a claim someone made: that mergesort in Haskell
is competitively performant and trivially parallelized.
...
I know you OCaml/F# guys are making great progress too.
F# is production ready but OCaml is dead in the water when it comes to
multicore. I'm working on an OCaml replacement called HLVM but it is early
days yet.
--
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e
_______________________________________________
Beginners mailing list
Beginners@haskell.org
http://www.haskell.org/mailman/listinfo/beginners

Re: [Haskell-beginners] Performance of parallel mergesort

Yitzchak Gale

Jon Harrop

Antoine Latter

Simon Marlow

Simon Marlow

tags

participants (4)