Re: [Haskell-cafe] SMP parallelism increasing GC time dramatically

10 Oct 2011


      On 08/10/2011 01:47, austin seipp wrote:
...
It's GHC, and partly the OS scheduler in some sense. Oversaturating,
i.e. using an -N option>  your number of logical cores (including
hyperthreads) will slow down your program typically. This isn't
uncommon, and is well known - GHC's lightweight threads have an M:N
threading model, but for good performance, you typically want the OS
threads and cores to have a 1:1 correspondence. Creating a massive
amount of OS threads will cause much context switching between them,
which is going to slow things down. When GHC needs to synchronize OS
threads in order to do a GC, there are going to be a lot of threads
being context swapped in/out in order to achieve this (because the GC
must halt all mutator threads to do its thing.)
Furthermore, oversaturation isn't the only problem - having the same
number of threads as cores will mean *some* thread is going to get
de-scheduled. Many times this means that the thread in which GHC does
GC will get descheduled by the OS. A corollary of this descheduling
phenomenon is that even using the same # of OS threads as you have
cores could result in -worse- performance than N-1 OS threads. This
was mitigated a bit in 7.2.1 I think. Linux was affected much more
drastically than others (OS X and Solaris have vastly different
schedulers, and as a result the performance wouldn't just tank - it
would actually get better IIRC, it just wouldn't scale as well at that
point.) At the very least, on Linux, using an -N option equivalent to
your number of logical cores should not drastically slow things down
anymore - but it won't make them faster either. This is the "dreaded
last core slowdown" bug that's been known about for a while, and as a
result, you typically only see parallel speedup on Linux up to N-1
threads, where N = the number of cores you have.
Incidentally, I don't think that's true any more with recent versions of 
GHC and Linux, I typically see speedup increasing all the way to the 
total number of cores, although sometimes the speedup when adding the 
last core is less.  Take a look at the graphs in our recent papers for 
some concrete results.
...
As a result, with dual-core only (and no hyperthreading,) on Linux,
you're very unlikely to be able to get parallel speedup in any case.
There's work to fix this in the garbage collector among other things,
but it's not clear if it's going into GHC just yet.
It probably depends on how much other activity is happening on the 
system.  I get pretty good speedups for most benchmarks I try on my 
dual-core laptop running either Linux or Windows.  Typically with 
Windows I have to wait a while after booting for all the background 
activity to die down, before I can use both cores reliably.

Cheers,
	Simon
...
On Fri, Oct 7, 2011 at 2:31 PM, Oliver Batchelor  wrote:
...
I'm not sure if this is at all related, but if I run a small Repa program
with more threads than I have cores/CPUs then it gets drastically slower, I
have a dual core laptop - and -N2 makes my small program take approximately
0.6 of the time. Increasing to -N4 and we're running about 2x the time, -N8
and it's taking 20x or more. I guess this is probably more down to the
design of Repa rather than GHC itself?
Oliver
On Sat, Oct 8, 2011 at 1:21 AM, Tom Thorne
wrote:
...
I have made a dummy program that seems to exhibit the same GC
slowdown behavior, minus the segmentation faults. Compiling with -threaded
and running with -N12 I get very bad performance (3x slower than -N1),
running with -N12 -qg it runs approximately 3 times faster than -N1. I don't
know if I should submit this as a bug or not? I'd certainly like to know why
this is happening!
import Numeric.LinearAlgebra
import Numeric.GSL.Special.Gamma
import Control.Parallel.Strategies
import Control.Monad
import Data.IORef
import Data.Random
import Data.Random.Source.PureMT
import Debug.Trace
--
subsets s n = (subsets_stream s) !! n
subsets_stream [] = [[]] : repeat []
subsets_stream (x:xs) =
let r = subsets_stream xs
    s = map (map (x:)) r
in [[]] : zipWith (++) s (tail r)
testfun :: Matrix Double ->  Int ->  [Int] ->  Double
testfun x k cs = lngamma (det (c+u))
where
(m,c) = meanCov xx
m' = fromRows [m]
u = (trans m')<>  m'
xx = fromColumns ( [(toColumns x)!!i] ++ [(toColumns x)!!j] ++ [(toColumns
x)!!k] )
i = cs !! 0
j = cs !! 1
test :: Matrix Double ->  Int ->  Double
test x i = sum p
where
p = parMap (rdeepseq) (testfun x (i+1)) (subsets [0..i] 2)
ranMatrix :: Int ->  RVar (Matrix Double)
ranMatrix n = do
xs<- mapM (\_ ->  mapM (\_ ->  uniform 0 1.0) [1..n]) [1..n]
return (fromLists xs)
loop :: Int ->  Double ->  Int ->  RVar Double
loop n s i = traceShow i $ do
x<- ranMatrix n
let r = sum $ parMap (rdeepseq) (test x) [2..(n-2)]
return (r+s)
main = do
let n = 100
let iter = 5
rng<- newPureMT
rngr<- newIORef rng
p<- runRVar (foldM (loop n) 0.0 [1..iter]) rngr
print p
I have also found that the segmentation faults in my code disappear if I
switch from Control.Parallel to Control.Monad.Par, which is quite strange. I
get slightly better performance with Control.Parallel when it completes
without a seg. fault, and the frequency with which it does so seems to
depend on the number of sparks that are being created.
On Thu, Oct 6, 2011 at 1:56 PM, Tom Thorne
wrote:
...
I'm trying to narrow it down so that I can submit a meaningful bug
report, and it seems to be something to do with switching off parallel GC
using -qg, whilst also passing -Nx.
Are there any known issues with this that people are aware of? At the
moment I am using the latest haskell platform release on arch.
I'd like to give 7.2 a try in case that fixes it, but I'm using rather a
lot of libraries (hmatrix, fclabels, random fu) and I don't know how to
install them for multiple ghc versions
On Wed, Oct 5, 2011 at 10:43 PM, Johan Tibell
wrote:
...
On Wed, Oct 5, 2011 at 2:37 PM, Tom Thorne
wrote:
...
The only problem is that now I am getting random occasional
segmentation faults that I was not been getting before, and once got a
message saying:
Main: schedule: re-entered unsafely
Perhaps a 'foreign import unsafe' should be 'safe'?
I think this may be something to do with creating a lot of sparks
though, since this occurs whether I have the parallel GC on or not.
Unless you (or some library you're using) is doing what the error
message says then you should file a GHC bug here:
http://hackage.haskell.org/trac/ghc/
-- Johan
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] SMP parallelism increasing GC time dramatically

Simon Marlow