
On 08/10/2011 01:47, austin seipp wrote:
It's GHC, and partly the OS scheduler in some sense. Oversaturating, i.e. using an -N option> your number of logical cores (including hyperthreads) will slow down your program typically. This isn't uncommon, and is well known - GHC's lightweight threads have an M:N threading model, but for good performance, you typically want the OS threads and cores to have a 1:1 correspondence. Creating a massive amount of OS threads will cause much context switching between them, which is going to slow things down. When GHC needs to synchronize OS threads in order to do a GC, there are going to be a lot of threads being context swapped in/out in order to achieve this (because the GC must halt all mutator threads to do its thing.)
Furthermore, oversaturation isn't the only problem - having the same number of threads as cores will mean *some* thread is going to get de-scheduled. Many times this means that the thread in which GHC does GC will get descheduled by the OS. A corollary of this descheduling phenomenon is that even using the same # of OS threads as you have cores could result in -worse- performance than N-1 OS threads. This was mitigated a bit in 7.2.1 I think. Linux was affected much more drastically than others (OS X and Solaris have vastly different schedulers, and as a result the performance wouldn't just tank - it would actually get better IIRC, it just wouldn't scale as well at that point.) At the very least, on Linux, using an -N option equivalent to your number of logical cores should not drastically slow things down anymore - but it won't make them faster either. This is the "dreaded last core slowdown" bug that's been known about for a while, and as a result, you typically only see parallel speedup on Linux up to N-1 threads, where N = the number of cores you have.
Incidentally, I don't think that's true any more with recent versions of GHC and Linux, I typically see speedup increasing all the way to the total number of cores, although sometimes the speedup when adding the last core is less. Take a look at the graphs in our recent papers for some concrete results.
As a result, with dual-core only (and no hyperthreading,) on Linux, you're very unlikely to be able to get parallel speedup in any case. There's work to fix this in the garbage collector among other things, but it's not clear if it's going into GHC just yet.
It probably depends on how much other activity is happening on the system. I get pretty good speedups for most benchmarks I try on my dual-core laptop running either Linux or Windows. Typically with Windows I have to wait a while after booting for all the background activity to die down, before I can use both cores reliably. Cheers, Simon
On Fri, Oct 7, 2011 at 2:31 PM, Oliver Batchelor
wrote: I'm not sure if this is at all related, but if I run a small Repa program with more threads than I have cores/CPUs then it gets drastically slower, I have a dual core laptop - and -N2 makes my small program take approximately 0.6 of the time. Increasing to -N4 and we're running about 2x the time, -N8 and it's taking 20x or more. I guess this is probably more down to the design of Repa rather than GHC itself? Oliver
On Sat, Oct 8, 2011 at 1:21 AM, Tom Thorne
wrote: I have made a dummy program that seems to exhibit the same GC slowdown behavior, minus the segmentation faults. Compiling with -threaded and running with -N12 I get very bad performance (3x slower than -N1), running with -N12 -qg it runs approximately 3 times faster than -N1. I don't know if I should submit this as a bug or not? I'd certainly like to know why this is happening! import Numeric.LinearAlgebra import Numeric.GSL.Special.Gamma import Control.Parallel.Strategies import Control.Monad import Data.IORef import Data.Random import Data.Random.Source.PureMT import Debug.Trace -- subsets s n = (subsets_stream s) !! n subsets_stream [] = [[]] : repeat [] subsets_stream (x:xs) = let r = subsets_stream xs s = map (map (x:)) r in [[]] : zipWith (++) s (tail r) testfun :: Matrix Double -> Int -> [Int] -> Double testfun x k cs = lngamma (det (c+u)) where (m,c) = meanCov xx m' = fromRows [m] u = (trans m')<> m' xx = fromColumns ( [(toColumns x)!!i] ++ [(toColumns x)!!j] ++ [(toColumns x)!!k] ) i = cs !! 0 j = cs !! 1
test :: Matrix Double -> Int -> Double test x i = sum p where p = parMap (rdeepseq) (testfun x (i+1)) (subsets [0..i] 2)
ranMatrix :: Int -> RVar (Matrix Double) ranMatrix n = do xs<- mapM (\_ -> mapM (\_ -> uniform 0 1.0) [1..n]) [1..n] return (fromLists xs)
loop :: Int -> Double -> Int -> RVar Double loop n s i = traceShow i $ do x<- ranMatrix n let r = sum $ parMap (rdeepseq) (test x) [2..(n-2)] return (r+s) main = do let n = 100 let iter = 5 rng<- newPureMT rngr<- newIORef rng p<- runRVar (foldM (loop n) 0.0 [1..iter]) rngr print p I have also found that the segmentation faults in my code disappear if I switch from Control.Parallel to Control.Monad.Par, which is quite strange. I get slightly better performance with Control.Parallel when it completes without a seg. fault, and the frequency with which it does so seems to depend on the number of sparks that are being created. On Thu, Oct 6, 2011 at 1:56 PM, Tom Thorne
wrote: I'm trying to narrow it down so that I can submit a meaningful bug report, and it seems to be something to do with switching off parallel GC using -qg, whilst also passing -Nx. Are there any known issues with this that people are aware of? At the moment I am using the latest haskell platform release on arch. I'd like to give 7.2 a try in case that fixes it, but I'm using rather a lot of libraries (hmatrix, fclabels, random fu) and I don't know how to install them for multiple ghc versions On Wed, Oct 5, 2011 at 10:43 PM, Johan Tibell
wrote: On Wed, Oct 5, 2011 at 2:37 PM, Tom Thorne
wrote: The only problem is that now I am getting random occasional segmentation faults that I was not been getting before, and once got a message saying: Main: schedule: re-entered unsafely Perhaps a 'foreign import unsafe' should be 'safe'? I think this may be something to do with creating a lot of sparks though, since this occurs whether I have the parallel GC on or not.
Unless you (or some library you're using) is doing what the error message says then you should file a GHC bug here:
http://hackage.haskell.org/trac/ghc/
-- Johan
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe