The parallel GC currently doesn't behave well with concurrent programs that use multiple capabilities (aka OS threads), and the behaviour you see is a known symptom of this. I believe that Simon Marlow has some fixes in hand that may go into 6.12.2.

Are you saying that you see two different classes of undesirable performance, one with -qg and one without? How are your threads in your real program communicating with each other? We've seen problems there when there's a lot of contention for e.g. IORefs among thousands of threads.
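
To make the IORef point concrete, here is a rough sketch of the pattern I have in mind. It is purely illustrative and not taken from your program: the name contendedCount and the thread and iteration counts are made up, and I'm assuming a base recent enough to provide atomicModifyIORef'. Many forkIO threads all update one shared IORef, so every update competes for the same cell.

```haskell
-- Hypothetical illustration of IORef contention; not from the original program.
module Main where

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM_)
import Data.IORef (newIORef, atomicModifyIORef', readIORef)

-- | Fork 'nThreads' lightweight threads that each increment one shared
-- counter 'perThread' times, wait for all of them, and return the total.
contendedCount :: Int -> Int -> IO Int
contendedCount nThreads perThread = do
    counter <- newIORef (0 :: Int)
    done    <- newEmptyMVar
    forM_ [1 .. nThreads] $ \_ -> forkIO $ do
        -- Every thread hammers the same IORef: this is the contended part.
        replicateM_ perThread $
            atomicModifyIORef' counter (\n -> (n + 1, ()))
        putMVar done ()
    -- Wait for every worker to signal completion.
    replicateM_ nThreads (takeMVar done)
    readIORef counter

main :: IO ()
main = contendedCount 1000 1000 >>= print
```

If your real program looks anything like this, it would be worth comparing its timings with +RTS -N1, -N2 and -N2 -qg (built with -threaded, as in your Pi example) to see whether the slowdown follows the shared state rather than the GC.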

On Mon, Mar 1, 2010 at 7:59 AM, Michael Lesniak <mlesniak@uni-kassel.de> wrote:
Hello haskell-cafe,

Sorry for this long post, but I can't think of a way to describe and explain
the problem in a shorter way.

I'm (again) seeing very strange behaviour with the parallel GC and would be glad
if someone could either reproduce (and explain) it or provide a solution. A
similar but unrelated problem has been described in [1].


EXAMPLE CODE
The following demonstration program, which is a much smaller and
single-threaded version of my real problem, behaves like my real program.
It does some number crunching by calculating pi to a definable precision:

> -- File Pi.hs
> -- you need the numbers package from hackage.
> module Main where
> import Data.Number.CReal
> import System.Environment
> import GHC.Conc
>
> main = do
>     digits <- (read . head) `fmap` getArgs :: IO Int
>     calcPi digits
>
> calcPi digits = showCReal (fromEnum digits) pi `pseq` return ()

Compile it with

 ghc --make -threaded -O2 Pi.hs -o pi


BENCHMARKS
On my two-core machine I get the following quite strange and
unpredictable results:

* Using one thread:

   $ for i in `seq 1 5`;do time pi 5000 +RTS -N1;done

   real        0m1.441s
   user        0m1.390s
   sys 0m0.020s

   real        0m1.449s
   user        0m1.390s
   sys 0m0.000s

   real        0m1.399s
   user        0m1.370s
   sys 0m0.010s

   real        0m1.401s
   user        0m1.380s
   sys 0m0.000s

   real        0m1.404s
   user        0m1.380s
   sys 0m0.000s


* Using two threads, hence the parallel GC is used:

   $ for i in `seq 1 5`;do time pi 5000 +RTS -N2;done

   real        0m2.540s
   user        0m2.490s
   sys 0m0.010s

   real        0m1.527s
   user        0m1.530s
   sys 0m0.010s

   real        0m1.966s
   user        0m1.900s
   sys 0m0.010s

   real        0m5.670s
   user        0m5.620s
   sys 0m0.010s

   real        0m2.966s
   user        0m2.910s
   sys 0m0.020s


* Using two threads, but disabling the parallel GC:

   $ for i in `seq 1 5`;do time pi 5000 +RTS -N2 -qg;done

   real        0m1.383s
   user        0m1.380s
   sys 0m0.010s

   real        0m1.420s
   user        0m1.360s
   sys 0m0.010s

   real        0m1.406s
   user        0m1.360s
   sys 0m0.010s

   real        0m1.421s
   user        0m1.380s
   sys 0m0.000s

   real        0m1.360s
   user        0m1.360s
   sys 0m0.000s


THREADSCOPE
I've additionally attached the threadscope profile of a really bad run,
started with

    $ time pi 5000 +RTS -N2 -ls

   real        0m15.594s
   user        0m15.490s
   sys 0m0.010s

as file pi.pdf


FURTHER INFORMATION/QUESTION
Just disabling the parallel GC leads to very bad performance in my original
code, which forks threads with forkIO and does a lot of communication. Hence,
using -qg is not a real option for me.

Have I overlooked some crucial aspect of this problem? If you've
read this far, thank you for reading ... this far ;-)

Cheers,
 Michael



[1] http://osdir.com/ml/haskell-cafe@haskell.org/2010-02/msg00850.html


--
Dipl.-Inf. Michael C. Lesniak
University of Kassel
Programming Languages / Methodologies Research Group
Department of Computer Science and Electrical Engineering

Wilhelmshöher Allee 73
34121 Kassel

Phone: +49-(0)561-804-6269
