However, the parallel GC will be a problem if one or more of your cores is being used by other processes on the machine. In that case, the GC synchronisation will stall and performance will go down the drain. You can often see this on a ThreadScope profile as a big delay during GC while the other cores wait for the delayed core. Make sure your machine is quiet and/or use one fewer core than the total available. It's not usually a good idea to use hyperthreaded cores either.
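As a rough illustration of the "one fewer core" advice, here is a minimal sketch that sets the capability count from inside the program. It assumes the binary is built with -threaded; leaveOneFree is a hypothetical helper name, not part of any library. Note that getNumProcessors counts logical processors, so on a hyperthreaded machine it may report more than the physical core count.

```haskell
import Control.Concurrent (getNumCapabilities, setNumCapabilities)
import GHC.Conc (getNumProcessors)

-- Hypothetical helper: leave one processor free for other processes,
-- but never ask for fewer than one capability.
leaveOneFree :: Int -> Int
leaveOneFree n = max 1 (n - 1)

main :: IO ()
main = do
  -- Number of processors the RTS can see (logical, not physical, cores).
  procs <- getNumProcessors
  -- Assumes the program was linked with -threaded.
  setNumCapabilities (leaveOneFree procs)
  caps <- getNumCapabilities
  putStrLn ("processors: " ++ show procs ++ ", capabilities: " ++ show caps)
```

The same effect can usually be had without code changes by passing an explicit count on the command line, e.g. +RTS -N7 on an 8-core machine.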
Does it ever help to set the number of GC threads greater than numCapabilities to over-partition the GC work? The idea would be to enable some load balancing in the face of perturbation from external load on the machine...
It looks like GHC 6.10 had a "-g" flag for this that... later went away?
-Ryan