
Follows three runs of a stupid program that should see a speedup (I believe). 1. Compiled with ghc-6.12.3. Run on 1 thread, 4.0 s (real). 2. Compiled with ghc-6.12.3. Run on 2 threads, spark converted, 5.4 s (real), 7.7 s (user). 3. Compiled with ghc-6.13.20100831. Run on 2 threads, but spark *not* converted 4.1 s (real). All runs with disabled parallel GC, which negatively impacts -N2 runs (whether the spark is converted or not). Two questions: A. Why is run (2) slower than (1)? GC is a bit more expensive, as one would expect, but not enough to explain the difference (only explains 0.3 s). B. ghc HEAD doesn't convert the spark. Why is that? Thanks. ---- module Main where import Control.Parallel fac :: Integer -> Integer fac n = aux n 1 where aux 0 _ = 0 aux 1 m = m aux n m = let p = n * m in p `pseq` aux (n - 1) p main = do let x = fac 100000 y = fac 100001 in print (x `par` y `pseq` (x + y)) ---- $ ghc-6.12.3 -O2 -threaded -rtsopts Test.hs --make $ time ./Test +RTS -N1 -qg -sstderr > /dev/null ./Test +RTS -N1 -qg -sstderr 20,460,660,928 bytes allocated in the heap 2,492,368 bytes copied during GC 462,128 bytes maximum residency (3 sample(s)) 76,576 bytes maximum slop 4 MB total memory in use (1 MB lost due to fragmentation) Generation 0: 35093 collections, 0 parallel, 0.14s, 0.17s elapsed Generation 1: 3 collections, 0 parallel, 0.00s, 0.00s elapsed MUT time (elapsed) GC time (elapsed) Task 0 (worker) : 0.00s ( 0.00s) 0.00s ( 0.00s) Task 1 (worker) : 0.00s ( 3.84s) 0.00s ( 0.00s) Task 2 (bound) : 3.87s ( 3.84s) 0.14s ( 0.17s) SPARKS: 1 (0 converted, 1 pruned) INIT time 0.00s ( 0.00s elapsed) MUT time 3.83s ( 3.84s elapsed) GC time 0.14s ( 0.17s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 3.97s ( 4.02s elapsed) %GC time 3.6% (4.3% elapsed) Alloc rate 5,345,814,403 bytes per MUT second Productivity 96.4% of total user, 95.2% of total elapsed gc_alloc_block_sync: 0 whitehole_spin: 0 gen[0].steps[0].sync_large_objects: 0 gen[0].steps[1].sync_large_objects: 0 gen[1].steps[0].sync_large_objects: 0 real 0m4.022s user 0m3.970s sys 0m0.051s $ time ./Test +RTS -N2 -qg -sstderr > /dev/null ./Test +RTS -N2 -qg -sstderr 20,600,461,512 bytes allocated in the heap 3,585,432 bytes copied during GC 458,480 bytes maximum residency (14 sample(s)) 76,576 bytes maximum slop 5 MB total memory in use (1 MB lost due to fragmentation) Generation 0: 34803 collections, 0 parallel, 0.43s, 0.48s elapsed Generation 1: 14 collections, 0 parallel, 0.00s, 0.00s elapsed MUT time (elapsed) GC time (elapsed) Task 0 (worker) : 0.00s ( 0.00s) 0.00s ( 0.00s) Task 1 (worker) : 3.72s ( 4.91s) 0.24s ( 0.25s) Task 2 (bound) : 3.91s ( 4.91s) 0.19s ( 0.23s) Task 3 (worker) : 0.00s ( 4.91s) 0.00s ( 0.00s) SPARKS: 1 (1 converted, 0 pruned) INIT time 0.00s ( 0.00s elapsed) MUT time 7.29s ( 4.91s elapsed) GC time 0.43s ( 0.48s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 7.72s ( 5.39s elapsed) %GC time 5.6% (8.9% elapsed) Alloc rate 2,824,733,790 bytes per MUT second Productivity 94.4% of total user, 135.2% of total elapsed gc_alloc_block_sync: 0 whitehole_spin: 0 gen[0].steps[0].sync_large_objects: 0 gen[0].steps[1].sync_large_objects: 0 gen[1].steps[0].sync_large_objects: 0 real 0m5.397s user 0m7.724s sys 0m0.443s $ ghc-6.13.20100831 -O2 -threaded -rtsopts Test.hs $ time ./Test +RTS -N2 -qg -sstderr > /dev/null ./Test +RTS -N2 -qg -sstderr 20,458,286,584 bytes allocated in the heap 2,705,816 bytes copied during GC 463,520 bytes maximum residency (3 sample(s)) 73,032 bytes maximum slop 5 MB total memory in use (1 MB lost due to fragmentation) Generation 0: 35088 collections, 0 parallel, 0.19s, 0.22s elapsed Generation 1: 3 collections, 0 parallel, 0.00s, 0.00s elapsed MUT time (elapsed) GC time (elapsed) Task 0 (worker) : 0.00s ( 0.00s) 0.00s ( 0.00s) Task 1 (worker) : 0.00s ( 3.92s) 0.00s ( 0.00s) Task 2 (bound) : 3.90s ( 3.92s) 0.19s ( 0.22s) Task 3 (worker) : 0.00s ( 3.92s) 0.00s ( 0.00s) SPARKS: 1 (0 converted, 1 pruned) INIT time 0.00s ( 0.00s elapsed) MUT time 3.84s ( 3.92s elapsed) GC time 0.19s ( 0.22s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 4.03s ( 4.14s elapsed) %GC time 4.7% (5.2% elapsed) Alloc rate 5,329,874,595 bytes per MUT second Productivity 95.3% of total user, 92.7% of total elapsed gc_alloc_block_sync: 0 whitehole_spin: 0 gen[0].sync_large_objects: 0 gen[1].sync_large_objects: 0 real 0m4.144s user 0m4.027s sys 0m0.059s