
Hello, I have a program that is intermittently experiencing performance issues that I believe are related to parallel GC, and I was hoping to get some advice on how I might improve it. Essentially, any given execution is either slow or fast (the same executable, without recompiling), most often slow. So far I can't find anything that would trigger either case. This is with ghc-7.4.2 on 64bit linux. Here are the statistics from running with -N4 -A8m -s: slow run: 16,647,460,328 bytes allocated in the heap 313,767,248 bytes copied during GC 17,305,120 bytes maximum residency (22 sample(s)) 601,952 bytes maximum slop 73 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 1268 colls, 1267 par 8.62s 8.00s 0.0063s 0.0389s Gen 1 22 colls, 22 par 0.63s 0.60s 0.0275s 0.0603s Parallel GC work balance: 1.53 (39176141 / 25609887, ideal 4) MUT time (elapsed) GC time (elapsed) Task 0 (worker) : 0.00s ( 0.01s) 0.00s ( 0.00s) Task 1 (worker) : 0.00s ( 13.66s) 0.01s ( 0.04s) Task 2 (bound) : 0.00s ( 13.98s) 0.00s ( 0.00s) Task 3 (worker) : 0.00s ( 18.14s) 0.16s ( 0.44s) Task 4 (worker) : 0.53s ( 17.49s) 1.29s ( 4.25s) Task 5 (worker) : 0.00s ( 17.45s) 1.25s ( 4.42s) Task 6 (worker) : 0.00s ( 14.98s) 1.75s ( 6.90s) Task 7 (worker) : 0.00s ( 21.87s) 0.02s ( 0.06s) Task 8 (worker) : 0.01s ( 37.12s) 0.06s ( 0.17s) Task 9 (worker) : 0.00s ( 21.41s) 4.88s ( 15.99s) Task 10 (worker) : 0.84s ( 43.06s) 1.99s ( 8.25s) Task 11 (bound) : 6.39s ( 51.13s) 0.06s ( 0.18s) Task 12 (worker) : 0.00s ( 0.00s) 8.04s ( 21.42s) Task 13 (worker) : 0.43s ( 28.38s) 8.14s ( 22.94s) Task 14 (worker) : 5.35s ( 29.30s) 5.81s ( 22.02s) SPARKS: 7 (7 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.03s ( 0.01s elapsed) MUT time 43.88s ( 42.71s elapsed) GC time 9.26s ( 8.60s elapsed) EXIT time 0.01s ( 0.01s elapsed) Total time 53.65s ( 51.34s elapsed) Alloc rate 374,966,825 bytes per MUT second Productivity 82.7% of total user, 86.4% of total elapsed gc_alloc_block_sync: 1388000 whitehole_spin: 0 gen[0].sync: 0 gen[1].sync: 0 -- -------------------------- Fast run: 42,061,441,560 bytes allocated in the heap 725,062,720 bytes copied during GC 36,963,480 bytes maximum residency (21 sample(s)) 1,382,536 bytes maximum slop 141 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 3206 colls, 3205 par 8.34s 1.87s 0.0006s 0.0089s Gen 1 21 colls, 21 par 0.76s 0.17s 0.0081s 0.0275s Parallel GC work balance: 1.78 (90535973 / 50955059, ideal 4) MUT time (elapsed) GC time (elapsed) Task 0 (worker) : 0.00s ( 0.00s) 0.00s ( 0.00s) Task 1 (worker) : 0.00s ( 0.00s) 0.00s ( 0.00s) Task 2 (worker) : 0.00s ( 11.50s) 0.00s ( 0.00s) Task 3 (worker) : 0.00s ( 12.40s) 0.00s ( 0.00s) Task 4 (worker) : 0.58s ( 12.20s) 0.59s ( 0.61s) Task 5 (bound) : 0.00s ( 12.89s) 0.00s ( 0.00s) Task 6 (worker) : 0.00s ( 13.40s) 0.02s ( 0.02s) Task 7 (worker) : 0.00s ( 14.66s) 0.00s ( 0.00s) Task 8 (worker) : 0.95s ( 14.18s) 0.69s ( 0.76s) Task 9 (worker) : 2.82s ( 13.50s) 1.37s ( 1.44s) Task 10 (worker) : 1.72s ( 17.59s) 1.07s ( 1.16s) Task 11 (worker) : 3.99s ( 24.68s) 0.37s ( 0.38s) Task 12 (worker) : 1.24s ( 24.25s) 0.80s ( 0.82s) Task 13 (bound) : 6.18s ( 25.02s) 0.04s ( 0.04s) Task 14 (worker) : 1.46s ( 23.42s) 1.59s ( 1.65s) Task 15 (worker) : 0.00s ( 0.00s) 0.66s ( 0.66s) Task 16 (worker) : 11.00s ( 23.36s) 1.67s ( 1.70s) SPARKS: 28 (28 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.04s ( 0.02s elapsed) MUT time 42.08s ( 23.02s elapsed) GC time 9.10s ( 2.04s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 51.69s ( 25.09s elapsed) Alloc rate 987,695,300 bytes per MUT second Productivity 82.3% of total user, 169.6% of total elapsed gc_alloc_block_sync: 164572 whitehole_spin: 0 gen[0].sync: 164 gen[1].sync: 18147 When I record an eventlog and view it with Threadscope, the slow run shows long frequent pauses for GC, whereas on a fast run GC is extremely fast. Running with the parallel collector disabled (-qg) is more-or-less consistently between these two runs. Given this, can anyone suggest any likely causes of this issue, or anything I might want to look for? Also, should I be concerned about the much larger gc_alloc_block_sync level for the slow run? Does that indicate the allocator waiting to alloc a new block, or is it something else? Am I on completely the wrong track? Thanks very much, John L.