SMP parallelism increasing GC time dramatically

I am having some strange performance issues when using SMP parallelism, that I think may be something to do with GC. Apologies for the large readouts below but I'm not familiar enough to know what is and isn't relevant!

I have a pure function that is mapped over a list of around 10 values, and this happens several times for each iteration of my program. It does some fairly intensive calculations using hmatrix, generating intermediate matrices along the way. The computation is significantly more complex for some values, so the work done by each call is not spread equally. I did some profiling and it seems like the program is spending about 50% of its time in that function. First of all, without any attempts at parallelism, I see this from ./Main +RTS -s

  67,142,126,336 bytes allocated in the heap
     147,759,264 bytes copied during GC
         109,384 bytes maximum residency (58 sample(s))
         354,408 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 104551 collections,  0 parallel,  1.13s,  1.11s elapsed
  Generation 1:     58 collections,  0 parallel,  0.01s,  0.01s elapsed

  Parallel GC work balance: -nan (0 / 0, ideal 1)

                        MUT time (elapsed)       GC time (elapsed)
  Task  0 (worker) :    0.00s    ( 67.06s)       0.00s   (  0.00s)
  Task  1 (worker) :    0.00s    ( 67.09s)       0.00s   (  0.00s)
  Task  2 (bound)  :   66.95s    ( 67.09s)       1.14s   (  1.12s)

  SPARKS: 0 (0 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   66.95s  ( 67.09s elapsed)
  GC    time    1.14s  (  1.12s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   68.09s  ( 68.21s elapsed)

  %GC time       1.7%  (1.6% elapsed)

  Alloc rate    1,002,835,517 bytes per MUT second

  Productivity  98.3% of total user, 98.2% of total elapsed

  gc_alloc_block_sync: 0
  whitehole_spin: 0
  gen[0].sync_large_objects: 0
  gen[1].sync_large_objects: 0

This looks ok to me...
Then if I try to use Control.Parallel to parallelise my code, simply replacing a map with parMap (rdeepseq), on a 12 core machine using +RTS -N12 -s I get this:

  66,065,148,144 bytes allocated in the heap
     197,202,056 bytes copied during GC
         181,312 bytes maximum residency (251 sample(s))
         387,240 bytes maximum slop
              12 MB total memory in use (3 MB lost due to fragmentation)

  Generation 0: 37592 collections, 37591 parallel, 245.32s, 26.67s elapsed
  Generation 1:   251 collections,   251 parallel,   3.12s,  0.33s elapsed

  Parallel GC work balance: 2.41 (24219609 / 10058220, ideal 12)

                        MUT time (elapsed)       GC time (elapsed)
  Task  0 (worker) :    0.00s    (  0.00s)       0.00s   (  0.00s)
  Task  1 (worker) :    0.00s    (  0.00s)       0.00s   (  0.00s)
  Task  2 (worker) :    0.00s    ( 17.97s)       0.00s   (  0.00s)
  Task  3 (worker) :    0.00s    ( 19.35s)       0.00s   (  0.00s)
  Task  4 (worker) :    0.00s    ( 40.28s)       0.00s   (  0.00s)
  Task  5 (worker) :    0.00s    ( 45.08s)       0.00s   (  0.00s)
  Task  6 (worker) :    0.00s    ( 47.06s)       0.00s   (  0.00s)
  Task  7 (worker) :   18.30s    ( 49.73s)      16.24s   (  1.71s)
  Task  8 (worker) :    0.00s    ( 51.22s)       0.00s   (  0.00s)
  Task  9 (worker) :    0.00s    ( 53.75s)       0.00s   (  0.00s)
  Task 10 (worker) :    0.00s    ( 54.17s)       0.00s   (  0.00s)
  Task 11 (worker) :    5.65s    ( 54.30s)       0.70s   (  0.08s)
  Task 12 (worker) :    0.00s    ( 54.41s)       0.41s   (  0.04s)
  Task 13 (worker) :    4.34s    ( 54.58s)       4.50s   (  0.48s)
  Task 14 (worker) :    5.82s    ( 54.76s)       5.91s   (  0.64s)
  Task 15 (worker) :    6.50s    ( 55.01s)       3.37s   (  0.38s)
  Task 16 (worker) :    7.60s    ( 55.21s)       8.56s   (  0.94s)
  Task 17 (worker) :   11.05s    ( 55.21s)       9.04s   (  0.96s)
  Task 18 (worker) :   11.75s    ( 55.21s)      12.94s   (  1.43s)
  Task 19 (worker) :   16.02s    ( 55.21s)      13.32s   (  1.43s)
  Task 20 (worker) :   26.98s    ( 55.23s)       7.43s   (  0.77s)
  Task 21 (worker) :    7.36s    ( 55.23s)       7.47s   (  0.83s)
  Task 22 (worker) :   16.08s    ( 55.23s)      10.25s   (  1.12s)
  Task 23 (worker) :    7.04s    ( 55.23s)       4.99s   (  0.57s)
  Task 24 (worker) :   28.47s    ( 55.23s)       8.78s   (  0.94s)
  Task 25 (worker) :    7.43s    ( 55.23s)       1.62s   (  0.18s)
  Task 26 (worker) :    6.33s    ( 55.23s)      11.42s   (  1.23s)
  Task 27 (worker) :    9.80s    ( 55.23s)       8.72s   (  0.95s)
  Task 28 (worker) :    4.88s    ( 55.26s)       8.92s   (  0.99s)
  Task 29 (worker) :    0.00s    ( 55.26s)       0.00s   (  0.00s)
  Task 30 (bound)  :    5.59s    ( 55.26s)       0.59s   (  0.06s)
  Task 31 (worker) :   41.16s    ( 55.26s)       3.48s   (  0.38s)
  Task 32 (worker) :   17.03s    ( 55.26s)       3.90s   (  0.42s)
  Task 33 (worker) :   14.89s    ( 55.26s)       5.29s   (  0.58s)
  Task 34 (worker) :    6.30s    ( 55.26s)       1.99s   (  0.21s)
  Task 35 (worker) :   16.13s    ( 55.26s)      13.95s   (  1.50s)
  Task 36 (worker) :   16.70s    ( 55.26s)      13.02s   (  1.41s)
  Task 37 (worker) :   11.68s    ( 55.26s)      15.45s   (  1.68s)
  Task 38 (worker) :    7.65s    ( 55.26s)      13.61s   (  1.50s)
  Task 39 (worker) :   14.18s    ( 55.26s)      12.48s   (  1.35s)
  Task 40 (worker) :    5.51s    ( 55.26s)      11.92s   (  1.31s)
  Task 41 (worker) :    3.10s    ( 55.26s)       8.17s   (  0.93s)

  SPARKS: 42525 (37707 converted, 519 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time  365.55s  ( 55.26s elapsed)
  GC    time  248.45s  ( 27.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time  614.01s  ( 82.26s elapsed)

  %GC time      40.5%  (32.8% elapsed)

  Alloc rate    180,724,403 bytes per MUT second

  Productivity  59.5% of total user, 444.4% of total elapsed

  gc_alloc_block_sync: 1339708
  whitehole_spin: 36
  gen[0].sync_large_objects: 197621
  gen[1].sync_large_objects: 6396

It seems like suddenly GC is taking a lot longer.
I also tried using Control.Monad.Par, with runPar $ do parMap, and using +RTS -N12 -s I get this:

  71,480,561,920 bytes allocated in the heap
     275,752,912 bytes copied during GC
         198,704 bytes maximum residency (343 sample(s))
         394,232 bytes maximum slop
              12 MB total memory in use (3 MB lost due to fragmentation)

  Generation 0: 37575 collections, 37574 parallel, 169.08s, 16.67s elapsed
  Generation 1:   343 collections,   343 parallel,   6.24s,  0.61s elapsed

  Parallel GC work balance: 2.56 (33928871 / 13275450, ideal 12)

                        MUT time (elapsed)       GC time (elapsed)
  Task  0 (worker) :    0.70s    ( 46.46s)       0.00s   (  0.00s)
  Task  1 (worker) :    2.94s    ( 46.46s)       0.00s   (  0.00s)
  Task  2 (worker) :    0.10s    ( 46.46s)       0.00s   (  0.00s)
  Task  3 (worker) :    0.06s    ( 46.47s)       0.00s   (  0.00s)
  Task  4 (worker) :    1.70s    ( 46.47s)       0.00s   (  0.00s)
  Task  5 (worker) :    0.07s    ( 46.47s)       0.00s   (  0.00s)
  Task  6 (worker) :    2.75s    ( 46.47s)       0.00s   (  0.00s)
  Task  7 (worker) :    1.21s    ( 46.47s)       0.00s   (  0.00s)
  Task  8 (worker) :    0.43s    ( 46.47s)       0.00s   (  0.00s)
  Task  9 (worker) :   10.44s    ( 46.47s)       0.00s   (  0.00s)
  Task 10 (worker) :    0.06s    ( 46.47s)       0.00s   (  0.00s)
  Task 11 (worker) :    1.36s    ( 46.47s)       0.00s   (  0.00s)
  Task 12 (worker) :    0.00s    ( 46.50s)      46.69s   (  4.63s)
  Task 13 (worker) :    0.00s    ( 46.50s)       0.00s   (  0.00s)
  Task 14 (bound)  :    0.05s    ( 46.50s)       0.56s   (  0.06s)
  Task 15 (worker) :   24.84s    ( 46.50s)       2.24s   (  0.21s)
  Task 16 (worker) :   28.00s    ( 46.50s)       1.95s   (  0.19s)
  Task 17 (worker) :   35.30s    ( 46.50s)       2.79s   (  0.27s)
  Task 18 (worker) :   29.03s    ( 46.50s)       4.06s   (  0.39s)
  Task 19 (worker) :   32.71s    ( 46.50s)       7.04s   (  0.69s)
  Task 20 (worker) :   32.74s    ( 46.50s)      11.47s   (  1.09s)
  Task 21 (worker) :   23.35s    ( 46.50s)      15.16s   (  1.49s)
  Task 22 (worker) :   36.00s    ( 46.50s)       7.94s   (  0.76s)
  Task 23 (worker) :    0.00s    ( 46.50s)      35.84s   (  3.63s)
  Task 24 (worker) :   24.70s    ( 46.50s)      21.26s   (  2.06s)
  Task 25 (worker) :   23.07s    ( 46.50s)      18.33s   (  1.82s)

  SPARKS: 0 (0 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time  307.74s  ( 46.50s elapsed)
  GC    time  175.32s  ( 17.29s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time  483.06s  ( 63.79s elapsed)

  %GC time      36.3%  (27.1% elapsed)

  Alloc rate    232,274,501 bytes per MUT second

  Productivity  63.7% of total user, 482.4% of total elapsed

  gc_alloc_block_sync: 1619540
  whitehole_spin: 258
  gen[0].sync_large_objects: 232661
  gen[1].sync_large_objects: 6378

This seems slightly better, but the GC time is still much bigger than I would expect it to be.

I don't think it is solely because of parallel GC, since running the parallel code with +RTS -N1 gives results similar to the non-parallel version.

Am I simply misunderstanding the way the GC time is represented in this output, or is there something amiss?

(I also noticed that there seem to be a lot more than 12 threads running if I look at top while the program is running, but I assume this is something to do with the runtime spawning more threads for GC etc)
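For reference, here is a minimal, self-contained sketch of the two parallel maps described above. It is purely illustrative: expensive is a stand-in for the hmatrix computation, not the real code, and it assumes the parallel and monad-par packages are installed.

import Control.Parallel.Strategies (parMap, rdeepseq)
import qualified Control.Monad.Par as Par

-- Stand-in for the expensive per-element computation.
expensive :: Double -> Double
expensive x = sum [sin (x * fromIntegral i) | i <- [1 .. 100000 :: Int]]

-- Strategies version: one spark per element, each result forced fully.
viaStrategies :: [Double] -> [Double]
viaStrategies = parMap rdeepseq expensive

-- Par-monad version of the same map.
viaParMonad :: [Double] -> [Double]
viaParMonad xs = Par.runPar (Par.parMap expensive xs)

main :: IO ()
main = print (sum (viaStrategies [1 .. 10]), sum (viaParMonad [1 .. 10]))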

Hi Tom,
I think debugging this sort of problem is exactly what we need to be doing
(and making easier). Have you tried Duncan's newest version of Threadscope
by the way?
It looks like -- completely aside from the GC time -- this program is not
scaling. The mutator time itself, disregarding GC, isn't going down much
with parallelism (with the total mutator time increasing drastically).
Either this is completely memory bottlenecked or there is some other kind
of bad interaction (e.g. false sharing, contention on a hot lock, etc).
My inclination would be to figure this out first before worrying about the
GC time. Is this code that you would be able to share for debugging?
I think we need to get together some general documentation on how to debug
this kind of problem. For example, you can get some hints as to the memory
behavior by running valgrind/cachegrind on the program. Also, what does
"top" say, by the way? Is the process using 1200% CPU?
Cheers,
-Ryan

Thanks for the reply. I haven't actually tried ThreadScope yet; I will have
a look at that tomorrow at some point. I also had no idea you could use
valgrind on Haskell programs, so I will look into that as well.
I think the program certainly does have problems scaling, since I made a
very basic attempt at parallelism that wasn't dividing the work equally
(although this was because in an early version of the program, this seemed
to work better than dividing the work more finely and evenly, possibly
because of the issue with parallel GC I mention below...)
I just changed the code to split the work into smaller, more even chunks
using Control.Parallel, and it seems much better -- conditional on one
thing. If I switch off the parallel GC by setting -qg, my code suddenly
runs much faster and outperforms the sequential version, coming in at
around 20 seconds walltime and spending only 1% of its time on GC. top
seems to be sitting at around 800%. I can post the output from +RTS -s as
well if that would help.
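As an illustration only (this is not the actual code from the program), one way to spark work in larger, more even chunks with Control.Parallel.Strategies is parListChunk; the chunk size of 10 used in main below is an arbitrary placeholder:

import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- One spark per chunk of elements rather than one per element, which
-- keeps the spark pool smaller and the work units more even.
chunkedParMap :: NFData b => Int -> (a -> b) -> [a] -> [b]
chunkedParMap chunkSize f xs = map f xs `using` parListChunk chunkSize rdeepseq

main :: IO ()
main = print (sum (chunkedParMap 10 (\x -> x * x) [1 .. 1000 :: Double]))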
The only problem is that now I am getting random occasional segmentation
faults that I was not getting before, and once got a message saying:
Main: schedule: re-entered unsafely
Perhaps a 'foreign import unsafe' should be 'safe'?
I think this may be something to do with creating a lot of sparks though,
since this occurs whether I have the parallel GC on or not.
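For reference, the distinction the runtime's hint is pointing at looks like the following. The binding here is to libc's sin purely for illustration; in this program the relevant foreign imports would presumably live inside hmatrix and its GSL bindings.

{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CDouble)

-- An 'unsafe' call is cheap but blocks its capability for the duration
-- and must never re-enter Haskell; a 'safe' call costs more but lets the
-- RTS (including the GC) carry on around it.
foreign import ccall unsafe "math.h sin" cSinUnsafe :: CDouble -> CDouble
foreign import ccall safe   "math.h sin" cSinSafe   :: CDouble -> CDouble

main :: IO ()
main = print (cSinUnsafe 1.0, cSinSafe 1.0)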
But it also seems that turning on the parallel GC makes the fine-grained
parallel code -- which creates around 1,000,000 sparks and outperforms the
sequential version if I turn parallel GC off -- run much slower (260s
walltime) than the coarse-grained parallel code (50,000 sparks, 82s
walltime). I think this is what originally led me to use the coarse-grained
parallel version that divides the work unequally.
I imagine there is still some room for improvement, since 20s is only about
3 times faster than the sequential version, but I think that is more an
issue of me being something of a novice Haskell programmer! It does seem to
me like there is something very strange going on with the GC though.
Unfortunately the code is for a paper (nothing to do with computer science
or Haskell, incidentally), and although I'm planning on open-sourcing it, I
can't do that until the paper is published.
thanks again
Tom

On Wed, Oct 5, 2011 at 2:37 PM, Tom Thorne wrote:
The only problem is that now I am getting random occasional segmentation faults that I was not getting before, and once got a message saying:

Main: schedule: re-entered unsafely
Perhaps a 'foreign import unsafe' should be 'safe'?

I think this may be something to do with creating a lot of sparks though, since this occurs whether I have the parallel GC on or not.
Unless you (or some library you're using) are doing what the error message says, you should file a GHC bug here: http://hackage.haskell.org/trac/ghc/
-- Johan

I'm trying to narrow it down so that I can submit a meaningful bug report,
and it seems to be something to do with switching off parallel GC using -qg,
whilst also passing -Nx.
Are there any known issues with this that people are aware of? At the moment
I am using the latest Haskell Platform release on Arch.
I'd like to give 7.2 a try in case that fixes it, but I'm using rather a lot
of libraries (hmatrix, fclabels, random-fu) and I don't know how to install
them for multiple GHC versions.

I have made a dummy program that seems to exhibit the same GC slowdown
behavior, minus the segmentation faults. Compiling with -threaded and
running with -N12, I get very bad performance (3x slower than -N1);
running with -N12 -qg, it runs approximately 3 times faster than -N1. I
don't know whether I should submit this as a bug or not. I'd certainly
like to know why this is happening!
import Numeric.LinearAlgebra
import Numeric.GSL.Special.Gamma
import Control.Parallel.Strategies
import Control.Monad
import Data.IORef
import Data.Random
import Data.Random.Source.PureMT
import Debug.Trace

--

subsets s n = (subsets_stream s) !! n

subsets_stream [] = [[]] : repeat []
subsets_stream (x:xs) =
  let r = subsets_stream xs
      s = map (map (x:)) r
  in [[]] : zipWith (++) s (tail r)

testfun :: Matrix Double -> Int -> [Int] -> Double
testfun x k cs = lngamma (det (c+u))
  where
    (m,c) = meanCov xx
    m' = fromRows [m]
    u = (trans m') <> m'
    xx = fromColumns ( [(toColumns x)!!i] ++ [(toColumns x)!!j] ++ [(toColumns x)!!k] )
    i = cs !! 0
    j = cs !! 1

test :: Matrix Double -> Int -> Double
test x i = sum p
  where
    p = parMap (rdeepseq) (testfun x (i+1)) (subsets [0..i] 2)

ranMatrix :: Int -> RVar (Matrix Double)
ranMatrix n = do
  xs <- mapM (\_ -> mapM (\_ -> uniform 0 1.0) [1..n]) [1..n]
  return (fromLists xs)

loop :: Int -> Double -> Int -> RVar Double
loop n s i = traceShow i $ do
  x <- ranMatrix n
  let r = sum $ parMap (rdeepseq) (test x) [2..(n-2)]
  return (r+s)

main = do
  let n = 100
  let iter = 5
  rng <- newPureMT
  rngr <- newIORef rng
  p <- runRVar (foldM (loop n) 0.0 [1..iter]) rngr
  print p
I have also found that the segmentation faults in my code disappear if I
switch from Control.Parallel to Control.Monad.Par, which is quite strange. I
get slightly better performance with Control.Parallel when it completes
without a seg. fault, and the frequency with which it does so seems to
depend on the number of sparks that are being created.

I'm not sure if this is at all related, but if I run a small Repa program
with more threads than I have cores/CPUs then it gets drastically slower. I
have a dual-core laptop, and -N2 makes my small program take approximately
0.6 of the time. Increasing to -N4, we're running at about 2x the time; with
-N8 it's taking 20x or more. I guess this is probably more down to the
design of Repa than to GHC itself?
Oliver

I am guessing that it is slowdown caused by GC needing to co-ordinate with
blocked threads. That requires lots of re-scheduling to happen in the
kernel.
This is a hard problem, I think, but also an increasingly important one as
virtualization becomes more widespread and the number of schedulable cores
is unknown.
Alexander

It's GHC, and partly the OS scheduler in some sense. Oversaturating,
i.e. using an -N option greater than your number of logical cores
(including hyperthreads), will typically slow down your program. This isn't
uncommon, and is well known - GHC's lightweight threads have an M:N
threading model, but for good performance, you typically want the OS
threads and cores to have a 1:1 correspondence. Creating a massive
amount of OS threads will cause much context switching between them,
which is going to slow things down. When GHC needs to synchronize OS
threads in order to do a GC, there are going to be a lot of threads
being context swapped in/out in order to achieve this (because the GC
must halt all mutator threads to do its thing.)
Furthermore, oversaturation isn't the only problem - having the same
number of threads as cores will mean *some* thread is going to get
de-scheduled. Many times this means that the thread in which GHC does
GC will get descheduled by the OS. A corollary of this descheduling
phenomenon is that even using the same # of OS threads as you have
cores could result in -worse- performance than N-1 OS threads. This
was mitigated a bit in 7.2.1 I think. Linux was affected much more
drastically than others (OS X and Solaris have vastly different
schedulers, and as a result the performance wouldn't just tank - it
would actually get better IIRC, it just wouldn't scale as well at that
point.) At the very least, on Linux, using an -N option equivalent to
your number of logical cores should not drastically slow things down
anymore - but it won't make them faster either. This is the "dreaded
last core slowdown" bug that's been known about for a while, and as a
result, you typically only see parallel speedup on Linux up to N-1
threads, where N = the number of cores you have.
As a result, with dual-core only (and no hyperthreading,) on Linux,
you're very unlikely to be able to get parallel speedup in any case.
There's work to fix this in the garbage collector among other things,
but it's not clear if it's going into GHC just yet.
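As a small aside (not from the thread itself), it is easy to sanity-check how many capabilities the RTS was actually given; with the GHC versions discussed here the value simply reflects the -N flag:

import GHC.Conc (numCapabilities)

-- Reports the -N value the program was started with, which per the
-- discussion above should normally stay at or below the number of
-- physical cores.
main :: IO ()
main = putStrLn ("Running with " ++ show numCapabilities ++ " capabilities")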
-- Regards, Austin

On 08/10/2011 01:47, austin seipp wrote:
It's GHC, and partly the OS scheduler in some sense. Oversaturating, i.e. using an -N option > your number of logical cores (including hyperthreads) will slow down your program typically. This isn't uncommon, and is well known - GHC's lightweight threads have an M:N threading model, but for good performance, you typically want the OS threads and cores to have a 1:1 correspondence. Creating a massive amount of OS threads will cause much context switching between them, which is going to slow things down. When GHC needs to synchronize OS threads in order to do a GC, there are going to be a lot of threads being context swapped in/out in order to achieve this (because the GC must halt all mutator threads to do its thing.)
Furthermore, oversaturation isn't the only problem - having the same number of threads as cores will mean *some* thread is going to get de-scheduled. Many times this means that the thread in which GHC does GC will get descheduled by the OS. A corollary of this descheduling phenomenon is that even using the same # of OS threads as you have cores could result in -worse- performance than N-1 OS threads. This was mitigated a bit in 7.2.1 I think. Linux was affected much more drastically than others (OS X and Solaris have vastly different schedulers, and as a result the performance wouldn't just tank - it would actually get better IIRC, it just wouldn't scale as well at that point.) At the very least, on Linux, using an -N option equivalent to your number of logical cores should not drastically slow things down anymore - but it won't make them faster either. This is the "dreaded last core slowdown" bug that's been known about for a while, and as a result, you typically only see parallel speedup on Linux up to N-1 threads, where N = the number of cores you have.
Incidentally, I don't think that's true any more with recent versions of GHC and Linux, I typically see speedup increasing all the way to the total number of cores, although sometimes the speedup when adding the last core is less. Take a look at the graphs in our recent papers for some concrete results.
As a result, with dual-core only (and no hyperthreading,) on Linux, you're very unlikely to be able to get parallel speedup in any case. There's work to fix this in the garbage collector among other things, but it's not clear if it's going into GHC just yet.
It probably depends on how much other activity is happening on the system. I get pretty good speedups for most benchmarks I try on my dual-core laptop running either Linux or Windows. Typically with Windows I have to wait a while after booting for all the background activity to die down, before I can use both cores reliably.
Cheers,
Simon

It would be really useful to see the ThreadScope output for this.
Apart from cache effects (which may well be significant at 12 cores),
the usual problems with parallel GHC are around synchronisation.
When GHC wants to perform a parallel GC it needs to stop all Haskell
threads. These are lightweight threads and are scheduled
cooperatively, i.e., there's no way to interrupt them from the outside
(except for the OS, but that doesn't help with GC). Usually, a thread
is able to yield whenever it tries to do an allocation, which is common
enough in normal Haskell. However, your work contains lots of matrix
computations which likely don't allocate in their inner loops, or which
call C to do their work, and neither is interruptible. My guess
would be that (at least part of) the reason for your slowdown is that
the parallel GC spends a lot of time waiting for threads to stop.
This would be apparent in Threadscope. (I may be wrong, because even
the single-threaded GC needs to stop all threads)
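As a purely illustrative sketch of the kind of code this refers to (assuming it is compiled with optimisations so that the loop does not allocate):

-- With -O2 this loop never allocates, so it never reaches a heap-check
-- point where the RTS could stop it; a GC requested by another
-- capability has to wait until the call returns.
spin :: Int -> Int
spin 0 = 0
spin n = spin (n - 1)

main :: IO ()
main = print (spin 1000000000)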
On 7 October 2011 18:21, Tom Thorne
I have made a dummy program that seems to exhibit the same GC slowdown behavior, minus the segmentation faults. Compiling with -threaded and running with -N12 I get very bad performance (3x slower than -N1), running with -N12 -qg it runs approximately 3 times faster than -N1. I don't know if I should submit this as a bug or not? I'd certainly like to know why this is happening! import Numeric.LinearAlgebra import Numeric.GSL.Special.Gamma import Control.Parallel.Strategies import Control.Monad import Data.IORef import Data.Random import Data.Random.Source.PureMT import Debug.Trace -- subsets s n = (subsets_stream s) !! n subsets_stream [] = [[]] : repeat [] subsets_stream (x:xs) = let r = subsets_stream xs s = map (map (x:)) r in [[]] : zipWith (++) s (tail r) testfun :: Matrix Double -> Int -> [Int] -> Double testfun x k cs = lngamma (det (c+u)) where (m,c) = meanCov xx m' = fromRows [m] u = (trans m') <> m' xx = fromColumns ( [(toColumns x)!!i] ++ [(toColumns x)!!j] ++ [(toColumns x)!!k] ) i = cs !! 0 j = cs !! 1
test :: Matrix Double -> Int -> Double test x i = sum p where p = parMap (rdeepseq) (testfun x (i+1)) (subsets [0..i] 2)
ranMatrix :: Int -> RVar (Matrix Double) ranMatrix n = do xs <- mapM (\_ -> mapM (\_ -> uniform 0 1.0) [1..n]) [1..n] return (fromLists xs)
loop :: Int -> Double -> Int -> RVar Double loop n s i = traceShow i $ do x <- ranMatrix n let r = sum $ parMap (rdeepseq) (test x) [2..(n-2)] return (r+s) main = do let n = 100 let iter = 5 rng <- newPureMT rngr <- newIORef rng p <- runRVar (foldM (loop n) 0.0 [1..iter]) rngr print p I have also found that the segmentation faults in my code disappear if I switch from Control.Parallel to Control.Monad.Par, which is quite strange. I get slightly better performance with Control.Parallel when it completes without a seg. fault, and the frequency with which it does so seems to depend on the number of sparks that are being created. On Thu, Oct 6, 2011 at 1:56 PM, Tom Thorne
wrote: I'm trying to narrow it down so that I can submit a meaningful bug report, and it seems to be something to do with switching off parallel GC using -qg, whilst also passing -Nx. Are there any known issues with this that people are aware of? At the moment I am using the latest haskell platform release on arch. I'd like to give 7.2 a try in case that fixes it, but I'm using rather a lot of libraries (hmatrix, fclabels, random fu) and I don't know how to install them for multiple ghc versions On Wed, Oct 5, 2011 at 10:43 PM, Johan Tibell
wrote: On Wed, Oct 5, 2011 at 2:37 PM, Tom Thorne
wrote: The only problem is that now I am getting random occasional segmentation faults that I had not been getting before, and once got a message saying:
Main: schedule: re-entered unsafely Perhaps a 'foreign import unsafe' should be 'safe'?
I think this may be something to do with creating a lot of sparks though, since this occurs whether I have the parallel GC on or not.
Unless you (or some library you're using) are doing what the error message says, you should file a GHC bug here:
http://hackage.haskell.org/trac/ghc/
-- Johan
-- Push the envelope. Watch it bend.

Yes, I will try to run ThreadScope on it. I tried it before, but the event log output came to about 1.8GB and then it crashed.
Is there any way to tell the RTS to perform GC less often? My code doesn't use much memory, and I'm using fairly hefty machines (e.g. one with 48 cores and 128GB of RAM), so perhaps the default/heuristic settings aren't optimal.
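(For reference, the eventlog that ThreadScope reads is produced by linking with -eventlog and passing -ls to the RTS; assuming GHC 7.x and a main module called Main.hs, the incantation is roughly:

ghc -O2 -threaded -rtsopts -eventlog Main.hs
./Main +RTS -N12 -ls

This writes Main.eventlog next to the binary. Running fewer iterations, or a smaller problem size, is the easiest way to keep that file down to something ThreadScope can open comfortably.)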
On Sun, Oct 9, 2011 at 4:16 PM, Thomas Schilling wrote:

On Mon, Oct 10, 2011 at 3:55 PM, Tom Thorne
Yes I will try to run threadscope on it, I tried it before and the event log output produced about 1.8GB, and then crashed. Is there any way to tell the RTS to perform GC less often? My code doesn't use too much memory and I'm using fairly hefty machines (e.g one with 48 cores and 128GB of RAM) and so perhaps the default/heuristic settings aren't optimal.
Increasing "-A" and "-H" in the RTS options should help with this.
G
--
Gregory Collins

Thanks! I just tried setting -A32M and this seems to fix the parallel GC problems: I now get a speedup with parallel GC on, and performance is the same as passing -qg. I had tried -H before and it only made things worse, but -A seems to do the trick.
I'm still having problems with segmentation faults though. Depending on how I apply parMap, and whether I use monad-par or Control.Parallel, they seem to come and go arbitrarily. In a successful run that lasted about 30s in total with Control.Parallel, +RTS -s reports:
SPARKS: 422712 (394377 converted, 0 pruned)
am I creating too many sparks?
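(That is more than 14,000 sparks per second over a 30-second run, so each spark represents only a small piece of work and the cost of creating and converting them may well be eating into the gains. A common way to cut the number down is to spark chunks of the list rather than individual elements, using parListChunk from Control.Parallel.Strategies. A sketch only, reusing testfun and subsets from the dummy program earlier in the thread, with the chunk size picked arbitrarily:

import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

test :: Matrix Double -> Int -> Double
test x i = sum p
  where
    -- one spark per 50-element chunk instead of one spark per element
    p = map (testfun x (i+1)) (subsets [0..i] 2)
          `using` parListChunk 50 rdeepseq

Whether this has any bearing on the segfaults is another question, but it should at least make the spark numbers saner.)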
On Mon, Oct 10, 2011 at 3:07 PM, Gregory Collins wrote:

On 10/10/2011 15:44, Tom Thorne wrote:
thanks! I just tried setting -A32M and this seems to fix the parallel GC problems, I now get a speedup with parallel GC on and performance is the same as passing -qg. I had tried -H before and it only made things worse, but -A seems to do the trick.
I'm still having problems with segmentation faults though. Depending on how I apply parMap, and whether I use monad-par or control.parallel, they seem to come and go arbitrarily. In a successful run that lasted about 30s in total with control.parallel, +RTS -s reports: SPARKS: 422712 (394377 converted, 0 pruned)
am I creating too many sparks?
Please report the bug. General rule: if in doubt, report it. We'll decide whether it's a bug or not. If it's a segfault, and you're not doing any FFI or unsafe stuff, and your hardware isn't faulty, then it's definitely a bug.
Cheers,
Simon

Thanks, I would, except that my code, whilst pure, uses hmatrix, and hmatrix uses LAPACK internally and so presumably calls FFI functions. As far as I know LAPACK ought to be thread-safe, but potentially the way it interfaces with Haskell in hmatrix isn't. I don't want to blame hmatrix, since it is an excellent library, but I can't rule it out and I'm not sure how to pinpoint where exactly it is happening.
I'm going to try to modify my code to use a pure Haskell implementation and see if the segfaults go away.
On Mon, Oct 10, 2011 at 4:25 PM, Simon Marlow
Please report the bug. General rule: if in doubt, report it. We'll decide whether it's a bug or not. If it's a segfault, and you're not doing any FFI or unsafe stuff, and your hardware isn't faulty, then it's definitely a bug.
Cheers, Simon

On Mon, Oct 10, 2011 at 16:44, Tom Thorne
thanks! I just tried setting -A32M and this seems to fix the parallel GC problems, I now get a speedup with parallel GC on and performance is the same as passing -qg. I had tried -H before and it only made things worse, but -A seems to do the trick.
You might be able to use ghc-gc-tune [1] to find the right settings for -A and -H.
Erik
[1] http://hackage.haskell.org/package/ghc-gc-tune

On Mon, Oct 10, 2011 at 15:55, Tom Thorne
Yes I will try to run threadscope on it, I tried it before and the event log output produced about 1.8GB, and then crashed.
Hi Tom,
I'm one of the TS/ghc-events hackers and I'd like to learn more, fix it, or at least put it on the TS/ghc-events issue tracker (http://trac.haskell.org/ThreadScope). Could you help me reproduce the problem? Did ThreadScope crash, or the RTS? Which versions? Was it 1.8GB of log file or of RAM? Did you succeed eventually? Any other TS feedback?
Thank you, Mikolaj

Hi,
I can't remember if it was ThreadScope that crashed or the RTS, since I was also having segfaults in the RTS because of this bug, which is fixed in 7.2.2: http://hackage.haskell.org/trac/ghc/ticket/5552
I successfully used ThreadScope by running my code for fewer iterations to produce a smaller log, and it was helpful for making sure I was dividing work equally between the threads. I think I still have the log file that was about 1.8GB, so I will try running ThreadScope on it and see what happens.
The performance problems I was having turned out to be fixed completely by changing the GC options passed to the RTS.
Thanks!
Tom
On Friday, 16 December 2011 at 09:20, Mikolaj Konarski wrote:

Tom, thank you very much for the ThreadScope feedback.
Anything new? Anybody? We are close to a new release,
so that's the last call for bug reports before the release.
Stay tuned. :)
On Fri, Dec 16, 2011 at 11:34, Tom Thorne wrote:

On Mon, 9 Jan 2012 18:22:57 +0100, Mikolaj Konarski
Tom, thank you very much for the ThreadScope feedback. Anything new? Anybody? We are close to a new release, so that's the last call for bug reports before the release. Stay tuned. :)
As it turns out, I ran into a similar issue with a concurrent Gibbs sampling implementation I've been working on. Increasing -H fixed the regression, as expected. I'd be happy to provide data if someone is interested.
Cheers,
- Ben

As it turns out, I ran into a similar issue with a concurrent Gibbs sampling implmentation I've been working on. Increasing -H fixed the regression, as expected. I'd be happy to provide data if someone was interested.
Yes, please. Even if it turns out not to be a ThreadScope issue, it's still a valuable test case. I will try reproducing it and report back.
Please use my email address

I don't know if this is relevant to your problems, but I'm currently struggling to get some performance out of a parallel - or rather, concurrent - program. Basically, the initial thread parses some data into an IntMap, and then multiple threads access this read-only to do the Real Work.
Now, there appears to be a lot of overhead incurred when using multiple threads, and I suspect that this is caused by the map storing unevaluated thunks, which are then forced by accesses from the worker threads. Ideally, the evaluation should be performed in parallel, but perhaps there are issues (of synchronization, say) that make this less performant?
-k
--
If I haven't seen further, it is by standing in the footprints of giants

Ketil,
For your particular problem, unevaluated thunks should be easy to check: dump a heap profile and look for a decreasing allocation of thunks. That being said, IntMap is spine strict, so the spine will all be evaluated, and if your threads are accessing disjoint keys there should be no contention. If there is contention, then yes, threads will block on evaluation; I don't have a good sense for how slow that tends to be. (Cache effects may be swamping you.) You may be interested in repa, if your maps are dense.
Edward
Excerpts from Ketil Malde's message of Wed Oct 05 17:00:11 -0400 2011:
I don't know if this is relevant to your problems, but I'm currently struggling to get some performance out of a parallel - or rather, concurrent - program.
Basically, the initial thread parses some data into an IntMap, and then multiple threads access this read-only to do the Real Work.
Now, there appears to be a lot of overhead incurred when using multiple threads, and I suspect that this is caused by the map storing unevaluated thunks, which then are forced by accesses by the worker threads. Ideally, the evaluation should be performed in parallel, but perhaps there are issues (of synchronization, say) that makes this less performant?
-k
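(If unevaluated thunks are indeed the culprit, one cheap experiment is to force the whole map on the initial thread before the workers ever see it, so that no worker blocks on another thread's evaluation. A sketch only: parseData and its type are invented for illustration, and it assumes an NFData instance for the map, which deepseq/containers provide:

import Control.DeepSeq (force)
import Control.Exception (evaluate)
import qualified Data.IntMap as IM

-- Hypothetical stand-in for the real parser.
parseData :: String -> IM.IntMap [Double]
parseData = IM.fromList . zip [0..] . map (map read . words) . lines

-- Fully evaluate the map (spine and values) before sharing it read-only.
loadFully :: FilePath -> IO (IM.IntMap [Double])
loadFully path = do
  contents <- readFile path
  evaluate (force (parseData contents))

If the multi-threaded overhead disappears after that, thunk contention was the problem; if it doesn't, cache effects or the GC settings discussed earlier in the thread are better suspects.)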
participants (14):
- Alexander Kjeldaas
- austin seipp
- Ben Gamari
- Edward Z. Yang
- Erik Hesselink
- Gregory Collins
- Johan Tibell
- Ketil Malde
- Mikolaj Konarski
- Oliver Batchelor
- Ryan Newton
- Simon Marlow
- Thomas Schilling
- Tom Thorne