Re: memory fragmentation with ghc-7.6.1

Simon Marlow writes:
On 28/09/12 17:36, Ben Gamari wrote:
Unfortunately, after poking around I found a few obvious problems with both the code and my testing configuration which explained the performance drop. Things seem to be back to normal now. Sorry for the noise! Great job on the new codegen.
That's good to hear, thanks for letting me know!
Of course! That being said, I have run into a bit of a performance issue which could be related to the runtime system. In particular, as I scale up the thread count (from 6 up to 48, the core count of the machine) in my program[1] (test data available), I'm seeing the total runtime increase, as well as a corresponding increase in CPU-seconds used. This is despite the RTS claiming consistently high (~94%) productivity. Meanwhile Threadscope shows that nearly all of my threads are working busily, with very few STM retries and no idle time. This is in an application which should scale reasonably well (or so I believe). Attached below you will find a crude listing of various runtime statistics over a variety of thread counts (ranging into what should probably be regarded as the absurdly large).

The application is a parallel Gibbs sampler for learning probabilistic graphical models. It involves a set of worker threads (updateWorkers) pulling work units off of a common TQueue. After grabbing a work unit, the thread will read a reference to the current global state from an IORef. It will then begin a long-running calculation, resulting in a small value (forced to normal form with deepseq) which it then communicates back to a global update thread (diffWorker) in the form of a lambda through another TQueue. The global update thread then maps the global state (the same as was read from the IORef earlier) through this lambda with atomicModifyIORef'. This is all implemented in [2].

I do understand that I'm asking a lot of the language and I have been quite impressed by how well Haskell and GHC have stood up to the challenge thus far. That being said, the behavior I'm seeing seems a bit strange. If synchronization overhead were the culprit, I'd expect to observe STM retries or thread blocking, which I do not see (by eye it seems that STM retries occur on the order of 5/second and worker threads otherwise appear to run uninterrupted except for GC; a GHC event log from a 16-thread run is available here[3]). If GC were the problem, I would expect this to be manifested in the productivity, which it is clearly not. Do you have any idea what else might be causing such extreme performance degradation with higher thread counts? I would appreciate any input you would have to offer.

Thanks for all of your work!

Cheers,

- Ben

[1] https://github.com/bgamari/bayes-stack/v2
[2] https://github.com/bgamari/bayes-stack/blob/v2/BayesStack/Core/Gibbs.hs
[3] http://goldnerlab.physics.umass.edu/~bgamari/RunCI.eventlog

Performance of Citation Influence model on lda-handcraft data set
1115 arcs, 702 nodes, 50 items per node average
100 sweeps in blocks of 10, 200 topics
Running with +RTS -A1G
ghc-7.7 9c15249e082642f9c4c0113133afd78f07f1ade2

Cores   User time (s)   Walltime (s)   CPU %   Productivity
======  =============   ============   =====   ============
  2        488.66          269.41       188%      93.7%
  3        533.43          195.28       281%      94.1%
  4        603.92          166.94       374%      94.3%
  5        622.40          138.16       466%      93.8%
  6        663.73          123.00       558%      94.2%
  7        713.96          114.17       647%      94.0%
  8        724.66          101.98       736%      93.7%
  9        802.75          100.59       826%        .
 10        865.05           97.69       917%        .
 11        966.97           99.09      1010%        .
 12       1238.42          114.28      1117%
 13       1242.43          106.53      1206%
 14       1428.59          112.48      1310%
 15       1299.40           97.52      1387%
 16       1559.99          108.86      1481%
 17       1972.02          126.49      1604%
 18       2157.03          130.91      1696%
 19       1966.24          115.29      1770%        .
 20       2693.64          146.76      1887%        .
 21       3051.16          158.48      1990%        .
 22       4100.88          199.18      2109%      93.7%
 23       4156.94          193.38      2201%      93.5%
 24       4780.46          212.13      2303%      94.3%
 25       5733.64          242.78      2407%        .
 26       7806.47          313.92      2526%        .
 27       6368.32          249.65      2596%        .
 28       8563.18          320.26      2717%

-sstderr output from 2 cores:
======================================

  214,358,463,960 bytes allocated in the heap
    2,261,367,592 bytes copied during GC
       71,053,384 bytes maximum residency (7 sample(s))
        2,077,816 bytes maximum slop
             2299 MB total memory in use (0 MB lost due to fragmentation)

                                  Tot time (elapsed)  Avg pause  Max pause
  Gen  0   100 colls,  100 par     28.58s    14.27s     0.1427s    0.3345s
  Gen  1     7 colls,    6 par      3.36s     1.74s     0.2486s    0.3241s

  Parallel GC work balance: 25.06% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.03s  (  0.03s elapsed)
  MUT     time  474.28s  (252.98s elapsed)
  GC      time   31.94s  ( 16.01s elapsed)
  EXIT    time    0.12s  (  0.12s elapsed)
  Total   time  506.49s  (269.26s elapsed)

  Alloc rate    451,969,681 bytes per MUT second

  Productivity  93.7% of total user, 176.2% of total elapsed

gc_alloc_block_sync: 108391
whitehole_spin: 0
gen[0].sync: 7392
gen[1].sync: 97126

-sstderr output from 24 cores:
=======================================

  262,335,150,000 bytes allocated in the heap
    1,323,818,912 bytes copied during GC
       67,648,048 bytes maximum residency (7 sample(s))
        2,927,176 bytes maximum slop
            25172 MB total memory in use (0 MB lost due to fragmentation)

                                  Tot time (elapsed)  Avg pause  Max pause
  Gen  0    29 colls,   29 par    162.40s     6.77s     0.2335s    0.3469s
  Gen  1     7 colls,    6 par    118.34s     5.32s     0.7600s    1.2448s

  Parallel GC work balance: 5.61% (serial 0%, perfect 100%)

  TASKS: 27 (1 bound, 26 peak workers (26 total), using -N24)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.34s  (  0.34s elapsed)
  MUT     time 4606.75s  (197.59s elapsed)
  GC      time  280.74s  ( 12.09s elapsed)
  EXIT    time    0.40s  (  0.40s elapsed)
  Total   time 4888.61s  (210.80s elapsed)

  Alloc rate    56,945,837 bytes per MUT second

  Productivity  94.3% of total user, 2185.8% of total elapsed

gc_alloc_block_sync: 5397994
whitehole_spin: 0
gen[0].sync: 32282
gen[1].sync: 5072
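For readers who do not want to open [2], the following is a minimal sketch of the updateWorkers/diffWorker structure described above. It is illustrative only: the types (GlobalState, WorkUnit, Diff) and helper functions are hypothetical stand-ins, not the actual bayes-stack code.

    {-# LANGUAGE BangPatterns #-}
    -- Sketch of the pattern described above: workers pull units from a
    -- shared TQueue, snapshot the global state from an IORef, compute a
    -- small fully-evaluated result, and hand it back as a function to a
    -- single update thread, which folds it in with atomicModifyIORef'.
    import Control.Concurrent.STM (TQueue, atomically, readTQueue, writeTQueue)
    import Control.DeepSeq (NFData (..), force)
    import Data.IORef (IORef, readIORef, atomicModifyIORef')

    data GlobalState = GlobalState      -- stand-in for the sampler state
    data WorkUnit    = WorkUnit         -- stand-in for one unit of work
    data Diff        = Diff             -- small result of one work unit

    instance NFData Diff where rnf Diff = ()

    computeDiff :: GlobalState -> WorkUnit -> Diff
    computeDiff _ _ = Diff              -- placeholder for the long-running calculation

    applyDiff :: Diff -> GlobalState -> GlobalState
    applyDiff _ s = s                   -- placeholder for folding a diff into the state

    -- One of the worker threads.
    updateWorker :: TQueue WorkUnit -> IORef GlobalState
                 -> TQueue (GlobalState -> GlobalState) -> IO ()
    updateWorker workQueue stateRef diffQueue = loop
      where
        loop = do
          unit  <- atomically (readTQueue workQueue)
          state <- readIORef stateRef                -- snapshot of the global state
          let !d = force (computeDiff state unit)    -- forced to normal form before sending
          atomically (writeTQueue diffQueue (applyDiff d))
          loop

    -- The single global update thread.
    diffWorker :: IORef GlobalState -> TQueue (GlobalState -> GlobalState) -> IO ()
    diffWorker stateRef diffQueue = loop
      where
        loop = do
          f <- atomically (readTQueue diffQueue)
          atomicModifyIORef' stateRef (\s -> (f s, ()))
          loop

In the real program a number of updateWorker threads and a single diffWorker thread would be forked, all sharing the two queues and the state IORef.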

Hi Ben,

My guess would be that you're running into some kind of memory bottleneck. Three common ones are:

(1) total memory bandwidth
(2) cache ping-ponging
(3) NUMA overheads

You would run into (1) if you were using an allocation area size (-A or -H) larger than the L2 cache. Your stats seem to indicate that you're running with a large heap - could that be the case?

(2) happens if you share data a lot between cores. It can also happen if the RTS shares data between cores, but I've tried to squash as much of that as I can.

(3) is sadly something that happens on these large AMD machines (and to some extent large multicore Intel boxes too). Improving our NUMA support is something we really need to do. NUMA overheads tend to manifest as very unpredictable runtimes.

I suggest using perf to gather some low-level stats about cache misses and suchlike.

http://hackage.haskell.org/trac/ghc/wiki/Debugging/LowLevelProfiling/Perf

Cheers,
Simon
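Point (1) looks plausible for the runs above: they use +RTS -A1G, and the reported totals (2299 MB at -N2, 25172 MB at -N24) are consistent with a 1 GB allocation area per capability, far larger than any L2 cache. One quick experiment along the lines of the perf suggestion (the binary name and -N value below are placeholders, not from the thread) would be to compare cache-miss counts for a huge versus a small allocation area:

    # placeholder invocation; substitute the real benchmark binary and arguments
    perf stat -e cache-references,cache-misses ./gibbs-bench +RTS -N16 -A1G -s
    perf stat -e cache-references,cache-misses ./gibbs-bench +RTS -N16 -A2m -s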

Hi Ben,
I would bet on the same memory issues Simon mentioned. But... while you're at it, would you mind trying a little experiment to share your work items through a lock-free queue rather than a TQueue?

http://hackage.haskell.org/package/lockfree-queue

Under some situations this can yield some benefit. But again, you didn't see workers retrying transactions, so this is probably not an issue.
-Ryan
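As a rough illustration of Ryan's suggestion, the queue side might look like the sketch below. The API (newQ / pushL / tryPopR) is assumed from the DequeClass interface in abstract-deque, which lockfree-queue's Michael-Scott queue implements; none of this is taken from bayes-stack, and the polling loop is only illustrative.

    -- Hypothetical sketch: share work items through a lock-free
    -- Michael-Scott queue instead of a TQueue.
    import Control.Concurrent (yield)
    import Data.Concurrent.Deque.Class (newQ, pushL, tryPopR)
    import Data.Concurrent.Queue.MichaelScott (LinkedQueue)

    data WorkUnit = WorkUnit            -- stand-in for one unit of work

    mkWorkQueue :: IO (LinkedQueue WorkUnit)
    mkWorkQueue = newQ

    -- Producer: push work items onto one end of the queue.
    enqueueWork :: LinkedQueue WorkUnit -> [WorkUnit] -> IO ()
    enqueueWork q = mapM_ (pushL q)

    -- Consumer: tryPopR is non-blocking, so unlike readTQueue a worker
    -- has to poll; this version just yields between attempts and would
    -- need a real termination/back-off story.
    dequeueWork :: LinkedQueue WorkUnit -> IO WorkUnit
    dequeueWork q = do
      mw <- tryPopR q
      case mw of
        Just w  -> return w
        Nothing -> yield >> dequeueWork q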
participants (3)

- Ben Gamari
- Ryan Newton
- Simon Marlow