Fwd: Removing latency spikes. Garbage collector related?

Will Sewell

29 Sep 2015

5:03 a.m.

Thanks for the reply Greg. I have already tried tweaking these values a bit, and this is what I found:

* I first tried -A256k because the L2 cache is that size (Simon Marlow mentioned this can lead to good performance http://stackoverflow.com/a/3172704/1018290)
* I then tried a value of -A2048k because he also said "using a very large young generation size might outweigh the cache benefits". I don't exactly know what he meant by "a very large young generation size", so I guessed at this value. Is it in the right ballpark?
* With -H, I tried values of -H8m, -H32m, -H128m, -H512m, -H1024m

But all led to worse performance than the defaults (and -H didn't really have much effect at all).

I will try your suggestion of setting -A to the L3 cache size.

Are there any other values I should try setting these at?

As for your final point, I have run space profiling, and it looks like 90% of the memory is used for our message index, which is a temporary store of messages that have gone through the system. These messages are stored in aligned chunks in memory that are merged together. I initially thought this was causing the spikes, but they were still there even after I removed the component. I will try and run space profiling in the build with the message index.

Thanks again.

On 28 September 2015 at 19:02, Gregory Collins wrote:

...

On Mon, Sep 28, 2015 at 9:08 AM, Will Sewell wrote:

...
If it is the GC, then is there anything that can be done about it?

* Increase value of -A (the default is too small) -- best value for this is L3 cache size of the chip
* Increase value of -H (total heap size) -- this will use more RAM but you'll run GC less often
* This will sound flip, but: generate less garbage. Frequency of GC runs is proportional to the amount of garbage being produced, so if you can lower mutator allocation rate then you will also increase net productivity. Built-up thunks can transparently hide a lot of allocation so fire up the profiler and tighten those up (there's an 80-20 rule here). Reuse output buffers if you aren't already, etc.

G

-- Gregory Collins
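
To illustrate the point about built-up thunks hiding allocation, here is a minimal sketch (not code from the thread) contrasting a lazy fold with a strict one. The `Totals` type, the function names and the message-size data are all hypothetical; the only library function used is `foldl'` from `Data.List`. Whether the lazy version actually builds thunks in a given program depends on optimisation level, so treat it as an illustration of the pattern to look for in a profile rather than a diagnosis.

```haskell
module Main (main) where

import Data.List (foldl')

-- | Running totals for a stream of message sizes (hypothetical example data).
-- The strictness annotations (!) mean no unevaluated thunk is ever stored
-- in either field.
data Totals = Totals
  { totalCount :: !Int
  , totalBytes :: !Int
  }

-- | Lazy version: foldl with a lazy tuple can build a chain of unevaluated
-- (+) thunks that is only forced when the result is finally demanded.
lazyTotals :: [Int] -> (Int, Int)
lazyTotals = foldl (\(c, b) n -> (c + 1, b + n)) (0, 0)

-- | Strict version: foldl' forces the accumulator at every step, and the
-- strict fields force both counters, so thunks do not accumulate.
strictTotals :: [Int] -> Totals
strictTotals = foldl' step (Totals 0 0)
  where
    step (Totals c b) n = Totals (c + 1) (b + n)

main :: IO ()
main = do
  let sizes = [1 .. 1000000]
      totals = strictTotals sizes
  print (lazyTotals (take 10 sizes))
  print (totalCount totals, totalBytes totals)
```

Both functions compute the same result; the difference is only in how much intermediate heap they may hold on to while doing so, which is the kind of cost a heap profile would expose.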


Neil Davies

29 Sep 2015

5:16 a.m.

New subject: Removing latency spikes. Garbage collector related?

Will

Is your issue with the spikes in response time, rather than the mean values?

If so, once you’ve reduced the amount of unnecessary mutation, you might want to take more control over when the GC is taking place. You might want to disable GC on timer (-I0) and force GC to occur at points you select - we found this useful.

Lastly, is the arrival pattern (and distribution pattern) of messages constant or variable? Just making sure that you are not trying to fight basic queueing theory here.

Neil
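
As a rough sketch of the "force GC to occur at points you select" idea, the program below processes messages in batches and calls `performMajorGC` from `System.Mem` between batches. The batch size, `handleMessage` and `processInBatches` are hypothetical stand-ins, not anything from the system discussed in this thread; where a forced collection is actually cheap enough depends entirely on the application.

```haskell
module Main (main) where

import Control.Monad (forM_)
import System.Mem (performMajorGC)

-- | Split a list into batches of at most n elements (n must be positive).
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | n <= 0    = error "chunksOf: batch size must be positive"
  | null xs   = []
  | otherwise = let (batch, rest) = splitAt n xs in batch : chunksOf n rest

-- | Hypothetical stand-in for real message handling.
handleMessage :: Int -> IO ()
handleMessage msg = msg `seq` pure ()

-- | Handle messages batch by batch and force a major collection after each
-- batch, i.e. at a point the program chooses. Returns the number of forced
-- collections.
processInBatches :: Int -> [Int] -> IO Int
processInBatches batchSize msgs = do
  let batches = chunksOf batchSize msgs
  forM_ batches $ \batch -> do
    mapM_ handleMessage batch
    performMajorGC
  pure (length batches)

main :: IO ()
main = do
  n <- processInBatches 1000 [1 .. 10000]
  putStrLn ("forced " ++ show n ++ " major collections")
```

To combine this with -I0, the program would be built with `-threaded -rtsopts` and run as `./prog +RTS -I0`; the -I option controls the idle-time major GC of the threaded runtime, which appears to be what "GC on timer" refers to above.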

_______________________________________________ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users

Will Sewell

5:35 a.m.


Thank you for the reply Neil.

The spikes are in response time. The graph I linked to shows the distribution of response times in a given window of time (darkness of the square is the number of messages in a particular window of response time). So the spikes are in the mean and also the max response time. Having said that, I'm not exactly sure what you mean by "mean values".

I will have a look into -I0.

Yes, the arrival of messages is constant. This graph shows the number of messages that have been published to the system: http://i.imgur.com/ADzMPIp.png

Neil Davies

6:45 a.m.


Will

I was trying to get a feeling for what those coloured squares actually denoted - typically we examine this sort of performance information as CDFs (cumulative distribution functions[1]), trying to pull apart the issues that are “mean” affecting (i.e. typical path through code/system) and those that are “tail” affecting (i.e. exceptions - and GC running could be seen as an “exception” - one that you can manage and time shift in the relative timing).

I’m assuming that messages have a similar “cost” (i.e. similar work to complete) - so that a uniform arrival rate equates to a uniform rate of work to be done arriving.

Neil

[1] We plot the CDFs in two ways, the “usual” way for the major part of the probability mass and then as a (1-CDF) on a log-log scale to expose the tail behaviour.
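
The two views of the latency distribution described in the footnote can be sketched as follows. This is a minimal illustration, not the tooling used by anyone in the thread: `empiricalCDF`, `complementaryCDF` and `logLogTail` are hypothetical names, and the sample latencies in `main` are made-up numbers chosen only to show a mostly flat distribution with one large outlier.

```haskell
module Main (main) where

import Data.List (group, sort)

-- | Empirical CDF: for each distinct sample value x (in ascending order),
-- the fraction of samples that are <= x.
empiricalCDF :: [Double] -> [(Double, Double)]
empiricalCDF samples = zip values fractions
  where
    n         = fromIntegral (length samples) :: Double
    runs      = group (sort samples)              -- equal values grouped together
    values    = [x | (x : _) <- runs]
    counts    = scanl1 (+) (map length runs)      -- cumulative sample counts
    fractions = [fromIntegral c / n | c <- counts]

-- | Complementary CDF (1 - CDF): the fraction of samples strictly above x.
complementaryCDF :: [Double] -> [(Double, Double)]
complementaryCDF samples = [(x, 1 - p) | (x, p) <- empiricalCDF samples]

-- | Points for a log-log plot of the tail. Points with a non-positive
-- coordinate are dropped because they have no logarithm; this always
-- removes the largest sample, where 1 - CDF is zero.
logLogTail :: [Double] -> [(Double, Double)]
logLogTail samples =
  [ (logBase 10 x, logBase 10 q)
  | (x, q) <- complementaryCDF samples
  , x > 0
  , q > 0
  ]

main :: IO ()
main = do
  -- Made-up latencies in milliseconds: mostly fast, with one large spike.
  let latenciesMs = replicate 95 1.0 ++ replicate 4 2.0 ++ [1000.0]
  mapM_ print (empiricalCDF latenciesMs)
  mapM_ print (logLogTail latenciesMs)
```

On the ordinary CDF the spike is almost invisible because it carries only 1% of the probability mass; the log-log view of 1 - CDF stretches exactly that region, which is why it is useful for tail latency.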

Will Sewell

8:37 a.m.


That's interesting. I have not done this kind of work before, and had not come across CDFs. I can see why it makes sense to look at the mean and tail.

Your assumption is correct. The messages have a similar cost, which is why the graph I posted is relatively flat most of the time. The spikes suggest to me that it is a tail-affecting issue because the messages are following the same code path as when it is running normally.

Gregory Collins

11:33 a.m.


On Tue, Sep 29, 2015 at 2:03 AM, Will Sewell wrote:

...

* I then tried a value of -A2048k because he also said "using a very large young generation size might outweigh the cache benefits". I don't exactly know what he meant by "a very large young generation size", so I guessed at this value. Is it in the right ballpark?

I usually use 2-8M for this value, depending on the chip. Most values in the young generation are going to be garbage, and collection is O(num_live_objects), so as long as you can keep this buffer and your working set (i.e. the long-lived stuff that doesn't get GC'ed) in L3 cache, higher values are better.

I expect there is another such phase transition as you set -A around the L2 cache size, but everything depends on what your program is actually doing. Keeping a smaller young generation will mean that those cache lines are hotter than they would be if you set it larger, and that means increasing L2 cache pressure and potentially evicting working set, so maybe you make average GC pause time shorter (helping with tail latency) at the expense of doing GC more often and maybe reducing the amount of L2 cache available.

* With -H, I tried values of -H8m, -H32m, -H128m, -H512m, -H1024m

...

But all led to worse performance than the defaults (and -H didn't really have much effect at all).

What you should expect to see as you increase -H is that major GC pauses become more infrequent, but average GC times go up. Dumping +RTS -S for us will also help us understand your GC behaviour, since I wouldn't expect to see 1s pauses on any but the largest heaps. Are you using large MutableArrays?

-- Gregory Collins
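
Alongside the `+RTS -S` output, the same kind of information can be read from inside the program. The sketch below uses `getRTSStats` from `GHC.Stats`, which assumes GHC 8.2 or later (base 4.10); compilers from the time of this thread exposed a different interface. `meanPauseMs` is a hypothetical helper, and the figure it reports is an average over all collections, so it will understate the rare long pauses that `-S` shows line by line.

```haskell
module Main (main) where

import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- | Mean elapsed time per collection in milliseconds, given the total GC
-- time in nanoseconds and the number of collections.
meanPauseMs :: Integer -> Integer -> Double
meanPauseMs totalNs count
  | count <= 0 = 0
  | otherwise  = fromIntegral totalNs / fromIntegral count / 1.0e6

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "RTS stats are disabled; run with +RTS -T"
    else do
      -- Allocate a little so that there is something to report.
      let xs = [1 .. 1000000] :: [Int]
      print (sum xs)
      s <- getRTSStats
      putStrLn ("collections:       " ++ show (gcs s))
      putStrLn ("major collections: " ++ show (major_gcs s))
      putStrLn ("max live bytes:    " ++ show (max_live_bytes s))
      putStrLn ("mean GC time (ms): "
                ++ show (meanPauseMs (fromIntegral (gc_elapsed_ns s))
                                     (fromIntegral (gcs s))))
```

The statistics are only collected when the program runs with `+RTS -T` (or one of the more verbose statistics flags such as `-S`), so it needs to be built with `-rtsopts` or with `-with-rtsopts=-T`.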
