Timeout on pure code

Hi! Our company's main commercial product is a Snap-based web app which we compile with GHC 7.8.4. It runs on four app servers currently load-balanced behind Haproxy. I recently implemented a new piece of functionality, which led to weird behavior that I have no idea how to debug, so I'm asking here for help and ideas!

The new functionality is this: on a specific URL handler, we need to query n external services concurrently with a timeout, then gather and render the results. Easy (in Haskell)! The implementation looks, as you might imagine, something like this (sorry for the almost-real Haskell -- I'm sure I forgot tons of imports and other things, but I hope everything is clear as-is; if not, I'll be glad to update the gist to make things more specific): https://gist.github.com/k-bx/0cf7035aaf1ad6306e76

Now, this works wonderfully for some time, and in the logs I can see both successful fetches of external content and lots of timeouts from our external providers. Life is good. But! After several days of work (sometimes a day, sometimes a couple of days), the apps on all 4 servers go crazy. It might take some interval (like 20 minutes) before they're all crazy, so it's not perfectly synchronous. Now: how crazy, exactly?

First of all, this endpoint times out. Haproxy asks for a response, the response times out, and so requests "hang".

Secondly, the logs are interesting. If you look at the code in the gist once again, you can see that some of the CandidateProviders don't actually require any networking work, so all they do is log that they're working (I added this as part of debugging, actually) and return pure data. What's weird is that they time out too! Here's how our log output starts to look after the bug happens:

```
[2015-04-22 09:56:20] provider: CandidateProvider1
[2015-04-22 09:56:20] provider: CandidateProvider2
[2015-04-22 09:56:21] Got timeout while requesting CandidateProvider1
[2015-04-22 09:56:21] Got timeout while requesting CandidateProvider2
[2015-04-22 09:56:22] provider: CandidateProvider1
[2015-04-22 09:56:22] provider: CandidateProvider2
[2015-04-22 09:56:23] Got timeout while requesting CandidateProvider1
[2015-04-22 09:56:23] Got timeout while requesting CandidateProvider2
... and so on
```

What's also weird is that even after the timeout is logged, the string "Got responses!" never gets logged either! So the hang happens somewhere in between.

I'm sorry that I don't have strace output right now; I'll have to wait until this situation happens again, and then I'll get back to you with that info.

So, how is it possible that almost-pure code times out? And why does it hang afterwards? CPU and other resource usage is quite low, and so is the number of open file descriptors (it seems).

Thanks for all the suggestions in advance!
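P.S. For those who don't want to open the gist, the handler is roughly this shape (a simplified sketch: the names here are placeholders and putStrLn stands in for the real logging calls; the actual code is in the gist above):

```
import Control.Concurrent.Async (mapConcurrently)
import System.Timeout (timeout)

-- Placeholder names: the real types and HTTP calls live in the gist.
data CandidateProvider = CandidateProvider1 | CandidateProvider2
  deriving (Show)

-- Some providers do real network requests, others just log and return
-- pure data; all of them go through the same half-second timeout.
queryProvider :: CandidateProvider -> IO String
queryProvider p = return ("stub response from " ++ show p)

fetchOne :: CandidateProvider -> IO (Maybe String)
fetchOne p = do
  putStrLn ("provider: " ++ show p)
  mres <- timeout 500000 (queryProvider p)  -- 0.5 s budget per provider
  case mres of
    Nothing -> do
      putStrLn ("Got timeout while requesting " ++ show p)
      return Nothing
    Just r -> return (Just r)

main :: IO ()
main = do
  results <- mapConcurrently fetchOne [CandidateProvider1, CandidateProvider2]
  putStrLn "Got responses!"
  print results  -- stands in for rendering the response in the Snap handler
```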

Given your gist, the timeout on your requests is set to half a second, so
it's conceivable that a highly loaded server might have GC pause times
approaching that long. Smells to me like a classic Haskell memory leak
(that's why the problem occurs after the server has been up for a while):
run your program with the heap profiler, and audit any shared
tables/IORefs/MVars to make sure you are not building up thunks there.
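As a toy illustration of the kind of thunk build-up I mean (not your code, just the pattern to look for): a shared counter bumped lazily accumulates a chain of suspended additions, while the strict variant keeps the value flat.

```
import Control.Monad (replicateM_)
import Data.IORef (IORef, atomicModifyIORef, atomicModifyIORef', newIORef, readIORef)

-- Lazy: each call stores an unevaluated (n + 1) thunk; under load the chain
-- of suspended additions keeps growing until something forces the value.
bumpLazy :: IORef Int -> IO ()
bumpLazy ref = atomicModifyIORef ref (\n -> (n + 1, ()))

-- Strict: the new value is forced before it is written back, so the
-- reference always holds a plain evaluated Int.
bumpStrict :: IORef Int -> IO ()
bumpStrict ref = atomicModifyIORef' ref (\n -> (n + 1, ()))

main :: IO ()
main = do
  ref <- newIORef 0
  replicateM_ 1000000 (bumpLazy ref)  -- builds a million-deep thunk chain
  readIORef ref >>= print             -- all the deferred work is forced here
```

A heap profile makes this kind of growth show up as a steadily climbing band of thunks.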
Greg
--
Gregory Collins

Gregory,
The servers are far from highly loaded, since they're currently under much less load than they used to be. Memory consumption is stable and low, and there's also a lot of free RAM.
Would you say that, given these factors, this scenario is unlikely?

Maybe, but it would be helpful to rule the scenario out. Johan's ekg library is also useful: it runs a web server on a separate port that you can use to track metrics like GC times, etc.
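Wiring ekg in is only a couple of lines -- something like this sketch (host and port are arbitrary, and the GC numbers only show up if the binary runs with +RTS -T):

```
{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import System.Remote.Monitoring (forkServer)

main :: IO ()
main = do
  -- Serves RTS/GC metrics as HTML and JSON on http://localhost:8000
  _ <- forkServer "localhost" 8000
  forever (threadDelay 1000000)  -- stand-in for the real Snap server loop
```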
Other options for further debugging include gathering strace logs from the binary. You'll have to do some data gathering to narrow down the cause, unfortunately -- the HTTP client? your code? the Snap server? the GHC event manager (System.Timeout is implemented on top of it)? the GC? etc.
G
--
Gregory Collins

All right, good news!
After adding ekg, gathering its data via Bosun, and seeing nothing useful, I figured out that I could try harder to reproduce the issue myself instead of waiting for users to do it. And I succeeded! :)
After launching 20 infinite curl loops against that handler's URL I was quickly able to reproduce the issue, so the task seems clear now: keep reducing the code and reproduce it locally, possibly without the external services. I'll write up once I get somewhere.
Thanks.

An update for everyone interested (and not): it turned out to be neither the GHC RTS, Snap, nor a networking issue -- it's hslogger being very slow. I thought it was only slow when used concurrently, but I just ran a test that writes 2000 5 kB messages sequentially, and that takes 111 seconds (while a minimal program that writes the same 2000 messages finishes in 0.12 s).
I hope I'll have a chance to investigate why hslogger is so slow in the future, but meanwhile I'll just remove the logging.
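For reference, the test is roughly this shape (a simplified sketch of what I measured, run against the default hslogger setup; my real configuration differs a bit and I can share the exact code separately):

```
import Control.Monad (replicateM_)
import System.Log.Logger
  (Priority (INFO), infoM, rootLoggerName, setLevel, updateGlobalLogger)

main :: IO ()
main = do
  updateGlobalLogger rootLoggerName (setLevel INFO)
  let msg = replicate 5000 'x'          -- ~5 kB payload
  replicateM_ 2000 (infoM "bench" msg)  -- 2000 sequential log calls
```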

Well, I do now :) I'll also check out the new but interesting haskell-logger [0]. I just think it would be a good idea to investigate hslogger, since it seems a lot of people are already using it, so an update that improves their performance with no additional work on their part would make them happy.
[0]: https://github.com/wdanilo/haskell-logger
On Fri, Apr 24, 2015 at 10:29 PM, Mikhail Glushenkov <the.dead.shall.rise@gmail.com> wrote:
Hi,
Have you considered using fast-logger?
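Its buffered API is only a couple of calls -- roughly this sketch:

```
import System.Log.FastLogger
  (defaultBufSize, newStdoutLoggerSet, pushLogStrLn, rmLoggerSet, toLogStr)

main :: IO ()
main = do
  -- Log lines go into an in-memory buffer and are written out in batches.
  logger <- newStdoutLoggerSet defaultBufSize
  pushLogStrLn logger (toLogStr "hello from fast-logger")
  rmLoggerSet logger  -- flushes remaining output and releases the buffer
```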

Hi,
I've had a look at this as we use hslogger too, so I'm keen to avoid
this kind of performance issue. I threw a quick Criterion benchmark
together:
https://gist.github.com/DaveCTurner/f977123b4498c4c64569
The headline result on my test machine is that each log call takes
~540us, so 2000 should take about a second. I'd be interested if you
could run the same benchmark on your setup, as it's possible that
something else downstream is causing your problem.
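The core of the benchmark is roughly this (a trimmed-down sketch; the message size and the handler setup here are placeholders, and the gist above also covers the /dev/log and UDP syslog variants):

```
import Criterion.Main (bench, defaultMain, whnfIO)
import System.Log.Logger
  (Priority (INFO), infoM, rootLoggerName, setLevel, updateGlobalLogger)

main :: IO ()
main = do
  updateGlobalLogger rootLoggerName (setLevel INFO)  -- handler setup elided
  defaultMain
    [ -- one sample = one log call through whatever handlers are configured
      bench "hslogger infoM" (whnfIO (infoM "bench" (replicate 5000 'x')))
    ]
```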
A couple of things that might be worth bearing in mind: if you're
talking to syslog over /dev/log then that can block if the log daemon
falls behind: unix datagram sockets don't drop datagrams when they're
congested. If the /dev/log test is slow but the UDP test is fast then
it could be that your syslog can't handle the load.
I'm using rsyslogd and have enabled the feature that combines
identical messages, so this test doesn't generate much disk IO and it
keeps up easily, so the UDP and /dev/log tests run about equally fast
for me. Is your syslog writing out every message? It may be flushing
to disk after every message too, which would be terribly slow.
If you're not logging to syslog, what's your hslogger config?
Cheers,
David

Hi David.
I planned to create a detailed bug report in hslogger's issue tracker on Monday to start the investigation there (as a better place), but since I have the code prepared already, it's easy to share right now: https://gist.github.com/k-bx/ccf6fd1c73680c8a4345
I'm launching it as:
time ./dist/build/seq/seq &> /dev/null
We don't use the syslog driver; instead we have a separate file-to-syslog worker to decouple those components.

I see.
The issue seems to be the default handler which writes the log to
stderr. Replacing 'addHandler h' with 'setHandlers [h]' makes it run
in a reasonable time as 'setHandlers' starts afresh; conversely it's
still slow even if you just use the default handler on its own, i.e.
removing the call to 'addHandler'.
I was a bit suspicious about the 'hFlush' in System.Log.Handler.Simple
but removing that didn't help. However, suspecting that the issue may
be too-much-flushing I eventually found
https://ghc.haskell.org/trac/ghc/ticket/7418 which says that writing
to stderr is slow because its default buffering mode is NoBuffering. I
added 'hSetBuffering stderr LineBuffering' and boom, it runs at a
sensible speed.
Recommend you get rid of writing to stderr and just log to the file
unless you've a good reason to send the output both ways, in which
case switch to LineBuffering as above.
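In code, the two options look roughly like this (a sketch; the file path is arbitrary):

```
import System.IO (BufferMode (LineBuffering), hSetBuffering, stderr)
import System.Log.Handler.Simple (fileHandler)
import System.Log.Logger
  (Priority (INFO), infoM, rootLoggerName, setHandlers, setLevel, updateGlobalLogger)

main :: IO ()
main = do
  -- Either make the default stderr handler cheap to write to...
  hSetBuffering stderr LineBuffering
  -- ...or replace it entirely with a single file handler via setHandlers
  -- (addHandler would keep the slow default stderr handler around as well).
  h <- fileHandler "app.log" INFO
  updateGlobalLogger rootLoggerName (setHandlers [h] . setLevel INFO)
  infoM "Main" "logging now goes to app.log only; stderr is out of the path"
```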
If you're using rsyslog you may also be interested in reading
http://www.rsyslog.com/doc/queues.html - this describes how it can
decouple the final log output from the input using a combination of
in-memory and on-disk buffers, and even discard lower-priority
messages from the queue if the going gets really tough. Implementing
our own on-disk spool looked like a lot of effort, and we were
particularly worried about it growing without bound if the downstream
was too slow, so this feature of rsyslog was just what we needed.
Hope that helps,
David

David, thank you very much. I can confirm everything you wrote: both switching to LineBuffering and removing the output to stderr resolve the problem.
I will create a ticket for hslogger proposing that it set LineBuffering on stderr when logging there is enabled. That should help people like me who don't know they have the problem but may suffer from it from time to time (at the very least, every write to stderr slows response time down quite significantly).
Thanks for the rsyslog "queues" article, I'll take a look. When I was given the task of centralizing logs I didn't want to add rsyslog as one more required service, and I also had the impression that current syslog implementations are somewhat outdated, with multiple badly supported forks, so I did the minimal work of forwarding logs from files to a central rsyslog in a decoupled way. But maybe it's time to migrate everything to rsyslog and get rid of the forwarder services.
Anyway, thank you once again.

You might already have considered it, but the GHC eventlog together with Threadscope and ghc-events-analyze (http://www.well-typed.com/blog/2014/02/ghc-events-analyze/) can be very helpful to debug such issues.
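For example, wrapping each provider call in START/STOP user events lets ghc-events-analyze attribute wall-clock time per provider -- a sketch (compile with -eventlog and run with +RTS -l, as the blog post describes):

```
import Control.Exception (bracket_)
import Debug.Trace (traceEventIO)

-- ghc-events-analyze groups the time between matching "START x" / "STOP x"
-- user events under the label x.
labelled :: String -> IO a -> IO a
labelled name =
  bracket_ (traceEventIO ("START " ++ name))
           (traceEventIO ("STOP " ++ name))

main :: IO ()
main = labelled "CandidateProvider1" (putStrLn "stand-in for the provider call")
```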

Niklas,
This seems like a very helpful tool indeed, but I'm not completely sure how it could help with this specific problem. Is there anything specific you would suggest measuring with it, or do you consider it more of a generally nice tool for approaching the problem?
Thanks!

I meant it in the general sense of approaching the problem. The linked blog post takes as its example a web server that doesn't respond well after some amount of time, and I associated that with your problem description.
participants (5)
- David Turner
- Gregory Collins
- Kostiantyn Rybnikov
- Mikhail Glushenkov
- Niklas Hambüchen