Re: [Haskell-cafe] Timeout on pure code

26 Apr 2015

      Hi David.

I planned to create a detailed bug-report at hslogger's issues to start
investigation there (as a better place) on Monday, but since I have code
prepared already, it's easy to share it right now:
https://gist.github.com/k-bx/ccf6fd1c73680c8a4345

I'm launching it as:

    time ./dist/build/seq/seq &> /dev/null

We don't use syslog driver, instead we have a separate file-to-syslog
worker to decouple these components.

On Sun, Apr 26, 2015 at 12:08 AM, David Turner <
dct25-561bs@mythic-beasts.com> wrote:
...
Hi,
I've had a look at this as we use hslogger too, so I'm keen to avoid
this kind of performance issue. I threw a quick Criterion benchmark
together:
https://gist.github.com/DaveCTurner/f977123b4498c4c64569
The headline result on my test machine are that each log call takes
~540us, so 2000 should take about a second. Would be interested if you
could run the same benchmark on your setup as it's possible that
there's something else downstream that's causing you a problem.
A couple of things that might be worth bearing in mind: if you're
talking to syslog over /dev/log then that can block if the log daemon
falls behind: unix datagram sockets don't drop datagrams when they're
congested. If the /dev/log test is slow but the UDP test is fast then
it could be that your syslog can't handle the load.
I'm using rsyslogd and have enabled the feature that combines
identical messages, so this test doesn't generate much disk IO and it
keeps up easily, so the UDP and /dev/log tests run about equally fast
for me. Is your syslog writing out every message? It may be flushing
to disk after every message too, which would be terribly slow.
If you're not logging to syslog, what's your hslogger config?
Cheers,
David
On 24 April 2015 at 20:25, Kostiantyn Rybnikov <k-bx@k-bx.com> wrote:
...
An update for everyone interested (and not). Turned out it's neither GHC
RTS, Snap or networking issues, it's hslogger being very slow. I thought
it's slow when used concurrently, but just did a test when it writes 2000
5kb messages sequentially and that finishes in 111 seconds (while minimal
program that writes same 2000 messages finishes in 0.12s).
I hope I'll have a chance to investigate why hslogger is so slow in
future,
but meanwhile will just remove logging.
On Thu, Apr 23, 2015 at 4:08 PM, Kostiantyn Rybnikov <k-bx@k-bx.com>
wrote:
...
All right, good news!
After adding ekg, gathering its data via bosun and seeing nothing
useful I
...
actually figured out that I could try harder to reproduce issue by
myself
instead of waiting for users to do that. And I succeeded! :)
So, after launching 20 infinite curl loops to that handler's url I was
quickly able to reproduce the issue, so the task seems clear now: keep
reducing the code, reproduce locally, possibly without external services
etc. I'll write up after I get to something.
Thanks.
On Wed, Apr 22, 2015 at 11:09 PM, Gregory Collins
<greg@gregorycollins.net> wrote:
...
Maybe but it would be helpful to rule the scenario out. Johan's ekg
library is also useful, it exports a webserver on a different port
that you
...
can use to track metrics like gc times, etc.
Other options for further debugging include gathering strace logs from
the binary. You'll have to do some data gathering to narrow down the
cause
unfortunately -- http client? your code? Snap server? GHC event manager
(System.timeout is implemented here)? GC? etc
G
On Wed, Apr 22, 2015 at 10:14 AM, Kostiantyn Rybnikov <k-bx@k-bx.com>
wrote:
...
Gregory,
Servers are far from being highly-overloaded, since they're currently
under a much less load they used to be. Memory consumption is stable
and
...
low, and there's a lot of free RAM also.
Would you say that given these factors this scenario is unlikely?
On Wed, Apr 22, 2015 at 7:56 PM, Gregory Collins
<greg@gregorycollins.net> wrote:
...
Given your gist, the timeout on your requests is set to a half-second
so it's conceivable that a highly-loaded server might have GC pause
times
...
approaching that long. Smells to me like a classic Haskell memory
leak
(that's why the problem occurs after the server has been up for a
while):
run your program with the heap profiler, and audit any shared
tables/IORefs/MVars to make sure you are not building up thunks
there.
Greg
On Wed, Apr 22, 2015 at 9:14 AM, Kostiantyn Rybnikov <k-bx@k-bx.com>
wrote:
>
> Hi!
>
> Our company's main commercial product is a Snap-based web app which
we
> compile with GHC 7.8.4. It works on four app-servers currently
load-balanced
> behind Haproxy.
>
> I recently implemented a new piece of functionality, which led to
> weird behavior which I have no idea how to debug, so I'm asking
here for
> help and ideas!
>
> The new functionality is this: on specific url-handler, we need to
> query n external services concurrently with a timeout, gather and
render
> results. Easy (in Haskell)!
>
> The implementation looks, as you might imagine, something like this
> (sorry for almost-real-haskell, I'm sure I forgot tons of imports
and other
> things, but I hope everything is clear as-is, if not -- I'll be
glad to
> update gist to make things more specific):
>
> https://gist.github.com/k-bx/0cf7035aaf1ad6306e76
>
> Now, this works wonderful for some time, and in logs I can see both,
> successful fetches of external-content, and also lots of timeouts
from our
> external providers. Life is good.
>
> But! After several days of work (sometimes a day, sometimes couple
> days), apps on all 4 servers go crazy. It might take some interval
(like 20
> minutes) before they're all crazy, so it's not super-synchronous.
Now: how
> crazy, exactly?
>
> First of all, this endpoint timeouts. Haproxy requests for a
response,
> and response times out, so they "hang".
>
> Secondly, logs are interesting. If you look at the code from gist
once
> again, you can see, that some of CandidateProvider's don't actually
require
> any networking work, so all they do is actually just logging that
they're
> working (I added this as part of debugging actually) and return
pure data.
> So what's weird is that they timeout also! Here's how output of our
logs
> starts to look like after the bug happens:
>
> ```
> [2015-04-22 09:56:20] provider: CandidateProvider1
> [2015-04-22 09:56:20] provider: CandidateProvider2
> [2015-04-22 09:56:21] Got timeout while requesting
CandidateProvider1
> [2015-04-22 09:56:21] Got timeout while requesting
CandidateProvider2
> [2015-04-22 09:56:22] provider: CandidateProvider1
> [2015-04-22 09:56:22] provider: CandidateProvider2
> [2015-04-22 09:56:23] Got timeout while requesting
CandidateProvider1
> [2015-04-22 09:56:23] Got timeout while requesting
CandidateProvider2
> ... and so on
> ```
>
> What's also weird is that, even after timeout is logged, the string
> ""Got responses!" never gets logged also! So hanging happens
somewhere
> in-between.
>
> I have to say I'm sorry that I don't have strace output now, I'll
have
> to wait until this situation happens once again, but I'll get later
to you
> with this info.
>
> So, how is this possible that almost-pure code gets timed-out? And
why
> does it hang afterwards?
>
> CPU and other resource usage is quite low, number of open
> file-descriptors also (it seems).
>
> Thanks for all the suggestions in advance!
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe@haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>
--
Gregory Collins <greg@gregorycollins.net>
--
Gregory Collins <greg@gregorycollins.net>
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe