Timeout on pure code

Hi! Our company's main commercial product is a Snap-based web app which we compile with GHC 7.8.4. It runs on four app servers currently load-balanced behind Haproxy. I recently implemented a new piece of functionality, which led to weird behavior that I have no idea how to debug, so I'm asking here for help and ideas!

The new functionality is this: on a specific URL handler, we need to query n external services concurrently with a timeout, then gather and render the results. Easy (in Haskell)! The implementation looks, as you might imagine, something like this (sorry for the almost-real Haskell -- I'm sure I forgot tons of imports and other things, but I hope everything is clear as-is; if not, I'll be glad to update the gist to make things more specific): https://gist.github.com/k-bx/0cf7035aaf1ad6306e76

Now, this works wonderfully for some time, and in the logs I can see both successful fetches of external content and lots of timeouts from our external providers. Life is good. But! After several days of work (sometimes a day, sometimes a couple of days), the apps on all 4 servers go crazy. It might take some interval (like 20 minutes) before they're all crazy, so it's not perfectly synchronous. Now: how crazy, exactly?

First of all, this endpoint times out. Haproxy asks for a response, the response times out, and so requests "hang".

Secondly, the logs are interesting. If you look at the code in the gist once again, you can see that some of the CandidateProviders don't actually require any networking work, so all they do is log that they're working (I added this as part of debugging, actually) and return pure data. What's weird is that they time out too! Here's how our log output starts to look after the bug happens:

```
[2015-04-22 09:56:20] provider: CandidateProvider1
[2015-04-22 09:56:20] provider: CandidateProvider2
[2015-04-22 09:56:21] Got timeout while requesting CandidateProvider1
[2015-04-22 09:56:21] Got timeout while requesting CandidateProvider2
[2015-04-22 09:56:22] provider: CandidateProvider1
[2015-04-22 09:56:22] provider: CandidateProvider2
[2015-04-22 09:56:23] Got timeout while requesting CandidateProvider1
[2015-04-22 09:56:23] Got timeout while requesting CandidateProvider2
... and so on
```

What's also weird is that even after the timeout is logged, the string "Got responses!" never gets logged either! So the hang happens somewhere in between.

I'm sorry that I don't have strace output right now; I'll have to wait until this situation happens again, and then I'll get back to you with that info.

So, how is it possible that almost-pure code times out? And why does it hang afterwards? CPU and other resource usage is quite low, and so is the number of open file descriptors (it seems).

Thanks for all the suggestions in advance!
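P.S. For those who don't want to open the gist, the handler is roughly this shape (a simplified sketch: the names here are placeholders and putStrLn stands in for the real logging calls; the actual code is in the gist above):

```
import Control.Concurrent.Async (mapConcurrently)
import System.Timeout (timeout)

-- Placeholder names: the real types and HTTP calls live in the gist.
data CandidateProvider = CandidateProvider1 | CandidateProvider2
  deriving (Show)

-- Some providers do real network requests, others just log and return
-- pure data; all of them go through the same half-second timeout.
queryProvider :: CandidateProvider -> IO String
queryProvider p = return ("stub response from " ++ show p)

fetchOne :: CandidateProvider -> IO (Maybe String)
fetchOne p = do
  putStrLn ("provider: " ++ show p)
  mres <- timeout 500000 (queryProvider p)  -- 0.5 s budget per provider
  case mres of
    Nothing -> do
      putStrLn ("Got timeout while requesting " ++ show p)
      return Nothing
    Just r -> return (Just r)

main :: IO ()
main = do
  results <- mapConcurrently fetchOne [CandidateProvider1, CandidateProvider2]
  putStrLn "Got responses!"
  print results  -- stands in for rendering the response in the Snap handler
```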

Given your gist, the timeout on your requests is set to half a second, so
it's conceivable that a highly loaded server might have GC pause times
approaching that long. Smells to me like a classic Haskell memory leak
(that's why the problem occurs after the server has been up for a while):
run your program with the heap profiler, and audit any shared
tables/IORefs/MVars to make sure you are not building up thunks there.
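As a toy illustration of the kind of thunk build-up I mean (not your code, just the pattern to look for): a shared counter bumped lazily accumulates a chain of suspended additions, while the strict variant keeps the value flat.

```
import Control.Monad (replicateM_)
import Data.IORef (IORef, atomicModifyIORef, atomicModifyIORef', newIORef, readIORef)

-- Lazy: each call stores an unevaluated (n + 1) thunk; under load the chain
-- of suspended additions keeps growing until something forces the value.
bumpLazy :: IORef Int -> IO ()
bumpLazy ref = atomicModifyIORef ref (\n -> (n + 1, ()))

-- Strict: the new value is forced before it is written back, so the
-- reference always holds a plain evaluated Int.
bumpStrict :: IORef Int -> IO ()
bumpStrict ref = atomicModifyIORef' ref (\n -> (n + 1, ()))

main :: IO ()
main = do
  ref <- newIORef 0
  replicateM_ 1000000 (bumpLazy ref)  -- builds a million-deep thunk chain
  readIORef ref >>= print             -- all the deferred work is forced here
```

A heap profile makes this kind of growth show up as a steadily climbing band of thunks.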
Greg
--
Gregory Collins

Gregory,
The servers are far from highly loaded, since they're currently under much less load than they used to be. Memory consumption is stable and low, and there's also a lot of free RAM.
Would you say that, given these factors, this scenario is unlikely?

Maybe, but it would be helpful to rule the scenario out. Johan's ekg library is also useful: it runs a web server on a separate port that you can use to track metrics like GC times, etc.
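Wiring ekg in is only a couple of lines -- something like this sketch (host and port are arbitrary, and the GC numbers only show up if the binary runs with +RTS -T):

```
{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import System.Remote.Monitoring (forkServer)

main :: IO ()
main = do
  -- Serves RTS/GC metrics as HTML and JSON on http://localhost:8000
  _ <- forkServer "localhost" 8000
  forever (threadDelay 1000000)  -- stand-in for the real Snap server loop
```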
Other options for further debugging include gathering strace logs from the binary. You'll have to do some data gathering to narrow down the cause, unfortunately -- the HTTP client? your code? the Snap server? the GHC event manager (System.Timeout is implemented on top of it)? the GC? etc.
G
--
Gregory Collins

All right, good news!
After adding ekg, gathering its data via Bosun, and seeing nothing useful, I figured out that I could try harder to reproduce the issue myself instead of waiting for users to do it. And I succeeded! :)
After launching 20 infinite curl loops against that handler's URL I was quickly able to reproduce the issue, so the task seems clear now: keep reducing the code and reproduce it locally, possibly without the external services. I'll write up once I get somewhere.
Thanks.

An update for everyone interested (and not): it turned out to be neither the GHC RTS, Snap, nor a networking issue -- it's hslogger being very slow. I thought it was only slow when used concurrently, but I just ran a test that writes 2000 5 kB messages sequentially, and that takes 111 seconds (while a minimal program that writes the same 2000 messages finishes in 0.12 s).
I hope I'll have a chance to investigate why hslogger is so slow in the future, but meanwhile I'll just remove the logging.
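For reference, the test is roughly this shape (a simplified sketch of what I measured, run against the default hslogger setup; my real configuration differs a bit and I can share the exact code separately):

```
import Control.Monad (replicateM_)
import System.Log.Logger
  (Priority (INFO), infoM, rootLoggerName, setLevel, updateGlobalLogger)

main :: IO ()
main = do
  updateGlobalLogger rootLoggerName (setLevel INFO)
  let msg = replicate 5000 'x'          -- ~5 kB payload
  replicateM_ 2000 (infoM "bench" msg)  -- 2000 sequential log calls
```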

Well, I do now :) I'll also check out the new but interesting haskell-logger [0]. I just think it would be a good idea to investigate hslogger, since it seems a lot of people are already using it, so an update that improves their performance with no additional work on their part would make them happy.
[0]: https://github.com/wdanilo/haskell-logger
On Fri, Apr 24, 2015 at 10:29 PM, Mikhail Glushenkov <the.dead.shall.rise@gmail.com> wrote:
Hi,
Have you considered using fast-logger?
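Its buffered API is only a couple of calls -- roughly this sketch:

```
import System.Log.FastLogger
  (defaultBufSize, newStdoutLoggerSet, pushLogStrLn, rmLoggerSet, toLogStr)

main :: IO ()
main = do
  -- Log lines go into an in-memory buffer and are written out in batches.
  logger <- newStdoutLoggerSet defaultBufSize
  pushLogStrLn logger (toLogStr "hello from fast-logger")
  rmLoggerSet logger  -- flushes remaining output and releases the buffer
```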

Hi,
I've had a look at this as we use hslogger too, so I'm keen to avoid
this kind of performance issue. I threw a quick Criterion benchmark
together:
https://gist.github.com/DaveCTurner/f977123b4498c4c64569
The headline result on my test machine is that each log call takes
~540us, so 2000 should take about a second. I'd be interested if you
could run the same benchmark on your setup, as it's possible that
something else downstream is causing your problem.
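The core of the benchmark is roughly this (a trimmed-down sketch; the message size and the handler setup here are placeholders, and the gist above also covers the /dev/log and UDP syslog variants):

```
import Criterion.Main (bench, defaultMain, whnfIO)
import System.Log.Logger
  (Priority (INFO), infoM, rootLoggerName, setLevel, updateGlobalLogger)

main :: IO ()
main = do
  updateGlobalLogger rootLoggerName (setLevel INFO)  -- handler setup elided
  defaultMain
    [ -- one sample = one log call through whatever handlers are configured
      bench "hslogger infoM" (whnfIO (infoM "bench" (replicate 5000 'x')))
    ]
```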
A couple of things that might be worth bearing in mind: if you're
talking to syslog over /dev/log then that can block if the log daemon
falls behind: unix datagram sockets don't drop datagrams when they're
congested. If the /dev/log test is slow but the UDP test is fast then
it could be that your syslog can't handle the load.
I'm using rsyslogd and have enabled the feature that combines
identical messages, so this test doesn't generate much disk IO and it
keeps up easily, so the UDP and /dev/log tests run about equally fast
for me. Is your syslog writing out every message? It may be flushing
to disk after every message too, which would be terribly slow.
If you're not logging to syslog, what's your hslogger config?
Cheers,
David

Hi David.
I planned to create a detailed bug report in hslogger's issue tracker on Monday to start the investigation there (as a better place), but since I have the code prepared already, it's easy to share right now: https://gist.github.com/k-bx/ccf6fd1c73680c8a4345
I'm launching it as:
time ./dist/build/seq/seq &> /dev/null
We don't use the syslog driver; instead we have a separate file-to-syslog worker to decouple those components.

I see.
The issue seems to be the default handler which writes the log to
stderr. Replacing 'addHandler h' with 'setHandlers [h]' makes it run
in a reasonable time as 'setHandlers' starts afresh; conversely it's
still slow even if you just use the default handler on its own, i.e.
removing the call to 'addHandler'.
I was a bit suspicious about the 'hFlush' in System.Log.Handler.Simple
but removing that didn't help. However, suspecting that the issue may
be too-much-flushing I eventually found
https://ghc.haskell.org/trac/ghc/ticket/7418 which says that writing
to stderr is slow because its default buffering mode is NoBuffering. I
added 'hSetBuffering stderr LineBuffering' and boom, it runs at a
sensible speed.
Recommend you get rid of writing to stderr and just log to the file
unless you've a good reason to send the output both ways, in which
case switch to LineBuffering as above.
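In code, the two options look roughly like this (a sketch; the file path is arbitrary):

```
import System.IO (BufferMode (LineBuffering), hSetBuffering, stderr)
import System.Log.Handler.Simple (fileHandler)
import System.Log.Logger
  (Priority (INFO), infoM, rootLoggerName, setHandlers, setLevel, updateGlobalLogger)

main :: IO ()
main = do
  -- Either make the default stderr handler cheap to write to...
  hSetBuffering stderr LineBuffering
  -- ...or replace it entirely with a single file handler via setHandlers
  -- (addHandler would keep the slow default stderr handler around as well).
  h <- fileHandler "app.log" INFO
  updateGlobalLogger rootLoggerName (setHandlers [h] . setLevel INFO)
  infoM "Main" "logging now goes to app.log only; stderr is out of the path"
```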
If you're using rsyslog you may also be interested in reading
http://www.rsyslog.com/doc/queues.html - this describes how it can
decouple the final log output from the input using a combination of
in-memory and on-disk buffers, and even discard lower-priority
messages from the queue if the going gets really tough. Implementing
our own on-disk spool looked like a lot of effort, and we were
particularly worried about it growing without bound if the downstream
was too slow, so this feature of rsyslog was just what we needed.
Hope that helps,
David

David, thank you very much. I can confirm everything you wrote: both switching to LineBuffering and removing the output to stderr resolve the problem.
I will create a ticket for hslogger proposing that it set LineBuffering on stderr when logging there is enabled. That should help people like me who don't know they have the problem but may suffer from it from time to time (at the very least, every write to stderr slows response time down quite significantly).
Thanks for the rsyslog "queues" article, I'll take a look. When I was given the task of centralizing logs I didn't want to add rsyslog as one more required service, and I also had the impression that current syslog implementations are somewhat outdated, with multiple badly supported forks, so I did the minimal work of forwarding logs from files to a central rsyslog in a decoupled way. But maybe it's time to migrate everything to rsyslog and get rid of the forwarder services.
Anyway, thank you once again.

You might already have considered it, but the GHC eventlog together with Threadscope and ghc-events-analyze (http://www.well-typed.com/blog/2014/02/ghc-events-analyze/) can be very helpful to debug such issues.
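For example, wrapping each provider call in START/STOP user events lets ghc-events-analyze attribute wall-clock time per provider -- a sketch (compile with -eventlog and run with +RTS -l, as the blog post describes):

```
import Control.Exception (bracket_)
import Debug.Trace (traceEventIO)

-- ghc-events-analyze groups the time between matching "START x" / "STOP x"
-- user events under the label x.
labelled :: String -> IO a -> IO a
labelled name =
  bracket_ (traceEventIO ("START " ++ name))
           (traceEventIO ("STOP " ++ name))

main :: IO ()
main = labelled "CandidateProvider1" (putStrLn "stand-in for the provider call")
```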

Niklas,
This seems like a very helpful tool indeed, but I'm not completely sure how it could help with this specific problem. Is there anything specific you would suggest measuring with it, or do you consider it more of a generally nice tool for approaching the problem?
Thanks!

I meant it in the general sense of approaching the problem. The linked blog post takes as its example a web server that doesn't respond well after some amount of time, and I associated that with your problem description.
participants (5)
- David Turner
- Gregory Collins
- Kostiantyn Rybnikov
- Mikhail Glushenkov
- Niklas Hambüchen