Amazonka, conduit and sockets not closing

I've run into a problem with running out of file descriptors. The following snippet is a trimmed down version of what I'm doing:

#+begin_src haskell
main :: IO ()
main = do
  awsEnv <- newEnv Discover
  runAWSCond awsEnv $
    sqsSource queueUrl .| C.mapC snd .| sqsDeleteSink queueUrl
  where
    runAWSCond awsEnv =
      runResourceT . runAWS awsEnv . within Frankfurt . C.runConduit

sqsSource :: MonadAWS m => T.Text -> C.ConduitT () (T.Text, T.Text) m ()
sqsSource queueUrl = do
  (_, msgs) <- C.lift $ recvSQS queueUrl
  C.yieldMany msgs
  sqsSource queueUrl

sqsDeleteSink :: MonadAWS m => T.Text -> C.ConduitT T.Text o m ()
sqsDeleteSink queueUrl = do
  C.await >>= \case
    Nothing -> pure ()
    Just receiptHandle -> do
      void $ C.lift $ delSQS queueUrl receiptHandle
      sqsDeleteSink queueUrl

recvSQS queueUrl = do
  let rm = receiveMessage queueUrl & rmMaxNumberOfMessages ?~ 10
  rmrs <- send rm
  let status = rmrs ^. rmrsResponseStatus
      msgs = rmrs ^. rmrsMessages & traversed %~ extract
  pure (status, catMaybes msgs)
  where
    extract msg = do
      body <- msg ^. mBody
      rh <- msg ^. mReceiptHandle
      pure (body, rh)

delSQS queueUrl receiptHandle = do
  let dm = deleteMessage queueUrl receiptHandle
  send dm
#+end_src

This works fine for a while, but given a queue with enough messages it will fail with something like

#+begin_example
TransportError (HttpExceptionRequest Request {
  host                 = "sqs.eu-central-1.amazonaws.com"
  port                 = 443
  secure               = True
  requestHeaders       = [("Host","sqs.eu-central-1.amazonaws.com"),("X-Amz-Date","20201126T101659Z"),("X-Amz-Content-SHA256","2e4bdf20a857a1416f218b1218670cf019ff53268d0adb34fe06402a62f3271d"),("Content-Type","application/x-www-form-urlencoded; charset=utf-8"),("Authorization","<REDACTED>")]
  path                 = "/"
  queryString          = ""
  method               = "POST"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 0
  responseTimeout      = ResponseTimeoutMicro 70000000
  requestVersion       = HTTP/1.1
}
(ConnectionFailure Network.Socket.getAddrInfo (called with preferred socket type/protocol: AddrInfo {addrFlags = [AI_ADDRCONFIG], addrFamily = AF_UNSPEC, addrSocketType = Stream, addrProtocol = 0, addrAddress = <assumed to be undefined>, addrCanonName = <assumed to be undefined>}, host name: Just "sqs.eu-central-1.amazonaws.com", service name: Just "443"): does not exist (System error)))
#+end_example

After some detours I found out that it's actually not a network issue, but rather that the process runs out of file descriptors. Using =lsof= I can see that it doesn't seem to close /any/ sockets at all, instead they get stuck in a =CLOSE_WAIT= state:

#+begin_example
COMMAND    PID   USER  FD   TYPE DEVICE SIZE/OFF NODE NAME
wd-stats 88674 magnus  23u  IPv4 815196      0t0  TCP ip-192-168-0-9.eu-central-1.compute.internal:60624->52.119.188.213:https (CLOSE_WAIT)
wd-stats 88674 magnus  24u  IPv4 811362      0t0  TCP ip-192-168-0-9.eu-central-1.compute.internal:43482->52.119.189.184:https (CLOSE_WAIT)
wd-stats 88674 magnus  25u  IPv4 811386      0t0  TCP ip-192-168-0-9.eu-central-1.compute.internal:60628->52.119.188.213:https (CLOSE_WAIT)
wd-stats 88674 magnus  26u  IPv4 813527      0t0  TCP ip-192-168-0-9.eu-central-1.compute.internal:43486->52.119.189.184:https (CLOSE_WAIT)
...
#+end_example

Am I using Amazonka and/or Conduit in a way that results in this? How should I use them?

Or, is it an issue somewhere "below" my code? What can I do to address that?

Thanks for any insights or help.

/M

--
Magnus Therning              OpenPGP: 0x927912051716CE39
email: magnus@therning.org   twitter: magthe
http://magnus.therning.org/

Action is the foundational key to all success.
     — Pablo Picasso

I thought CLOSE_WAIT *is* one of the "closed" states. TCP sockets stick around for a few minutes after use, right? You may simply be generating sockets faster than the operating system can handle. Find some way to reuse existing sockets, perhaps?
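For what it's worth, a minimal sketch of the "reuse" idea with plain http-client and http-client-tls; this is only an illustration of connection pooling at the HTTP layer, not necessarily how amazonka manages its connections internally, and the SQS URL is just an example target:

#+begin_src haskell
import Network.HTTP.Client (httpLbs, newManager, parseRequest, responseStatus)
import Network.HTTP.Client.TLS (tlsManagerSettings)

main :: IO ()
main = do
  -- One Manager for the whole program: http-client keeps a pool of
  -- keep-alive connections per host, so repeated requests reuse an
  -- existing socket instead of opening a new one every time.
  mgr <- newManager tlsManagerSettings
  req <- parseRequest "https://sqs.eu-central-1.amazonaws.com/"
  r1 <- httpLbs req mgr
  r2 <- httpLbs req mgr  -- typically reuses the connection opened for r1
  print (responseStatus r1, responseStatus r2)
#+end_src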

Linux has kernel params you can tweak for socket reuse. Also look up SO_REUSEADDR for background.
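For background, this is roughly what setting SO_REUSEADDR looks like from Haskell with the network package. It is my own illustration on a listening socket (the port is arbitrary), and the option only helps with rebinding a local address that is stuck in TIME_WAIT; it does nothing for descriptors sitting in CLOSE_WAIT:

#+begin_src haskell
import Network.Socket

main :: IO ()
main = do
  -- Resolve a local address to listen on.
  addr:_ <- getAddrInfo
              (Just defaultHints { addrFlags = [AI_PASSIVE]
                                 , addrSocketType = Stream })
              Nothing
              (Just "8080")
  sock <- socket (addrFamily addr) (addrSocketType addr) (addrProtocol addr)
  setSocketOption sock ReuseAddr 1  -- SO_REUSEADDR: allow rebinding an address in TIME_WAIT
  bind sock (addrAddress addr)
  listen sock 5
  close sock
#+end_src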
On Nov 28, 2020, at 8:44 AM, Bryan Richter wrote:
I thought CLOSE_WAIT *is* one of the "closed" states. TCP sockets stick around for a few minutes after use, right? You may simply be generating sockets faster than the operating system can handle. Find some way to reuse existing sockets, perhaps?

On Thu, Nov 26, 2020 at 02:12:49PM +0100, Magnus Therning wrote:
After some detours I found out that it's actually not a network issue, but rather that the process runs out of file descriptors. Using =lsof= I can see that it doesn't seem to close /any/ sockets at all, instead they get stuck in a =CLOSE_WAIT= state:

#+begin_example
COMMAND    PID   USER  FD   TYPE DEVICE SIZE/OFF NODE NAME
wd-stats 88674 magnus  23u  IPv4 815196      0t0  TCP ip-192-168-0-9.eu-central-1.compute.internal:60624->52.119.188.213:https (CLOSE_WAIT)
wd-stats 88674 magnus  24u  IPv4 811362      0t0  TCP ip-192-168-0-9.eu-central-1.compute.internal:43482->52.119.189.184:https (CLOSE_WAIT)
wd-stats 88674 magnus  25u  IPv4 811386      0t0  TCP ip-192-168-0-9.eu-central-1.compute.internal:60628->52.119.188.213:https (CLOSE_WAIT)
wd-stats 88674 magnus  26u  IPv4 813527      0t0  TCP ip-192-168-0-9.eu-central-1.compute.internal:43486->52.119.189.184:https (CLOSE_WAIT)
...
#+end_example
How many such still open file descriptors did you find? (If you run "lsof -n -P -i tcp -a -p $pid", it'll produce the output faster, reporting only sockets.)

Contrary to other replies, the sockets above are indeed NOT closed in your process, otherwise they'd not be associated with a file descriptor and would just show up in "netstat", but not in "lsof" output.

I don't know what happens inside Amazonka, but typically clients doing many concurrent HTTPS calls employ a TlsManager that maintains a connection pool, which avoids opening too many concurrent connections but also keeps a limited number of connections open for further requests. How many still open connections did you find?

I don't know whether the TlsManager aggregates connections by name or by IP address. If the latter, perhaps (very speculatively, without looking at the underlying code, ...) Amazon's IP address is changing quickly (short or 0 TTL), breaking the connection pool's per-destination connection limits. This is a wild guess, more evidence is needed to make it actually plausible or rule it out.

--
    Viktor.
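Very speculatively, wiring one explicit shared TLS manager into the environment might look something like the sketch below. The newManager/tlsManagerSettings calls come from http-client/http-client-tls, while the envManager lens and newEnv Discover are my recollection of the amazonka 1.6 API, so treat the exact names as assumptions rather than a confirmed fix:

#+begin_src haskell
import Control.Lens ((&), (.~))
import Network.AWS (Credentials (Discover), envManager, newEnv)
import Network.HTTP.Client (newManager)
import Network.HTTP.Client.TLS (tlsManagerSettings)

main :: IO ()
main = do
  -- One TLS-capable Manager for the whole program; http-client pools
  -- keep-alive connections per destination and closes idle ones.
  mgr <- newManager tlsManagerSettings
  env <- newEnv Discover
  -- Point the amazonka Env at the shared manager (envManager is the
  -- HasEnv lens; name assumed from amazonka 1.6).
  let env' = env & envManager .~ mgr
  -- ... run the SQS conduit pipeline against env' instead of env ...
  pure ()
#+end_src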

Viktor Dukhovni writes:
How many such still open file descriptors did you find?
Hundreds of them.
(If you run "lsof -n -P -i tcp -a -p $pid", it'll produce the output faster, reporting only sockets).
Contrary to other replies, indeed the sockets above are NOT closed in your process, otherwise they'd not be associated with a file descriptor and would just show up in "netstat", but not "lsof" output.
I don't know what happens inside Amazonka, but typically clients doing many concurrent HTTPS calls employ a TlsManager that maintains a connection pool, and would avoid opening too many concurrent connections, but would also keep a limited number of connections open for more requests.
After I reported it to the Amazonka project[1] I found out that it most likely is a known issue[2]. I have yet to confirm that the fix for [2] solves the issue I'm seeing.

[1]: https://github.com/brendanhay/amazonka/issues/608
[2]: https://github.com/brendanhay/amazonka/issues/490

/M

--
Magnus Therning              OpenPGP: 0x927912051716CE39
email: magnus@therning.org   twitter: magthe
http://magnus.therning.org/

I am always doing that which I cannot do, in order that I may learn
how to do it.
     — Pablo Picasso
participants (4)

- Bryan Richter
- Magnus Therning
- Viktor Dukhovni
- Will Yager