Exceeding OS limits for simultaneous socket connections

Hi,
I'm experiencing the "accept: resource exhausted (Too many open
files)" exception when trying to use sockets in my Haskell program.
The situation:
- Around a dozen Linux machines running my Haskell program, transmitting thousands of messages to each other, sometimes within a small period of time.
- I'm using the Network.Socket.ByteString.Lazy module to send and receive lazy bytestrings.
- The socket passed to getMsg is bound and listening, and is not closed until the program exits. getMsg is called in a loop to receive lazy bytestrings from remote nodes. The socket is initialized with:

  sock <- Network.Socket.socket (addrFamily myAddr) Stream defaultProtocol
  bindSocket sock (addrAddress myAddr)
  listen sock 10
Here's the code:
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Exception (IOException, try)
import qualified Data.ByteString.Lazy as Lazy
import Network.Socket
import Network.Socket.ByteString.Lazy (getContents, sendAll)
import Prelude hiding (getContents)

sendMsg :: Maybe HostName -> Int -> Lazy.ByteString -> IO ()
sendMsg dest port msg = do
  result <- try $ withSocketsDo $ do
    addrinfos <- getAddrInfo Nothing dest (Just (show port))
    let serveraddr = head addrinfos
    sock <- socket (addrFamily serveraddr) Stream defaultProtocol
    connect sock (addrAddress serveraddr)
    sendAll sock msg
    sClose sock  -- note: never reached if connect or sendAll throws
  case result of
    Left (_ :: IOException) -> return ()  -- permit send failure
    Right _ -> return ()

getMsg :: Socket -> IO Lazy.ByteString
getMsg sock = do
  result <- try $ withSocketsDo $ do
    (conn, _addr) <- accept sock
    getContents conn  -- lazy: conn stays open until the result is fully read
  case result of
    Left (ex :: IOException) -> print ex >> getMsg sock
    Right msg -> return msg
The current topology is a master/slave setup. For some programs that
use these functions above, `sendMsg' is called thousands of times in
quick succession on the remote nodes, where the destination of the
`sendAll' function is the master node. Here's the maximum number of
simultaneous sockets I am permitted to have open on my Linux machines:
$ ulimit -n
1024
Indeed, when I experience the "accept: resource exhausted (Too many
open files)" exception, I check the number of open sockets, which
exceeds 1024, by looking at the contents of the process's fd directory:

$ ls -lah /proc/<pid>/fd

What you can try:

- Reduce the number of threads running at the same time, thus accepting fewer connections. AFAIK the OS keeps a backlog caching connection requests for a while, so this may just work. Using STM it would be trivial to implement: try incrementing a var; if it is > 100, retry. It will only be retried if the var changes or such, correct? When you're done, decrease the number. (A sketch follows below.)
- Increase the limit (you said this is not an option).
- Replace getContents conn by something strict and close the handle yourself? (Not sure about this.) E.g. yesod introduced conduits for that reason => http://www.yesodweb.com/blog/2011/12/conduits There are alternative implementations on hackage.
- Not sure how many apps are running at the same time, but instead of creating many connections from machine A to B you could try establishing a permanent connection sending binary streams, or "chunk" the messages: e.g. wait for 5 requests, then bundle them and send them all at once (depends on your implementation whether this could be an option). (Also sketched below.)

That's all that comes to my mind. Probably more experienced users have additional ideas, so keep waiting and reading.

Marc Weber
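For what it's worth, a rough sketch of the STM counter described above; the names ConnSlots/withConnSlot and the limit of 100 are illustrative, not from any existing library:

import Control.Concurrent.STM
import Control.Exception (bracket_)

-- Counter of in-flight connections, capped at a fixed limit.
newtype ConnSlots = ConnSlots (TVar Int)

newConnSlots :: IO ConnSlots
newConnSlots = fmap ConnSlots (newTVarIO 0)

-- Block (via STM retry) until a slot is free, run the action,
-- then release the slot, even if the action throws.
withConnSlot :: ConnSlots -> IO a -> IO a
withConnSlot (ConnSlots v) = bracket_ acquire release
  where
    acquire = atomically $ do
      n <- readTVar v
      if n >= 100 then retry else writeTVar v (n + 1)
    release = atomically $ modifyTVar' v (subtract 1)

Each call to sendMsg (or each accepted connection) would then run inside withConnSlot, so at most 100 sockets are in flight at once; blocked callers are woken automatically when the TVar changes.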

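And a rough sketch of the last suggestion: one long-lived connection carrying length-prefixed messages instead of one connection per message. sendLoop and the Int64 framing via the binary package's encode are illustrative assumptions, not from the thread:

import Data.Binary (encode)
import qualified Data.ByteString.Lazy as Lazy
import Network.Socket
import Network.Socket.ByteString.Lazy (sendAll)

-- Open a single connection and send every message over it,
-- prefixing each with its encoded Int64 length so the receiver
-- can split the stream back into messages.
sendLoop :: HostName -> Int -> [Lazy.ByteString] -> IO ()
sendLoop host port msgs = withSocketsDo $ do
  addrinfos <- getAddrInfo Nothing (Just host) (Just (show port))
  let addr = head addrinfos
  sock <- socket (addrFamily addr) Stream defaultProtocol
  connect sock (addrAddress addr)
  mapM_ (\m -> sendAll sock (encode (Lazy.length m) `Lazy.append` m)) msgs
  sClose sock

The receiver would read the 8-byte prefix, decode the length, and then read exactly that many bytes per message.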
Quoth Marc Weber
- replace getContents conn by something strict and close the handle yourself? (not sure about this.)
That's an easy one to try, right? Just force evaluation of the getContents return value in getMsg (a sketch follows below). If lazy I/O is the culprit here, it wouldn't be the first time. But that will only reduce the magnitude of the problem (albeit perhaps vastly), and as you say, the more reliable solution is to limit the number of concurrent accepts and concurrent open-and-sends. It isn't as if socket I/O really benefits from unlimited concurrency, since all the data is still pushed serially through a wire; you just need enough concurrent peer connections to keep the interface busy. Donn
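A rough sketch of that forcing, using the same Network.Socket.ByteString.Lazy API as the original code; getMsgStrict is an illustrative name, and the explicit sClose is an addition (the original never closes the accepted socket):

import Control.Exception (evaluate)
import qualified Data.ByteString.Lazy as Lazy
import Network.Socket (Socket, accept, sClose)
import Network.Socket.ByteString.Lazy (getContents)
import Prelude hiding (getContents)

-- Accept one connection, read it to the end, and force the whole
-- bytestring before returning, so the descriptor can be closed at once.
getMsgStrict :: Socket -> IO Lazy.ByteString
getMsgStrict sock = do
  (conn, _addr) <- accept sock
  msg <- getContents conn
  _ <- evaluate (Lazy.length msg)  -- walks every chunk: all data is read now
  sClose conn                      -- fd released immediately (error handling omitted)
  return msg

Lazy.length traverses every chunk, so by the time it returns the whole message has been pulled off the wire and the descriptor no longer needs to stay open.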

On 30 January 2012 14:22, Rob Stewart wrote:
Hi,
I'm experiencing the "accept: resource exhausted (Too many open files)" exception when trying to use sockets in my Haskell program.
The situation: - Around a dozen Linux machines running my Haskell program, transmitting thousands of messages to each other, sometimes within a small period of time.
...
$ ulimit -n
1024
This is not an OS limit; it is your freely chosen limit. You should not run with this few file descriptors on a server. Increasing it by 50x is entirely reasonable. However, having too many open TCP connections is not a good thing either. 1024 was an upper limit way back on the i386 Linux architecture for code using the select() system call, which is why it is still a common default. There are a few ways to get out of this situation:

1. Reuse your TCP connections. Maybe you could even use HTTP; an HTTP library might do connection reuse for you.

2. Since you are blocking in getContents, there is a probability that it is the senders that are being lazy in sendAll. They opened the TCP connection, but now they are not sending everything in sendAll, so your receiver has lots of threads that are blocked on reading. Try to be strict when *sending* so you do not have too many ongoing TCP connections.

3. On the receiver side, to be robust, you could limit the number of threads that are allowed to do an accept() to the number of file descriptors you have free. You can also block on a semaphore whenever accept returns out of resources, and signal that semaphore after every close. (A sketch follows below.)

Alexander
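A rough sketch of suggestion 3, using a QSem to cap the number of accepted sockets open at once; acceptLoop, the handler type, and the maxConns parameter are illustrative assumptions:

import Control.Concurrent (forkIO)
import Control.Concurrent.QSem
import Control.Exception (finally)
import Control.Monad (forever)
import Network.Socket (Socket, accept, sClose)

-- Accept loop that never holds more than maxConns accepted sockets
-- open: take a unit from the semaphore before accept, and give it
-- back only after the per-connection handler has run and the
-- socket is closed.
acceptLoop :: Int -> Socket -> (Socket -> IO ()) -> IO ()
acceptLoop maxConns listenSock handler = do
  sem <- newQSem maxConns
  forever $ do
    waitQSem sem                 -- blocks while maxConns sockets are open
    (conn, _addr) <- accept listenSock
    _ <- forkIO $
      handler conn `finally` (sClose conn >> signalQSem sem)
    return ()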
open files)" exception, I check the number of open sockets, which exceeds 1024, by looking at the contents of the directory: ls -lah /proc/
/fd It is within the getContents function that, once the lazy bytestring is fully received, the socket is shutdown http://goo.gl/B6XcV : shutdown sock ShutdownReceive
There seems to be no way of limiting the number of permitted connection requests from remote nodes. What I am perhaps looking for is a mailbox implementation on top of sockets, or some other way to avoid this error. I am looking to scale up to hundreds of nodes, where the chance of more than 1024 simultaneous socket connections to one node is increased. Merely increasing the ulimit feels like a temporary measure.

Part of the dilemma is that the `connect' call in `sendMsg' does not throw an error, despite the fact that it does indeed cause an error on the receiving node, by pushing the number of open connections to the same socket on the master node beyond the 1024 limit permitted by the OS.

Am I missing something? One would have thought such a problem occurs frequently with Haskell web servers and the like...?
-- Rob Stewart
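A rough sketch of a mailbox of the kind described above: one thread owns the listening socket, drains each connection fully, closes it, and queues the message on a Chan, so consumers never touch sockets. newMailbox and recvMsg are illustrative names; the strict drain reuses the evaluate trick from earlier in the thread:

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Exception (evaluate)
import Control.Monad (forever)
import qualified Data.ByteString.Lazy as Lazy
import Network.Socket (Socket, accept, sClose)
import Network.Socket.ByteString.Lazy (getContents)
import Prelude hiding (getContents)

-- A "mailbox on top of sockets": the accept loop is the only code
-- that sees sockets; received messages are queued in memory.
newMailbox :: Socket -> IO (Chan Lazy.ByteString)
newMailbox listenSock = do
  mbox <- newChan
  _ <- forkIO $ forever $ do
    (conn, _addr) <- accept listenSock
    msg <- getContents conn
    _ <- evaluate (Lazy.length msg)  -- read the whole message now
    sClose conn                      -- release the fd before queueing
    writeChan mbox msg
  return mbox

-- Receiving becomes a plain, socket-free read:
recvMsg :: Chan Lazy.ByteString -> IO Lazy.ByteString
recvMsg = readChan

Because the loop drains one connection at a time, at most one accepted descriptor is open at any moment; for parallel drains it could be combined with the QSem sketch above.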
participants (5): Alexander Kjeldaas, Donn Cave, Marc Weber, Matthew Farkas-Dyck, Rob Stewart