thread/socket behavior

We have a server that accepts messages over a socket, spawning threads to process them. Processing these messages may cause other, outgoing connections to be spawned. Under sufficient load, the main server loop (i.e. the call to accept, followed by a forkIO) becomes unresponsive.

A smaller, distilled test case reveals that when sufficient socket activity is occurring, an incoming connection may not be responded to until other connections have been cleared out of the way, despite the fact that these other connections are being handled by separate threads. One issue we've been trying to figure out is where this behavior arises from: the GHC RTS, the network library, or the underlying C libraries.

Have other GHC users with applications doing large amounts of socket usage observed similar behavior and managed to trace back where it originates from? Are there any particular architectural solutions that people have found to work well for these situations?

thanks,
Jeff
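(As an aside, here is a minimal sketch of the kind of accept/forkIO loop being described, written against the Handle-based Network module API of that era. The port number and the body of handleMessage are placeholders, not the actual server code:)

    import Control.Concurrent (forkIO)
    import Control.Monad (forever)
    import Network (PortID (PortNumber), accept, listenOn, withSocketsDo)
    import System.IO

    main :: IO ()
    main = withSocketsDo $ do
        sock <- listenOn (PortNumber 4000)
        forever $ do
            -- block until the next incoming connection arrives ...
            (h, _host, _port) <- accept sock
            -- ... then hand it off to its own thread and loop immediately
            forkIO (handleMessage h)

    handleMessage :: Handle -> IO ()
    handleMessage h = do
        hSetBuffering h LineBuffering
        msg <- hGetLine h
        -- processing a message may open further outgoing connections here
        hPutStrLn h ("ack: " ++ msg)
        hClose h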

jeff.polakow:
[...]
Hey Jeff,

Can you say which GHC you used, and whether you used the threaded runtime or non-threaded runtime?

-- Don

Don Stewart wrote:
[...]
Oops, forgot about that...

We used both ghc-6.8.3 and ghc-6.10.rc1, and we used the threaded runtime. We are running on a 64-bit Linux machine using openSUSE 10.

thanks,
Jeff

Jeff Polakow wrote:
[...]
The scheduler doesn't have a concept of priorities, so the accepting thread will get the same share of the CPU as the other threads. Another issue is that the accepting thread has to be woken up by the IO manager thread when a new connection is available, so we might have to wait for the IO manager thread to run too. But I wouldn't expect to see overly long delays. Maybe you could try network-alt, which does its own IO multiplexing.

If you have multiple cores, you might want to try fixing the thread affinity - e.g. put all the worker threads on one core, and the accepting thread on the other core. You can do this using GHC.Conc.forkOnIO, with the +RTS -qm -qw options.

Other than that, I'm not sure what to try right now. We're hoping to get some better profiling for parallel/concurrent programs in the future, but it's not ready yet.

Cheers,
Simon
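(For what it's worth, a rough sketch of the affinity idea Simon describes, assuming GHC.Conc.forkOnIO as it existed in the 6.8/6.10 era; the port number and the worker body are placeholders. Build with -threaded and run with something like ./server +RTS -N2 -qm -qw:)

    import Control.Concurrent (newEmptyMVar, takeMVar)
    import Control.Monad (forever)
    import GHC.Conc (forkOnIO)
    import Network (PortID (PortNumber), accept, listenOn, withSocketsDo)
    import System.IO

    main :: IO ()
    main = withSocketsDo $ do
        sock <- listenOn (PortNumber 4000)
        -- keep the accepting loop on capability 0 ...
        _ <- forkOnIO 0 $ forever $ do
            (h, _host, _port) <- accept sock
            -- ... and put the per-connection workers on capability 1
            forkOnIO 1 (worker h)
        -- block the main thread forever
        takeMVar =<< newEmptyMVar

    worker :: Handle -> IO ()
    worker h = do
        msg <- hGetLine h
        hPutStrLn h ("ack: " ++ msg)
        hClose h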

Hello,
Just writing to let people know the resolution of this problem...
After much frustration and toil, we realized there was a bug in GHC's handle abstraction over sockets.

We resolved our immediate problem by having our code deal directly with the sockets, and we filed a bug report, #2703, which has just been (partially) fixed by Simon Marlow.
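(Roughly, "dealing directly with the sockets" means something along these lines, reading and writing the Socket itself via Network.Socket.ByteString rather than going through socketToHandle and the Handle layer. This is only an illustrative sketch against the network package of that era, hence bindSocket/sClose, and is not the code from the actual server:)

    import Control.Concurrent (forkIO)
    import Control.Monad (forever)
    import qualified Data.ByteString.Char8 as B
    import Network.Socket hiding (recv)
    import Network.Socket.ByteString (recv, sendAll)

    main :: IO ()
    main = withSocketsDo $ do
        listener <- socket AF_INET Stream defaultProtocol
        setSocketOption listener ReuseAddr 1
        bindSocket listener (SockAddrInet 4000 iNADDR_ANY)
        listen listener 128
        forever $ do
            (conn, _peer) <- accept listener
            forkIO $ do
                -- read and write the raw socket; no Handle in between
                msg <- recv conn 4096
                sendAll conn (B.append (B.pack "ack: ") msg)
                sClose conn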
thanks,
Jeff
Simon Marlow wrote:
[...]

I'll be interested to know if the fix helps your application. The bug reported in #2703 results in the program just allocating memory endlessly until it dies, so it doesn't sound exactly like the symptoms you were originally describing.

Cheers,
Simon

Jeff Polakow wrote:
[...]

Hello,
Simon Marlow wrote:
[...]
We are currently using GHC-6.8.3, so we can't try the fixed version. We'll switch to 6.10 after it becomes the stable release and (hopefully) minimal work needs to be done to get everything to compile.

This bug actually perfectly explains the behavior we saw in our full system. The distilled test case we reported was based on our then-current theory that too many connections were the cause of our problem. We're pretty sure the behavior we described was real, but, as subsequent testing revealed, it was not the cause of our problem. After delving deeper, we realized that the real culprit was too much data over one connection.

-Jeff
participants (3)
- Don Stewart
- Jeff Polakow
- Simon Marlow