Memory consumption issues under heavy network throughput/concurrency loads

I have been testing solutions in several languages for running a network daemon that accepts hundreds of thousands of websocket connections and moves messages between them. Originally I used this websocket lib, http://jaspervdj.be/websockets/, but upon discovering a rather severe memory leak even when sending just a basic ping, I switched to Michael Snoyman's first stab at a websocket impl for Yesod here: https://github.com/yesodweb/yesod/commit/66437453f57e6a2747ff7c2199aa7ad25db....

Under mild load (5k connections, each pinging once every 10+ seconds), memory usage remained stable and acceptable at around 205 MB. Switching to higher load (pinging every second or less), memory usage spiked to 560 MB and continued to 'leak' slowly. When I quit the server with profiling diagnostics enabled, they indicated that hundreds of MB were "lost due to fragmentation". In addition, merely opening the connections and dropping them, repeatedly, made the base memory usage go up. Somewhere, memory is not being fully reclaimed when the connections are dropped.

For a few thousand connections doing little, this wouldn't matter much. However, I'm trying to gauge whether it's feasible to use Haskell to handle 150-200k connections that regularly come and go and are held open for long periods of time. Such massive memory use, with what look like leaks (or fragmentation issues), is problematic.

I have created a very simple TCP-based echo client/server here: https://github.com/bbangert/echo

The server can be run after compiling with no additional options, and will listen on port 8080. The client can be run like so:

./dist/build/echoclient/echoclient localhost 8080 2000 0.5

The last number is the frequency to ping (every half second); the second-to-last is how many clients to connect.

Under my local tests, when pinging every 5+ seconds, 2k clients will take about 50-75 MB of RAM. Pinging every 0.5 seconds jumps to ~180 MB of RAM, and this is for a mere 2k clients. Starting/stopping the echoclient repeatedly also causes the server's overall memory usage to climb higher and higher. Releasing the connections never quite gets memory usage back down to where it started. The issue occurs under both GHC 7.6.3 and 7.8.2.

I know there's a variety of GHC options that might be tuned; am I missing some critical option to keep memory usage under control? Is there a better way to build high-throughput/high-concurrency TCP servers in Haskell that I'm missing?

Thanks,
Ben
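For reference, a stripped-down sketch of the kind of echo server under test (illustrative only, not the exact code at github.com/bbangert/echo, which also tracks a connected-client count): accept on 8080, fork a thread per connection, and echo each line back over a Handle.

```haskell
-- Illustrative sketch only (not the repo's code): a line-echo server using
-- the old network-2.x "Network" module, one lightweight thread per client.
import           Control.Concurrent    (forkIO)
import           Control.Exception     (IOException, catch)
import           Control.Monad         (forever)
import qualified Data.ByteString.Char8 as B
import           Network               (PortID (PortNumber), accept, listenOn, withSocketsDo)
import           System.IO             (BufferMode (LineBuffering), Handle, hClose, hSetBuffering)

main :: IO ()
main = withSocketsDo $ do
    sock <- listenOn (PortNumber 8080)
    forever $ do
        (h, _host, _port) <- accept sock
        hSetBuffering h LineBuffering
        forkIO (echo h `catch` clientGone >> hClose h)

-- Read a line, write it straight back, repeat until the peer disconnects.
echo :: Handle -> IO ()
echo h = forever $ do
    line <- B.hGetLine h
    B.hPutStrLn h line

-- hGetLine throws an IOException at EOF; treat that as "client left".
clientGone :: IOException -> IO ()
clientGone _ = return ()
```

The LineBuffering setting just makes sure each echoed line is flushed immediately rather than sitting in a block buffer.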

+kazu, +Simon M
On Tue, Jul 15, 2014 at 7:18 PM, Ben Bangert
I have created a very simple TCP based echo client/server here: https://github.com/bbangert/echo
Ben, could you please tell us what sort of machine you are running this on?
Is it Mac, Linux, or Windows?
I took your test case and hacked it down to eliminate some possible sources
of error (I was suspicious of Handle for a while, also of an old space leak
bug in "forever" which I think is fixed now):
https://github.com/gregorycollins/ghc-echo-leak-bug
The revised echoserver is fine on my machine (stable at 22MB resident) but *the
echo client leaks*. Happens with/without -O2 on GHC 7.8.3 for OSX.
Kazu, I think there's a good chance this is a bug in the multicore IO
manager, the test code is doing little more than write + read + threadDelay.
G
--
Gregory Collins

On Jul 15, 2014, at 1:49 PM, Gregory Collins
+kazu, +Simon M
On Tue, Jul 15, 2014 at 7:18 PM, Ben Bangert
wrote: I have created a very simple TCP based echo client/server here: https://github.com/bbangert/echo Ben, could you please tell us what sort of machine you are running this on? Is it Mac, Linux, or Windows?
It occurs on my Mac, and it occurs on the linux AWS AMI instances I've compiled/run it on.
I took your test case and hacked it down to eliminate some possible sources of error (I was suspicious of Handle for a while, also of an old space leak bug in "forever" which I think is fixed now):
I should note the use of ByteStrings was intentional (re: dropping hGetLine in the server); this is because many of the networking libs use ByteStrings and I'd prefer not to rewrite all the libs from the TCP layer up to the Websocket layer. This stack of layers (TCP, HTTP, Websocket) introduces various objects as it slices and dices the bytes into websocket frames using ByteStrings. It's quite possible the issue I'm seeing is related to how GHC handles the introduction of huge numbers of ByteStrings that must then all be GC'd (especially given that the profile shows memory being lost to fragmentation). Someone on the #haskell channel suggested that, due to how ByteStrings are implemented, GHC is unable to move them around in memory, which can lead to the fragmentation.
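(As a tiny, hypothetical illustration of that last point -- not code from any of these libraries: take/drop on a ByteString are O(1) slices that share the parent's pinned storage, so retaining even a two-byte slice keeps the whole receive buffer pinned and immovable by the GC.)

```haskell
-- Hypothetical illustration, not code from websockets/warp: both halves
-- returned here point into the original buffer, so retaining either one
-- keeps the entire (pinned) receive buffer alive and unmovable.
import qualified Data.ByteString as B

splitFrameHeader :: B.ByteString -> (B.ByteString, B.ByteString)
splitFrameHeader buf = (B.take 2 buf, B.drop 2 buf)
-- B.copy on a slice would force a fresh, right-sized allocation instead,
-- letting the big parent buffer be reclaimed.
```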
The revised echoserver is fine on my machine (stable at 22MB resident) but the echo client leaks. Happens with/without -O2 on GHC 7.8.3 for OSX.
Yes, I noticed the client leak, though since I'm mainly interested in running the server that didn't concern me as much.
Kazu, I think there's a good chance this is a bug in the multicore IO manager, the test code is doing little more than write + read + threadDelay.
Thanks a lot for looking at this. Hopefully the ByteString issue is known or possible to fix short of looking at the larger program I have that exhibits it (which is all open-source, so I can post that as well; the 'simple' websocket server and websocket tester look like this: https://gist.github.com/bbangert/6b7ef979963d7cb1838e, using the yesod-websocket code from the https://github.com/yesodweb/yesod/commit/66437453f57e6a2747ff7c2199aa7ad25db... changeset). I'll see about better packaging the more involved websocket one that leaks/fragments memory substantially faster. Cheers, Ben

Greg,
https://github.com/gregorycollins/ghc-echo-leak-bug
The revised echoserver is fine on my machine (stable at 22MB resident) but *the echo client leaks*. Happens with/without -O2 on GHC 7.8.3 for OSX.
I looked at your code very quickly. What happens if you replace "replicateM" and "mapM_" with recursion? (In particular, I don't trust replicateM in IO.) Also, we need to confirm that atomicModifyIORef' does not really leak space.
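For the replicateM point, something along these lines (illustrative only, not the actual test code), where the clients are forked with plain recursion so no list of results is accumulated:

```haskell
import Control.Concurrent (forkIO)

-- Fork n copies of an action by explicit recursion; unlike
-- 'replicateM n (forkIO act)', this never builds up a list of ThreadIds.
forkN :: Int -> IO () -> IO ()
forkN n act
    | n <= 0    = return ()
    | otherwise = do
        _ <- forkIO act
        forkN (n - 1) act
```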
Kazu, I think there's a good chance this is a bug in the multicore IO manager, the test code is doing little more than write + read + threadDelay.
If the space leak also happens with GHC 7.6.3, it is not specific to the multicore IO manager. But either the old or the new IO manager might have a potential space leak.

P.S. I have been running Mighty 3 (based on WAI), compiled with GHC 7.8.x, for a long time, and I don't see any space leak at all.

--Kazu

On Wed, Jul 16, 2014 at 3:04 AM, Kazu Yamamoto
I looked at your code very quickly. What happens if you replace "replicateM" and "mapM_" with recursion? (In particular, I don't trust replicateM in IO.)
Also, we need to confirm that atomicModifyIORef' does not really leak space.
Those functions are only executed once on start (to fork the threads, wait
for them to finish, and to update the connected clients counter
respectively), so they cannot account for steady-state memory increase from
the client loop --- the loop just does write + read + threadDelay.
G
--
Gregory Collins

On Jul 15, 2014, at 1:49 PM, Gregory Collins
I took your test case and hacked it down to eliminate some possible sources of error (I was suspicious of Handle for a while, also of an old space leak bug in "forever" which I think is fixed now):
https://github.com/gregorycollins/ghc-echo-leak-bug
The revised echoserver is fine on my machine (stable at 22MB resident) but the echo client leaks. Happens with/without -O2 on GHC 7.8.3 for OSX.
I've run the new server code, and indeed it initially takes less memory. However, when I ctrl-C the testing client, it fails to properly close the sockets now; they stay open forever (the original code I posted always closed the sockets when I killed the testing client). Did you try ctrl-C'ing the test client and re-running it several times? I'm guessing that the recv call is blocking forever and failing somehow to notice that the socket was actually closed under it, while the hGetLine I was using properly detects this condition and closes the socket when it notices the other side has left.

I should note that when asking for even 1000 echo clients, the test client as you've changed it dies when I try to launch it (it shows Clients Connected: 0, then just exits silently). I had to use my old testing client to test the server changes you made.

If the sockets would properly close when terminating the client abruptly, then it's quite possible memory usage would remain lower, as it definitely took much, much less memory for the first 2k echo clients.

Cheers,
Ben

On Wed, Jul 16, 2014 at 5:46 AM, Ben Bangert
I've run the new server code, indeed it initially takes less memory. However, when I ctrl-C the testing client, it fails to properly close sockets now, they stay open forever (the original code I posted always closed the sockets when I killed the testing client). Did you try ctrl-c the test client and re-running it several times?
Yes once I exhibited the leaking behaviour I stopped (and didn't implement cleanup properly). It should be ok now.
I'm guessing that the recv call is blocking forever and failing somehow to note that the socket was actually closed under it, while the hGetLine I was using properly detects this condition and closes the socket when it notices the other side has left.
The Handle stuff will still end up calling recv() under the hood.
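For what it's worth, at the raw socket level the usual end-of-stream signal is recv returning an empty ByteString; a rough sketch of closing on that (illustrative, not the code in either repo):

```haskell
import qualified Data.ByteString           as B
import           Network.Socket            (Socket, close)
import           Network.Socket.ByteString (recv, sendAll)

-- recv returns an empty ByteString once the peer has shut down its side;
-- use that as the cue to close the socket and free the file descriptor.
echoUntilClosed :: Socket -> IO ()
echoUntilClosed sock = do
    bytes <- recv sock 4096
    if B.null bytes
        then close sock
        else sendAll sock bytes >> echoUntilClosed sock
```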
I should note that when asking for even 1000 echo clients, the test client as you've changed it dies when I try to launch it (it shows Clients Connected: 0, then just exits silently).
Are you running out of FDs?
I had to use my old testing client to test the server changes you made. If the sockets would properly close when terminating the client abruptly, then its quite possible memory usage would remain lower as it definitely took much much less memory for the first 2k echo clients.
I've also updated my copy of the test code to work with the GHC 7.4.1 that's installed on my Ubuntu box at work (64-bit) --- and I *don't* see the client leak there; resident heap plateaus at around 90MB.
G
--
Gregory Collins

OK, I've done some more investigation here, as much time as I can spare for
now:
- I'm not sure this program really is leaking forever after all, even on
latest GHC. Originally I thought it was, because I was running only 2 pings
/ client-second as you were. If you increase this to something like 20
pings per client-second, you see the same asymptotics at first but
eventually the client plateaus, at least on my machine. I left it running
for an hour. The question remains as to why this program exhibits such
strange GC behavior (I don't see any reason for it to slowly gobble RAM
until plateauing at an arbitrary figure), maybe Simon M can comment.
- The biggest thing you're spending RAM on here is stacks for the
threads you create. By default the stack chunk size is 32k; you can lower
this with +RTS -kcXX --- using 2kB stacks both programs use <40MB heap
resident on my machine. Counting the garbage being generated, the space
needed for buffers/etc, and the fact that the binaries themselves are 8MB,
I don't think 20kB per active client is unreasonable.
- You can reduce GC pressure somewhat by reusing the output buffer; the "io-streams" branch at my copy of your test repo does this.
G
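As a rough illustration of the buffer-reuse point (this is just a sketch, not the actual io-streams branch): echo through one fixed buffer per connection with recvBuf/sendBuf from Network.Socket, so the steady-state loop allocates no fresh ByteStrings.

```haskell
import Control.Exception     (IOException, catch)
import Data.Word             (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr           (Ptr)
import Network.Socket        (Socket, recvBuf, sendBuf)

-- One 4k buffer lives for the whole connection; every ping is read into it
-- and written back out of it, so nothing new is allocated per message.
echoLoop :: Socket -> IO ()
echoLoop sock = allocaBytes 4096 go `catch` peerClosed
  where
    go :: Ptr Word8 -> IO ()
    go buf = do
        n <- recvBuf sock buf 4096
        if n <= 0
            then return ()               -- connection closed
            else do
                _ <- sendBuf sock buf n  -- partial sends ignored in this sketch
                go buf

    -- recvBuf may throw at EOF depending on the network version;
    -- treat that as the peer going away.
    peerClosed :: IOException -> IO ()
    peerClosed _ = return ()
```

(The smaller stack chunks mentioned above would be set by building with -rtsopts and running with something like +RTS -kc2k -RTS.)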
On Wed, Jul 16, 2014 at 1:00 PM, Gregory Collins
On Wed, Jul 16, 2014 at 5:46 AM, Ben Bangert
wrote: I've run the new server code, indeed it initially takes less memory. However, when I ctrl-C the testing client, it fails to properly close sockets now, they stay open forever (the original code I posted always closed the sockets when I killed the testing client). Did you try ctrl-c the test client and re-running it several times?
Yes once I exhibited the leaking behaviour I stopped (and didn't implement cleanup properly). It should be ok now.
I'm guessing that the recv call is blocking forever and failing somehow
to note that the socket was actually closed under it, while the hGetline I was using properly detects this condition and closes the socket when it notices the other side has left.
The Handle stuff will still end up calling recv() under the hood.
I should note that when asking for even 1000 echo clients, the test
client as you've changed it dies when I try to launch it (it shows Clients Connected: 0, then just exits silently).
Are you running out of FDs?
I had to use my old testing client to test the server changes you made. If the sockets would properly close when terminating the client abruptly, then its quite possible memory usage would remain lower as it definitely took much much less memory for the first 2k echo clients.
I've also updated my copy of the test code to work with the GHC 7.4.1 that's installed on my Ubuntu box at work (64-bit) --- and I *don't* see the client leak there, resident heap plateaus at around 90MB.
G -- Gregory Collins
--
Gregory Collins

On Jul 16, 2014, at 6:51 AM, Gregory Collins
OK, I've done some more investigation here, as much time as I can spare for now: • I'm not sure this program really is leaking forever after all, even on latest GHC. Originally I thought it was, because I was running only 2 pings / client-second as you were. If you increase this to something like 20 pings per client-second, you see the same asymptotics at first but eventually the client plateaus, at least on my machine. I left it running for an hour. The question remains as to why this program exhibits such strange GC behavior (I don't see any reason for it to slowly gobble RAM until plateauing at an arbitrary figure), maybe Simon M can comment.
Yes, with the changes I don't see leaking behavior. As I mentioned in a separate email though, this isn't very useful because the various network libraries in the wild (Warp, Yesod, Websockets, etc) all use ByteStrings for reading frames, etc. on the way to/from the socket. These seem to cause memory fragmentation and issues reclaiming memory, which is the real issue I'm seeing. Your test has removed the usage of several components that were most likely part of the problem I've had, but which I can't really remove from the fully functioning application as it would require rewriting libraries all the way down the stack.
• The biggest thing you're spending RAM on here is stacks for the threads you create. By default the stack chunk size is 32k, you can lower this with +RTS -kcXX --- using 2kB stacks both programs use <40MB heap resident on my machine. Counting the garbage being generated, the space needed for buffers/etc, and the fact that the binaries themselves are 8MB, I don't think 20kB per active client is unreasonable. • You can reduce GC pressure somewhat by reusing the output buffer, the "io-streams" branch at my copy of your test repo does this.
Aside from not closing the sockets, your master branch had excellent memory usage. I unfortunately wasn't able to try your io-streams branch, as I got this compile error: https://gist.github.com/bbangert/d26f28b410faaad4e8d2 I'll put together a minimal websocket server that better demonstrates the memory fragmentation issues today. Cheers, Ben

On Wed, Jul 16, 2014 at 5:28 PM, Ben Bangert
Yes, with the changes I don't see leaking behavior. As I mentioned in a separate email though, this isn't very useful because the various network libraries in the wild (Warp, Yesod, Websockets, etc) all use ByteStrings for reading frames, etc. on the way to/from the socket. These seem to cause memory fragmentation and issues reclaiming memory, which is the real issue I'm seeing.
OK good, that is a data point. BTW the master branch of my version of this test is still using bytestrings -- the main things that I replaced were the use of Handle from the high-level "Network" module and the use of async to run the threads --- not because I thought there was a problem with async but simply to rule out potentially confounding factors. Your issue may not be with bytestring fragmentation, but that is one of the working hypotheses, and now we are closer to discovering the issue.

Personally I am suspicious that network's Handle API (which, by the way, there is almost no reason to use --- libraries like io-streams, pipes, and conduit provide a better user experience) might be the thing that is leaking RAM. (+johan)

If the issue is GHC pinned heap fragmentation then you can try generating bytestrings with system malloc instead (which is what we are currently doing for network reads in io-streams and also IIRC what warp is doing). If doing that and linking your binary with e.g. tcmalloc fixes the issue, then I think that would probably be conclusive evidence.
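A rough sketch of what "bytestrings from system malloc" could look like (illustrative names, not the actual io-streams or warp code): allocate the payload with mallocBytes, fill it straight from the socket, and wrap it with a free() finalizer so the storage never lives in GHC's pinned heap.

```haskell
import qualified Data.ByteString.Internal as BI
import           Data.ByteString          (ByteString)
import           Foreign.ForeignPtr       (newForeignPtr)
import           Foreign.Marshal.Alloc    (finalizerFree, mallocBytes)
import           Network.Socket           (Socket, recvBuf)

-- Build a ByteString whose storage comes from the C allocator rather than
-- the RTS, so it cannot contribute to pinned-heap fragmentation.
recvMalloced :: Socket -> Int -> IO ByteString
recvMalloced sock bufSize = do
    ptr <- mallocBytes bufSize              -- storage from malloc(), not the GC heap
    n   <- recvBuf sock ptr bufSize         -- fill it directly from the socket
    fp  <- newForeignPtr finalizerFree ptr  -- free() runs when the ByteString is GC'd
    return (BI.fromForeignPtr fp 0 n)       -- zero-copy wrap: offset 0, length n
```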
Your test has removed the usage of several components that were most likely part of the problem I've had, but which I can't really remove from the fully functioning application as it would require rewriting libraries all the way down the stack.
Of course, but we are trying to find the leak here. Now we can re-add things to the test, starting with going back to Handle.
• The biggest thing you're spending RAM on here is stacks for the threads you create. By default the stack chunk size is 32k, you can lower this with +RTS -kcXX --- using 2kB stacks both programs use <40MB heap
BTW, I lied about this: that parameter controls how much stacks grow if they overflow; the default stack size for threads is 1kB.

Aside from not closing the sockets, your master branch had excellent memory usage. I unfortunately wasn't able to try your io-streams branch, as I got this compile error: https://gist.github.com/bbangert/d26f28b410faaad4e8d2
This thread is just full of fun, isn't it.
G
--
Gregory Collins

You guys seem to be doing a good job of narrowing this down. We know there are issues with fragmentation when using ByteStrings; in the worst case each ByteString can pin up to its size rounded up to 4Kbytes. If you're not keeping old ByteStrings around but are regularly recycling them, then you shouldn't see this problem. I believe we made some improvements (allegedly) in 7.6 to the fragmentation behaviour.

Do let me know when you've narrowed it down to something you think I should look at.

Cheers,
Simon

On 16/07/2014 14:51, Gregory Collins wrote:
OK, I've done some more investigation here, as much time as I can spare for now:
* I'm not sure this program really is leaking forever after all, even on latest GHC. Originally I thought it was, because I was running only 2 pings / client-second as you were. If you increase this to something like 20 pings per client-second, you see the same asymptotics at first but eventually the client plateaus, at least on my machine. I left it running for an hour. The question remains as to why this program exhibits such strange GC behavior (I don't see any reason for it to slowly gobble RAM until plateauing at an arbitrary figure), maybe Simon M can comment. * The biggest thing you're spending RAM on here is stacks for the threads you create. By default the stack chunk size is 32k, you can lower this with +RTS -kcXX --- using 2kB stacks both programs use <40MB heap resident on my machine. Counting the garbage being generated, the space needed for buffers/etc, and the fact that the binaries themselves are 8MB, I don't think 20kB per active client is unreasonable. * You can reduce GC pressure somewhat by reusing the output buffer, the "io-streams" branch at my copy of your test repo does this.
G
On Wed, Jul 16, 2014 at 1:00 PM, Gregory Collins
wrote: On Wed, Jul 16, 2014 at 5:46 AM, Ben Bangert
wrote: I've run the new server code, indeed it initially takes less memory. However, when I ctrl-C the testing client, it fails to properly close sockets now, they stay open forever (the original code I posted always closed the sockets when I killed the testing client). Did you try ctrl-c the test client and re-running it several times?
Yes once I exhibited the leaking behaviour I stopped (and didn't implement cleanup properly). It should be ok now.
I'm guessing that the recv call is blocking forever and failing somehow to note that the socket was actually closed under it, while the hGetline I was using properly detects this condition and closes the socket when it notices the other side has left.
The Handle stuff will still end up calling recv() under the hood.
I should note that when asking for even 1000 echo clients, the test client as you've changed it dies when I try to launch it (it shows Clients Connected: 0, then just exits silently).
Are you running out of FDs?
I had to use my old testing client to test the server changes you made. If the sockets would properly close when terminating the client abruptly, then its quite possible memory usage would remain lower as it definitely took much much less memory for the first 2k echo clients.
I've also updated my copy of the the test code to work with the GHC 7.4.1 that's installed on my Ubuntu box at work (64-bit) --- and I *don't* see the client leak there, resident heap plateaus at around 90MB.
G -- Gregory Collins
--
Gregory Collins

On Jul 17, 2014, at 11:56 AM, Simon Marlow
You guys seem to be doing a good job of narrowing this down. We know there are issues with fragmentation when using ByteStrings; in the worst case each ByteString can pin up to its size rounded up to 4Kbytes. If you're not keeping old ByteStrings around but are regularly recycling them, then you shouldn't see this problem. I believe we made some improvements (allegedly) in 7.6 to the fragmentation behaviour.
Do let me know when you've narrowed it down to something you think I should look at.
I made a slightly more complex version of the echo server that uses attoparsec to read into a basic datatype here: https://gist.github.com/bbangert/592e3dcc0253f275e9a3

I've been unable to make it leak using the socket handling code Gregory supplied, and its growth bounds seem fine (113 MB for 5k connections each sending 2 pings/sec). As I flesh out the HTTP/2 frame handling (which will result in substantial increases in temporary ByteStrings), I'll see if any fragmentation issues return.

Thanks!
Ben
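The general shape of parsing a small datatype incrementally from socket reads with attoparsec looks roughly like this (hypothetical Msg/parseMsg names, not the gist itself):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Control.Applicative              ((<|>))
import qualified Data.Attoparsec.ByteString       as A
import           Data.Attoparsec.ByteString.Char8 (Parser, string)
import qualified Data.ByteString                  as B
import           Network.Socket                   (Socket)
import           Network.Socket.ByteString        (recv)

-- A toy message type and parser standing in for the real frame types.
data Msg = Ping | Pong deriving (Show)

parseMsg :: Parser Msg
parseMsg = (Ping <$ string "PING") <|> (Pong <$ string "PONG")

-- Feed socket reads into attoparsec's continuation until a full Msg arrives.
readMsg :: Socket -> IO (Maybe Msg)
readMsg sock = go (A.parse parseMsg)
  where
    go k = do
        bytes <- recv sock 4096
        if B.null bytes
            then return Nothing                    -- peer closed before a full message
            else case k bytes of
                   A.Done _ msg -> return (Just msg)
                   A.Partial k' -> go k'           -- need more input; keep the continuation
                   A.Fail {}    -> return Nothing  -- protocol error
```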
participants (4)
- Ben Bangert
- Gregory Collins
- Kazu Yamamoto
- Simon Marlow