OK, I've done some more investigation here, as much time as I can spare for now:
- I'm not sure this program really is leaking forever after all, even on latest GHC. Originally I thought it was, because I was running only 2 pings per client-second, as you were. If you increase this to something like 20 pings per client-second, you see the same asymptotics at first, but eventually the client plateaus, at least on my machine; I left it running for an hour. The question remains why this program exhibits such strange GC behavior (I don't see any reason for it to slowly gobble RAM until plateauing at an arbitrary figure). Maybe Simon M can comment.
- The biggest thing you're spending RAM on here is stacks for the threads you create. By default the stack chunk size is 32k; you can lower this with +RTS -kcXX. Using 2kB stacks, both programs use <40MB of resident heap on my machine. Counting the garbage being generated, the space needed for buffers etc., and the fact that the binaries themselves are 8MB, I don't think 20kB per active client is unreasonable.
- You can reduce GC pressure somewhat by reusing the output buffer; the "io-streams" branch of my copy of your test repo does this.
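
To illustrate the buffer-reuse point, here is a minimal sketch using only `base`. It is not the code from the "io-streams" branch; `pingMessage`, `bufSize`, `fillBuffer` and `sendPing` are hypothetical names made up for this example, and it writes to stdout rather than a socket. The idea is just to allocate one buffer up front and overwrite it on every send, instead of building a fresh output value per ping.

```haskell
import Control.Monad (forM_, replicateM_)
import Data.Char (ord)
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes, withForeignPtr)
import Foreign.Storable (pokeByteOff)
import System.IO (Handle, hFlush, hPutBuf, stdout)

-- Hypothetical payload and buffer size, chosen only for illustration.
pingMessage :: String
pingMessage = "PING\n"

bufSize :: Int
bufSize = 64

-- Overwrite the start of the preallocated buffer with the message
-- (truncated to bufSize) and return the number of bytes written.
-- Walking a String still allocates a little; real code would copy
-- bytes straight from wherever the payload already lives.
fillBuffer :: ForeignPtr Word8 -> String -> IO Int
fillBuffer fp msg = do
  let bytes = take bufSize msg
  withForeignPtr fp $ \p ->
    forM_ (zip [0 ..] bytes) $ \(i, c) ->
      pokeByteOff p i (fromIntegral (ord c) :: Word8)
  return (length bytes)

-- Send one ping using the shared buffer; no new buffer is allocated here.
sendPing :: Handle -> ForeignPtr Word8 -> IO ()
sendPing h fp = do
  n <- fillBuffer fp pingMessage
  withForeignPtr fp $ \p -> hPutBuf h p n

main :: IO ()
main = do
  buf <- mallocForeignPtrBytes bufSize  -- allocated once, reused below
  replicateM_ 3 (sendPing stdout buf)
  hFlush stdout
```

If you build this with -rtsopts and run it with +RTS -s, you can compare the allocation figures against a variant that allocates a new buffer per send.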
G