RE: [Haskell-cafe] Optimizing a high-traffic network architecture

On 15 December 2005 10:21, Joel Reymont wrote:
Here are statistics that I gathered. I'm almost done modifying the program to use 1 timer thread instead of 1 per bot as well as writing to the socket from the writer thread. This should reduce the number of threads from 6k (2k x 3) to 2k plus change.
It appears that +RTS -k3k does make a difference. As per Simon, 2-4k avoids the thread being garbage collected because each thread gets its own block in the storage manager. Simon, did I get that right?
BTW, how does garbage-collecting a thread work in this scenario? My threads are very long-running.
The total is the number of bots launched, lobby is how many bots connected to the lobby. Failed is mostly due to connection reset by peer errors. The Windows C++ server uses IOCP and running a firewall was apparently interfering with that somehow. I hate Windows :-(.
--- Test #1, +RTS -k3k as per Simon. Keep-alive timeout of 9 minutes.
Total: 1961, Lobby: 1961, Failed: 0
Total: 2000, Lobby: 2000, Failed: 1
This test went smoothly and got to 2k connections very quickly. Maybe within 30 minutes or so. I did not gather CPU usage, etc. statistics.
--- Test #2, No thread stack increase, 1 minute keep-alive timeout, more network traffic
With a 1 minute timeout things run veeery slow. 86 physical and 158Mb of VM with 1k bots, CPU 50-60%. Data sent/received is 60-70 packets and 6-7kb/sec. Killed after a while.
The statistics are phys/VM, CPU usage in % and #packets/transfer speed
Total: 1345, Lobby: 1326, Failed: 0, 102/184, 50%, 90/8kb
Total: 1395, Lobby: 1367, Failed: 2
Total: 1421, Lobby: 1394, Failed: 4
Total: 1490, Lobby: 1463, Failed: 4, 108/194, 50%, 110/11kb
Total: 1574, Lobby: 1546, Failed: 4, 113/202, 50%, 116/11kb
Hmm, your machine is spending 50% of its time doing nothing, and the network traffic is very low. I wouldn't expect 2k connections to pose any problem at all, so further investigation is definitely required.

With 2k connections the overhead of select() is going to start to be a problem. You would notice the system time going up. -threaded may help with this, because it calls select() less often.

If that's not the cause, we should find out what your app is doing while it's idle. If there are runnable threads (e.g. the launcher), then the app should not be spending any of its time idle.

Cheers,
Simon
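One way to check the "system time going up" symptom is to compare user and system CPU time around a piece of work. The following is only a rough sketch in Python (not the Haskell program under discussion); `cpu_split` and `poll_many` are hypothetical helper names, and the idle socket pairs are a stand-in workload, not a model of the bots.

```python
import os
import select
import socket


def cpu_split(work):
    """Run work() and return (user_seconds, system_seconds) consumed by this process."""
    before = os.times()
    work()
    after = os.times()
    return after.user - before.user, after.system - before.system


def poll_many(n_sockets=50, rounds=1000):
    """Call select() repeatedly, with a zero timeout, on n_sockets idle sockets."""
    pairs = [socket.socketpair() for _ in range(n_sockets)]
    readers = [a for a, _ in pairs]
    try:
        for _ in range(rounds):
            # Every call hands the whole descriptor set to the kernel again.
            select.select(readers, [], [], 0)
    finally:
        for a, b in pairs:
            a.close()
            b.close()


if __name__ == "__main__":
    user, system = cpu_split(poll_many)
    print(f"user {user:.3f}s, system {system:.3f}s")
```

If the system share grows much faster than the user share as the number of sockets rises, that would be consistent with select() overhead rather than work done in the application itself.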

On Dec 15, 2005, at 2:02 PM, Simon Marlow wrote:
The statistics are phys/VM, CPU usage in % and #packets/transfer speed
Total: 1345, Lobby: 1326, Failed: 0, 102/184, 50%, 90/8kb
Total: 1395, Lobby: 1367, Failed: 2
Total: 1421, Lobby: 1394, Failed: 4
Total: 1490, Lobby: 1463, Failed: 4, 108/194, 50%, 110/11kb
Total: 1574, Lobby: 1546, Failed: 4, 113/202, 50%, 116/11kb
Hmm, your machine is spending 50% of its time doing nothing, and the network traffic is very low. I wouldn't expect 2k connections to pose any problem at all, so further investigation is definitely required.
That's CPU utilization by the program. My laptop is actually running a lot of other stuff as well, although the other stuff is not consuming much CPU.
With 2k connections the overhead of select() is going to start to be a problem. You would notice the system time going up. -threaded may help with this, because it calls select() less often.
I'm testing 4k connections now but I think the app is spending most of the time collecting garbage :-). Well, running handlers on those keep-alive packets as well to update internal state.

I think I would need to profile next. I would love to see a report of data in drag/void state but it's impossible since I'm using STM. Unless I can hack support for STM into profiling myself (unlikely? any pointers?) I think I'll have to move away from STM just to profile the program.

Joel

--
http://wagerlabs.com/

Hello Joel, Thursday, December 15, 2005, 5:13:17 PM, you wrote:
The statistics are phys/VM, CPU usage in % and #packets/transfer speed
Total: 1345, Lobby: 1326, Failed: 0, 102/184, 50%, 90/8kb
Total: 1395, Lobby: 1367, Failed: 2
Total: 1421, Lobby: 1394, Failed: 4
Total: 1490, Lobby: 1463, Failed: 4, 108/194, 50%, 110/11kb
Total: 1574, Lobby: 1546, Failed: 4, 113/202, 50%, 116/11kb
Hmm, your machine is spending 50% of its time doing nothing, and the network traffic is very low. I wouldn't expect 2k connections to pose any problem at all, so further investigation is definitely required.
JR> That's CPU utilization by the program. My laptop is actually running
JR> a lot of other stuff as well, although the other stuff is not
JR> consuming much CPU.

If your program has something to do but CPU usage is less than 100%, this means (at least on Windows) that your program is sitting in system calls that wait for hardware, for example a read from disk. Your program may be waiting for network I/O or logging I/O. Try disabling those parts of the code and see how CPU utilization changes.

--
Best regards,
Bulat
mailto:bulatz@HotPOP.com

On Dec 15, 2005, at 2:02 PM, Simon Marlow wrote:
Hmm, your machine is spending 50% of its time doing nothing, and the network traffic is very low. I wouldn't expect 2k connections to pose any problem at all, so further investigation is definitely required.
With 2k connections the overhead of select() is going to start to be a problem. You would notice the system time going up. -threaded may help with this, because it calls select() less often.
I ran two more tests today after making a few changes. The end result is that increasing the thread stack space makes the program run significantly faster as it was able to launch 1,000 more bots within the same hour.

Looking at the end of the 2nd test, 267Mb of physical memory and 423Mb of VM are something that I will need to really look into. 80% CPU utilization by the app is probably a combination of select on 4k sockets
The 89 failures are all connections reset by peer, probable cause is my wireless LAN.

I'm now using the threaded runtime. Worker threads write to the socket. There's one thread monitoring all the timers.

Started about 12:30pm with no thread stack increase and full (very verbose) logging. It's running 5 OS threads pretty consistently.

Total: 399, Lobby: 398, Failed: 0, 26/81, 10-20%
Total: 819, Lobby: 810, Failed: 0, 52/119, 20-30%
Total: 1051, Lobby: 1048, Failed: 0, 63/136, 30-50%
Total: 1229, Lobby: 1219, Failed: 0, 74/153, 30-50%
Total: 1318, Lobby: 1299, Failed: 0, 76/157, 30-50%
Total: 1448, Lobby: 1433, Failed: 0, 82/167, 40-60%, 13:06
Total: 1544, Lobby: 1526, Failed: 0, 86/174, 50-60%, 13:13
Total: 1672, Lobby: 1648, Failed: 0, 90/182, 50-60%, 13:23
Total: 1754, Lobby: 1727, Failed: 0, 91/186, 40-60%, 13:31
Total: 1824, Lobby: 1796, Failed: 0, 93/189, 50-60%, 13:40

With reduced logging and +RTS -k3k. Started at 13:42, 4 OS threads.

Total: 367, Lobby: 363, Failed: 0, 24/76, 10%, 13:49
Total: 516, Lobby: 510, Failed: 14, 34/91, 10-20%, 13:52
Total: 841, Lobby: 836, Failed: 17, 49/116, 20%, 13:56
Total: 1450, Lobby: 1434, Failed: 34, 97/181, 20-50-80%, 14:08
Total: 2008, Lobby: 1999, Failed: 35, 133/234, 70-80%, 14:20
Total: 2318, Lobby: 2308, Failed: 35, 154/263, 70-85%, 14:29
Total: 2623, Lobby: 2613, Failed: 35, 174/293, 70-80%, 14:39
Total: 2862, Lobby: 2854, Failed: 35, 191/316, 70-80%, 14:47
Total: 3151, Lobby: 3142, Failed: 40, 214/347, 60-80%, 14:56
Total: 3364, Lobby: 3355, Failed: 40, 219/359, 60-80%, 15:03
Total: 3808, Lobby: 3744, Failed: 89, 247/398, 70-85%, 15:19
Total: 4000, Lobby: 3938, Failed: 89, 267/423, 80%, 15:27

The system has 120+Mb of free physical memory around 3pm but is not swapping heavily as the number of page outs is not increasing. There's a total of 1Gb of physical memory.

4 OS threads became 5 at some point.

--
http://wagerlabs.com/
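The "one thread monitoring all the timers" arrangement can be sketched as a heap of deadlines serviced by a single thread. This is an illustrative Python sketch only; the class name `TimerThread` and its methods are hypothetical and say nothing about how the actual Haskell program is structured.

```python
import heapq
import itertools
import threading
import time


class TimerThread:
    """One thread servicing many timers, kept in a heap ordered by deadline."""

    def __init__(self):
        self._heap = []                  # entries: (deadline, seq, callback)
        self._seq = itertools.count()    # tie-breaker so callbacks are never compared
        self._cond = threading.Condition()
        self._stopped = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def add(self, delay, callback):
        """Schedule callback() to run about `delay` seconds from now."""
        with self._cond:
            entry = (time.monotonic() + delay, next(self._seq), callback)
            heapq.heappush(self._heap, entry)
            self._cond.notify()          # the earliest deadline may have changed

    def stop(self):
        """Stop the thread; timers that have not fired yet are dropped."""
        with self._cond:
            self._stopped = True
            self._cond.notify()
        self._thread.join()

    def _run(self):
        while True:
            with self._cond:
                while not self._stopped:
                    if self._heap:
                        wait = self._heap[0][0] - time.monotonic()
                        if wait <= 0:
                            break        # earliest timer is due
                        self._cond.wait(wait)
                    else:
                        self._cond.wait()
                if self._stopped:
                    return
                _, _, callback = heapq.heappop(self._heap)
            callback()                   # run outside the lock so callbacks may call add()
```

With this shape the number of threads no longer grows with the number of timers; the cost is that a slow callback delays every timer queued behind it.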

On Thu, Dec 15, 2005 at 02:02:02PM -0000, Simon Marlow wrote:
With 2k connections the overhead of select() is going to start to be a problem. You would notice the system time going up. -threaded may help with this, because it calls select() less often.
we should be using /dev/poll on systems that support it. it cuts down on the overhead a whole lot. 'poll(2)' is also mostly portable and usually better than select since there is no arbitrary file descriptor limit and it doesn't have to traverse the whole bitset. a few #ifdefs should let us choose the optimum one available on any given system.

John

--
John Meacham - ⑆repetae.net⑆john⑈

On 15.12 17:14, John Meacham wrote:
On Thu, Dec 15, 2005 at 02:02:02PM -0000, Simon Marlow wrote:
With 2k connections the overhead of select() is going to start to be a problem. You would notice the system time going up. -threaded may help with this, because it calls select() less often.
we should be using /dev/poll on systems that support it. it cuts down on the overhead a whole lot. 'poll(2)' is also mostly portable and usually better than select since there is no arbitrary file descriptor limit and it doesn't have to traverse the whole bitset. a few #ifdefs should let us choose the optimum one available on any given system.
To make matters nontrivial, all the *nix variants use a different, more efficient replacement for poll:

Solaris has /dev/poll
*BSD (and OS X) has kqueue
Linux has epoll

Also, on Linux, NPTL + blocking calls can actually be very fast in a suitable scenario.

An additional problem is that these mechanisms depend on the version of the kernel running on the machine... Thus e.g. not all Linux machines will have epoll.

- Einar Karttunen
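The "pick the best mechanism available" idea can be illustrated with a small probe. This is a sketch in Python rather than the #ifdef approach suggested for the runtime system; `best_poller` is a hypothetical helper, and it only reports what the Python build exposes, which is not a guarantee about the running kernel.

```python
import select


def best_poller():
    """Return the name of the most scalable readiness API this Python build exposes."""
    # Ordered from the platform-specific mechanisms down to the portable ones.
    for name in ("epoll", "kqueue", "devpoll", "poll"):
        if hasattr(select, name):
            return name
    return "select"      # present wherever the select module exists


if __name__ == "__main__":
    print(best_poller())
```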

Einar Karttunen
To make matters nontrivial, all the *nix variants use a different, more efficient replacement for poll:
Solaris has /dev/poll
*BSD (and OS X) has kqueue
Linux has epoll
Since this is 'cafe, here's a page that has some performance testing of epoll: http://lse.sourceforge.net/epoll/
Thus e.g. not all linux machines will have epoll.
It is present in 2.6, but not 2.4?

-k
--
If I haven't seen further, it is by standing in the footprints of giants

On 12/16/05, Einar Karttunen wrote:
To make matters nontrivial, all the *nix variants use a different, more efficient replacement for poll.
So we should find a library that offers a unified interface for all of them, or implement one ourselves. I am pretty sure such a library exists. It should fall back to select() or poll() on platforms that don't have better alternatives.

Best regards
Tomasz
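As an illustration of what such a unified interface with a fallback looks like, Python's standard `selectors` module picks epoll, kqueue, /dev/poll, poll or select depending on the platform. The sketch below is not a proposal for the Haskell side; `wait_readable` is a hypothetical helper name.

```python
import selectors
import socket


def wait_readable(socks, timeout=1.0):
    """Return the sockets from `socks` that become readable within `timeout` seconds."""
    # DefaultSelector is an alias for the most efficient implementation
    # available on the current platform, with select() as the last resort.
    sel = selectors.DefaultSelector()
    try:
        for s in socks:
            sel.register(s, selectors.EVENT_READ)
        return [key.fileobj for key, _ in sel.select(timeout)]
    finally:
        sel.close()


if __name__ == "__main__":
    a, b = socket.socketpair()
    b.send(b"ping")
    print(wait_readable([a]) == [a])
    a.close()
    b.close()
```

The caller never names the underlying mechanism, which is the property being asked for here.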

On Fri, Dec 16, 2005 at 07:03:46AM +0100, Tomasz Zielonka wrote:
On 12/16/05, Einar Karttunen wrote:
To make matters nontrivial, all the *nix variants use a different, more efficient replacement for poll.
So we should find a library that offers a unified interface for all of them, or implement one ourselves.
http://monkey.org/~provos/libevent/

See also http://www.kegel.com/c10k.html

Andrew

On 16.12 07:03, Tomasz Zielonka wrote:
On 12/16/05, Einar Karttunen wrote:
To make matters nontrivial, all the *nix variants use a different, more efficient replacement for poll.
So we should find a library that offers a unified interface for all of them, or implement one ourselves.
I am pretty sure such a library exists. It should fall back to select() or poll() on platforms that don't have better alternatives.
network-alt has select(2), epoll, blocking and very experimental kqueue (the last one is not yet committed, but I can supply patches if someone is interested).

- Einar

John Meacham wrote:
On Thu, Dec 15, 2005 at 02:02:02PM -0000, Simon Marlow wrote:
With 2k connections the overhead of select() is going to start to be a problem. You would notice the system time going up. -threaded may help with this, because it calls select() less often.
we should be using /dev/poll on systems that support it.
And kqueue for systems that support that. Much, much more efficient than select.

-- Lennart
participants (9)
- Andrew Pimlott
- Bulat Ziganshin
- Einar Karttunen
- Joel Reymont
- John Meacham
- Ketil Malde
- Lennart Augustsson
- Simon Marlow
- Tomasz Zielonka