RE: [Haskell-cafe] Optimizing a high-traffic network architecture

On 15 December 2005 10:21, Joel Reymont wrote:
Here are statistics that I gathered. I'm almost done modifying the program to use 1 timer thread instead of 1 per bot as well as writing to the socket from the writer thread. This should reduce the number of threads from 6k (2k x 3) to 2k plus change.
It appears that +RTS -k3k does make a difference. As per Simon, 2-4k avoids the thread being garbage collected because each thread gets its own block in the storage manager. Simon, did I get that right?
The 3k threads are still GC'd, but they are not actually *copied* during GC. It'll increase the memory overhead per thread from 2k (1k * 2 for copying) to 4k (4k block, no overhead for copying). Cheers, Simon

Hello Simon, Thursday, December 15, 2005, 4:53:27 PM, you wrote: SM> The 3k threads are still GC'd, but they are not actually *copied* during SM> GC. SM> It'll increase the memory overhead per thread from 2k (1k * 2 for SM> copying) to 4k (4k block, no overhead for copying). Simon, why not to include this in the "base package"? either change something so that a 1k-threads will be not copied during GC, or at least increment default stack size? this will improve performance of other hyper-threaded programs. memory expenses seems not so great -- Best regards, Bulat mailto:bulatz@HotPOP.com

Bulat Ziganshin wrote:
Hello Simon,
Thursday, December 15, 2005, 4:53:27 PM, you wrote:
SM> The 3k threads are still GC'd, but they are not actually *copied* during SM> GC.
SM> It'll increase the memory overhead per thread from 2k (1k * 2 for SM> copying) to 4k (4k block, no overhead for copying).
Simon, why not to include this in the "base package"? either change something so that a 1k-threads will be not copied during GC, or at least increment default stack size? this will improve performance of other hyper-threaded programs. memory expenses seems not so great
Because it doesn't always improve things. This is a slightly modified version of the "cheap concurrency" benchmark from the shootout, first without tweaking -k:
./threads003 10000 +RTS -sstderr ./threads003 10000 +RTS -sstderr 93,908,920 bytes allocated in the heap 159,724,208 bytes copied during GC (scavenged) 1,559,376 bytes copied during GC (not scavenged) 10,415,848 bytes maximum residency (4 sample(s))
177 collections in generation 0 ( 1.05s) 4 collections in generation 1 ( 0.02s) 21 Mb total memory in use INIT time 0.00s ( 0.00s elapsed) MUT time 1.28s ( 1.28s elapsed) GC time 1.06s ( 1.09s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 2.35s ( 2.37s elapsed) %GC time 45.3% (45.9% elapsed) Alloc rate 73,149,011 bytes per MUT second Productivity 54.7% of total user, 54.1% of total elapsed and now tweaking -k (using -k6k, because this is a 64-bit machine and storage manager blocks are 8k):
./threads003 10000 +RTS -sstderr -k6k ./threads003 10000 +RTS -sstderr -k6k 168,837,736 bytes allocated in the heap 109,203,160 bytes copied during GC (scavenged) 1,497,728 bytes copied during GC (not scavenged) 71,180,464 bytes maximum residency (2 sample(s))
156 collections in generation 0 ( 1.06s) 2 collections in generation 1 ( 0.01s) 86 Mb total memory in use INIT time 0.00s ( 0.00s elapsed) MUT time 2.48s ( 2.58s elapsed) GC time 1.08s ( 1.08s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 3.56s ( 3.65s elapsed) %GC time 30.3% (29.5% elapsed) Alloc rate 68,007,748 bytes per MUT second Productivity 69.7% of total user, 67.9% of total elapsed My hypothesis is that when we give each thread its own memory block, all the thread stacks occupy the same cache lines and we end up with a lot more cache misses (notice it's the MUT time that increased, not the GC time). Test program attached, if anyone's interested in digging further. Cheers, Simon -- $Id: message-ghc-2.code,v 1.3 2005/09/17 04:36:26 bfulgham Exp $ -- The Great Computer Language Shootout -- http://shootout.alioth.debian.org/ -- Contributed by Einar Karttunen -- Modified by Simon Marlow -- This is the shootout "cheap concurrency" benchmark, modified -- slightly. Modification noted below (***) to add more concurrency -- and make a speedup on multiple processors available. -- Creates 500 threads arranged in a sequence where each takes a value -- from the left, adds 1, and passes it to the right (via MVars). -- N more threads pump zeros in at the left. A sub-thread -- takes N values from the right and sums them. -- import Control.Concurrent import Control.Monad import System thread :: MVar Int -> MVar Int -> IO () thread inp out = do x <- takeMVar inp; putMVar out $! x+1; thread inp out spawn cur _ = do next <- newEmptyMVar forkIO $ thread cur next return next main = do n <- getArgs >>= readIO.head s <- newEmptyMVar e <- foldM spawn s [1..500] f <- newEmptyMVar forkIO $ replicateM n (takeMVar e) >>= putMVar f . sum replicateM n (forkIO $ putMVar s 0) -- *** replicateM n (putMVar s 0) takeMVar f -- vim: ts=4 ft=haskell

My apologies if this has been described somewhere but what is MUT time? Also, isn't 30% GC a bit high? This is something that totally surprised me when I first saw it as my program was spending 60-70% on GC. Is there a good low % number that should be used as a benchmark? Thanks, Joel On Jan 5, 2006, at 10:01 AM, Simon Marlow wrote:
INIT time 0.00s ( 0.00s elapsed) MUT time 2.48s ( 2.58s elapsed) GC time 1.08s ( 1.08s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 3.56s ( 3.65s elapsed)
%GC time 30.3% (29.5% elapsed)
Alloc rate 68,007,748 bytes per MUT second
Productivity 69.7% of total user, 67.9% of total elapsed
My hypothesis is that when we give each thread its own memory block, all the thread stacks occupy the same cache lines and we end up with a lot more cache misses (notice it's the MUT time that increased, not the GC time).

Joel Reymont wrote:
My apologies if this has been described somewhere but what is MUT time?
MUTator time, i.e. the time spent doing real work by your program. (the term "mutator" isn't used so much these days, but it comes from the view of a functional program as a graph, and the engine that "mutates" or "reduces" the graph is called the "mutator".)
Also, isn't 30% GC a bit high? This is something that totally surprised me when I first saw it as my program was spending 60-70% on GC.
It is a little high, yes. GC time on a 64-bit machine tends to be higher because there are twice as many bits to shuffle around.
Is there a good low % number that should be used as a benchmark?
As a rule of thumb I usually consider >30% to be a bit on the high side, it would be time to investigate with profiling and try to reduce residency or allocation rate. Cheers, Simon
participants (3)
-
Bulat Ziganshin
-
Joel Reymont
-
Simon Marlow