
Bulat Ziganshin wrote:
Hello Simon,
Thursday, December 15, 2005, 4:53:27 PM, you wrote:
SM> The 3k threads are still GC'd, but they are not actually *copied* during SM> GC.
SM> It'll increase the memory overhead per thread from 2k (1k * 2 for SM> copying) to 4k (4k block, no overhead for copying).
Simon, why not to include this in the "base package"? either change something so that a 1k-threads will be not copied during GC, or at least increment default stack size? this will improve performance of other hyper-threaded programs. memory expenses seems not so great
Because it doesn't always improve things. This is a slightly modified version of the "cheap concurrency" benchmark from the shootout, first without tweaking -k:
./threads003 10000 +RTS -sstderr ./threads003 10000 +RTS -sstderr 93,908,920 bytes allocated in the heap 159,724,208 bytes copied during GC (scavenged) 1,559,376 bytes copied during GC (not scavenged) 10,415,848 bytes maximum residency (4 sample(s))
177 collections in generation 0 ( 1.05s) 4 collections in generation 1 ( 0.02s) 21 Mb total memory in use INIT time 0.00s ( 0.00s elapsed) MUT time 1.28s ( 1.28s elapsed) GC time 1.06s ( 1.09s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 2.35s ( 2.37s elapsed) %GC time 45.3% (45.9% elapsed) Alloc rate 73,149,011 bytes per MUT second Productivity 54.7% of total user, 54.1% of total elapsed and now tweaking -k (using -k6k, because this is a 64-bit machine and storage manager blocks are 8k):
./threads003 10000 +RTS -sstderr -k6k ./threads003 10000 +RTS -sstderr -k6k 168,837,736 bytes allocated in the heap 109,203,160 bytes copied during GC (scavenged) 1,497,728 bytes copied during GC (not scavenged) 71,180,464 bytes maximum residency (2 sample(s))
156 collections in generation 0 ( 1.06s) 2 collections in generation 1 ( 0.01s) 86 Mb total memory in use INIT time 0.00s ( 0.00s elapsed) MUT time 2.48s ( 2.58s elapsed) GC time 1.08s ( 1.08s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 3.56s ( 3.65s elapsed) %GC time 30.3% (29.5% elapsed) Alloc rate 68,007,748 bytes per MUT second Productivity 69.7% of total user, 67.9% of total elapsed My hypothesis is that when we give each thread its own memory block, all the thread stacks occupy the same cache lines and we end up with a lot more cache misses (notice it's the MUT time that increased, not the GC time). Test program attached, if anyone's interested in digging further. Cheers, Simon -- $Id: message-ghc-2.code,v 1.3 2005/09/17 04:36:26 bfulgham Exp $ -- The Great Computer Language Shootout -- http://shootout.alioth.debian.org/ -- Contributed by Einar Karttunen -- Modified by Simon Marlow -- This is the shootout "cheap concurrency" benchmark, modified -- slightly. Modification noted below (***) to add more concurrency -- and make a speedup on multiple processors available. -- Creates 500 threads arranged in a sequence where each takes a value -- from the left, adds 1, and passes it to the right (via MVars). -- N more threads pump zeros in at the left. A sub-thread -- takes N values from the right and sums them. -- import Control.Concurrent import Control.Monad import System thread :: MVar Int -> MVar Int -> IO () thread inp out = do x <- takeMVar inp; putMVar out $! x+1; thread inp out spawn cur _ = do next <- newEmptyMVar forkIO $ thread cur next return next main = do n <- getArgs >>= readIO.head s <- newEmptyMVar e <- foldM spawn s [1..500] f <- newEmptyMVar forkIO $ replicateM n (takeMVar e) >>= putMVar f . sum replicateM n (forkIO $ putMVar s 0) -- *** replicateM n (putMVar s 0) takeMVar f -- vim: ts=4 ft=haskell