[Haskell-cafe] Re: Optimizing a high-traffic network architecture

5 Jan 2006

      Bulat Ziganshin wrote:
...
Hello Simon,
Thursday, December 15, 2005, 4:53:27 PM, you wrote:
SM> The 3k threads are still GC'd, but they are not actually *copied* during
SM> GC.
SM> It'll increase the memory overhead per thread from 2k (1k * 2 for
SM> copying) to 4k (4k block, no overhead for copying).
Simon, why not to include this in the "base package"? either change
something so that a 1k-threads will be not copied during GC, or at
least increment default stack size? this will improve performance of
other hyper-threaded programs. memory expenses seems not so great
Because it doesn't always improve things.  This is a slightly modified 
version of the "cheap concurrency" benchmark from the shootout, first 
without tweaking -k:
...
./threads003 10000 +RTS -sstderr
./threads003 10000 +RTS -sstderr
  93,908,920 bytes allocated in the heap
159,724,208 bytes copied during GC (scavenged)
   1,559,376 bytes copied during GC (not scavenged)
  10,415,848 bytes maximum residency (4 sample(s))
177 collections in generation 0 (  1.05s)
           4 collections in generation 1 (  0.02s)

          21 Mb total memory in use

   INIT  time    0.00s  (  0.00s elapsed)
   MUT   time    1.28s  (  1.28s elapsed)
   GC    time    1.06s  (  1.09s elapsed)
   EXIT  time    0.00s  (  0.00s elapsed)
   Total time    2.35s  (  2.37s elapsed)

   %GC time      45.3%  (45.9% elapsed)

   Alloc rate    73,149,011 bytes per MUT second

   Productivity  54.7% of total user, 54.1% of total elapsed

and now tweaking -k (using -k6k, because this is a 64-bit machine and 
storage manager blocks are 8k):
...
./threads003 10000 +RTS -sstderr -k6k
./threads003 10000 +RTS -sstderr -k6k
168,837,736 bytes allocated in the heap
109,203,160 bytes copied during GC (scavenged)
   1,497,728 bytes copied during GC (not scavenged)
  71,180,464 bytes maximum residency (2 sample(s))
156 collections in generation 0 (  1.06s)
           2 collections in generation 1 (  0.01s)

          86 Mb total memory in use

   INIT  time    0.00s  (  0.00s elapsed)
   MUT   time    2.48s  (  2.58s elapsed)
   GC    time    1.08s  (  1.08s elapsed)
   EXIT  time    0.00s  (  0.00s elapsed)
   Total time    3.56s  (  3.65s elapsed)

   %GC time      30.3%  (29.5% elapsed)

   Alloc rate    68,007,748 bytes per MUT second

   Productivity  69.7% of total user, 67.9% of total elapsed

My hypothesis is that when we give each thread its own memory block, all 
the thread stacks occupy the same cache lines and we end up with a lot 
more cache misses (notice it's the MUT time that increased, not the GC 
time).

Test program attached, if anyone's interested in digging further.

Cheers,
	Simon

-- $Id: message-ghc-2.code,v 1.3 2005/09/17 04:36:26 bfulgham Exp $
-- The Great Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- Contributed by Einar Karttunen
-- Modified by Simon Marlow

-- This is the shootout "cheap concurrency" benchmark, modified
-- slightly.  Modification noted below (***) to add more concurrency
-- and make a speedup on multiple processors available.

-- Creates 500 threads arranged in a sequence where each takes a value
-- from the left, adds 1, and passes it to the right (via MVars).
-- N more threads pump zeros in at the left.  A sub-thread
-- takes N values from the right and sums them.
-- 

import Control.Concurrent
import Control.Monad
import System

thread :: MVar Int -> MVar Int -> IO ()
thread inp out = do x <- takeMVar inp; putMVar out $! x+1; thread inp out

spawn cur _ = do next <- newEmptyMVar
                 forkIO $ thread cur next
                 return next

main = do n <- getArgs >>= readIO.head
          s <- newEmptyMVar
          e <- foldM spawn s [1..500]
          f <- newEmptyMVar
          forkIO $ replicateM n (takeMVar e) >>= putMVar f . sum
          replicateM n (forkIO $ putMVar s 0)
-- ***    replicateM n (putMVar s 0)
          takeMVar f

-- vim: ts=4 ft=haskell