
Hello all, I decided to put together an entry for one of the shooutout categories: http://haskell.org/hawiki/ChameneosEntry It involves 4 threads. There is no current Haskell entry. Anyone have comments? Alterations? I can't have created the best code yet. I don't have the compilers for the other entries, so I can't compare speed on my powerbook (which is the wrong kind of CPU anyway -- they test on an AMD system). -- Chris

--- Chris Kuklewicz
which is the wrong kind of CPU anyway -- they test on an AMD system
"What machine are you running the programs on?" http://shootout.alioth.debian.org/gp4/faq.php#machine __________________________________ Yahoo! for Good - Make a difference this year. http://brand.yahoo.com/cybergivingweek2005/

Hello, I have posted two faster versions to http://haskell.org/hawiki/ChameneosEntry The faster of the two uses an extra manager thread, the slower uses STM. -- Chris Chris Kuklewicz wrote:
Hello all,
I decided to put together an entry for one of the shooutout categories:
http://haskell.org/hawiki/ChameneosEntry
It involves 4 threads. There is no current Haskell entry.
Anyone have comments? Alterations? I can't have created the best code yet.
I don't have the compilers for the other entries, so I can't compare speed on my powerbook (which is the wrong kind of CPU anyway -- they test on an AMD system).

Hello, Einar Kartunen sped up the code using a custom channel implementation. This increased speed by a factor of over 2. The wiki at http://haskell.org/hawiki/ChameneosEntry has the latest version. This makes me ponder one of the things that Joel was trying to do: efficiently pass data to a logging thread. It may be that a custom channel would be helpful for that as well. -- Chris Kuklewicz

Hello Chris, Tuesday, January 03, 2006, 12:20:26 AM, you wrote: CK> Einar Kartunen sped up the code using a custom channel implementation. CK> This increased speed by a factor of over 2. The wiki at CK> http://haskell.org/hawiki/ChameneosEntry has the latest version. can these channels be used in general-purpose code? CK> This makes me ponder one of the things that Joel was trying to do: CK> efficiently pass data to a logging thread. It may be that a custom CK> channel would be helpful for that as well. last variant of his code used just MVar-protected direct hPutStr operations -- Best regards, Bulat mailto:bulatz@HotPOP.com

Bulat Ziganshin wrote:
Hello Chris,
Tuesday, January 03, 2006, 12:20:26 AM, you wrote:
CK> Einar Kartunen sped up the code using a custom channel implementation. CK> This increased speed by a factor of over 2. The wiki at CK> http://haskell.org/hawiki/ChameneosEntry has the latest version.
can these channels be used in general-purpose code?
The latest Ch code is very very short:
{- Ch : fast unordered channel implementation -} newtype Ch a = Ch (MVar [a], MVar a)
newCh = liftM2 (,) (newMVar []) newEmptyMVar >>= return.Ch
readCh (Ch (w,r)) = takeMVar w >>= \lst -> case lst of (x:xs) -> putMVar w xs >> return x [] -> putMVar w [] >> takeMVar r
writeCh (Ch (w,r)) x = do ok <- tryPutMVar r x -- opportunistic, helps for this problem unless ok $ takeMVar w >>= \lst -> do ok <- tryPutMVar r x -- safe inside take/put putMVar w $ if ok then lst else (x:lst)
It could be used in general purpose code, note the parametric type "a" in "Ch a". It makes absolutely no guarantees about the order of values. That means that the order they are written is unlikely to be the order in which they are read. Writes to the channel are non-blocking and the "MVar [a]" holds some items waiting to be read (in LIFO order). The "MVar a" allows a reader to block and wait for an empty channel to get an item. A small amount of extra speed comes from the writer's "opportunistic" attempt to not take the w MVar unless it needs to. But note that readCh always takes the w MVar, and can ignore the r MVar. This choice was determined by benchmarking. Alternative, but slower for this case, functions for readCh and writeCh are
readCh' (Ch (w,r)) = do mx <- tryTakeMVar r case mx of Just x -> return x Nothing -> takeMVar w >>= \lst -> case lst of (x:xs) -> putMVar w xs >> return x [] -> putMVar w [] >> takeMVar r
writeCh' (Ch (w,r)) x = takeMVar w >>= \lst -> do ok <- tryPutMVar r x -- safe inside take/put putMVar w $ if ok then lst else (x:lst)
But in this instance, using either of these would be slower. The balance between readers (one here) and writers (four here) and their average speed is what determines the optimum readCh/writeCh code. Different usage would benefit from different choices.
CK> This makes me ponder one of the things that Joel was trying to do: CK> efficiently pass data to a logging thread. It may be that a custom CK> channel would be helpful for that as well.
last variant of his code used just MVar-protected direct hPutStr operations
My point was more that Haskell allows you to make your own channels and that it is possible to do better than the provided ones.

It seems like the real difference between TChan and the Ch code below is that TChan is, basically, [TVar a] whereas Ch is MVar [a], plus the order is guaranteed for a TChan. Now why would it matter so much speed-wise? This is the CVS code. newTChanIO is exported but undocumented in GHC 6.4.1. I'm not sure what purpose it serves. -- | 'TChan' is an abstract type representing an unbounded FIFO channel. data TChan a = TChan (TVar (TVarList a)) (TVar (TVarList a)) type TVarList a = TVar (TList a) data TList a = TNil | TCons a (TVarList a) newTChan :: STM (TChan a) newTChan = do hole <- newTVar TNil read <- newTVar hole write <- newTVar hole return (TChan read write) newTChanIO :: IO (TChan a) newTChanIO = do hole <- newTVarIO TNil read <- newTVarIO hole write <- newTVarIO hole return (TChan read write) writeTChan :: TChan a -> a -> STM () writeTChan (TChan _read write) a = do listend <- readTVar write -- listend == TVar pointing to TNil new_listend <- newTVar TNil writeTVar listend (TCons a new_listend) writeTVar write new_listend readTChan :: TChan a -> STM a readTChan (TChan read _write) = do listhead <- readTVar read head <- readTVar listhead case head of TNil -> retry TCons a tail -> do writeTVar read tail return a On Jan 3, 2006, at 11:25 AM, Chris Kuklewicz wrote:
The latest Ch code is very very short:
{- Ch : fast unordered channel implementation -} newtype Ch a = Ch (MVar [a], MVar a)
newCh = liftM2 (,) (newMVar []) newEmptyMVar >>= return.Ch
readCh (Ch (w,r)) = takeMVar w >>= \lst -> case lst of (x:xs) -> putMVar w xs >> return x [] -> putMVar w [] >> takeMVar r
writeCh (Ch (w,r)) x = do ok <- tryPutMVar r x -- opportunistic, helps for this problem unless ok $ takeMVar w >>= \lst -> do ok <- tryPutMVar r x -- safe inside take/put putMVar w $ if ok then lst else (x:lst)
It could be used in general purpose code, note the parametric type "a" in "Ch a". It makes absolutely no guarantees about the order of values. That means that the order they are written is unlikely to be the order in which they are read. Writes to the channel are non-blocking and the "MVar [a]" holds some items waiting to be read (in LIFO order).

Joel Reymont wrote:
It seems like the real difference between TChan and the Ch code below is that TChan is, basically, [TVar a] whereas Ch is MVar [a], plus the order is guaranteed for a TChan.
Now why would it matter so much speed-wise?
STM* is usually slower than IO/MVar. STM has to do the transactional record keeping and throws away work (i.e. CPU cycles and speed) when it aborts. The Chameneos benchmark has 4 writers working *very* quickly, so the contention is high. Taking the MVar acts like a mutex to serialize access without throwing away work. The [a] is a LIFO stack; pushing and popping the front element is fast. Also, I suspect the following may be true: If 4 threads block trying to take an MVar, and a 5th thread puts a value in the MVar, then *exactly one* of the 4 blocked threads is woken up and scheduled to run. If 4 threads retry while in STM (e.g. takeTMVar), and a 5th thread commits a change to that TMVar, then *all 4 threads* are woken up and rescheduled. This is what the Apache httpd server folks called the "thundering herd" problem when many processes are blocked waiting on socket 80. If 4000 threads were getting woken up when only 1 was needed, then performance would be poor. Certainly I found 4 writer thread and STM to be much slower for this shootout problem than the Einar's custom MVar channel. Could someone who knows the STM implementation comment on this?
This is the CVS code. newTChanIO is exported but undocumented in GHC 6.4.1. I'm not sure what purpose it serves.
I get tired of writing "do tv <- atomically $ newTVar foo" I bet this is just shorthand: "do tv <- newTVarIO foo" Same for newTChanIO. [ Of course, a type class could be used instead. But that seems tohave been judged overkill ]
-- | 'TChan' is an abstract type representing an unbounded FIFO channel. data TChan a = TChan (TVar (TVarList a)) (TVar (TVarList a))
type TVarList a = TVar (TList a) data TList a = TNil | TCons a (TVarList a)
newTChan :: STM (TChan a) newTChan = do hole <- newTVar TNil read <- newTVar hole write <- newTVar hole return (TChan read write)
newTChanIO :: IO (TChan a) newTChanIO = do hole <- newTVarIO TNil read <- newTVarIO hole write <- newTVarIO hole return (TChan read write)
writeTChan :: TChan a -> a -> STM () writeTChan (TChan _read write) a = do listend <- readTVar write -- listend == TVar pointing to TNil new_listend <- newTVar TNil writeTVar listend (TCons a new_listend) writeTVar write new_listend
readTChan :: TChan a -> STM a readTChan (TChan read _write) = do listhead <- readTVar read head <- readTVar listhead case head of TNil -> retry TCons a tail -> do writeTVar read tail return a
On Jan 3, 2006, at 11:25 AM, Chris Kuklewicz wrote:
The latest Ch code is very very short:
{- Ch : fast unordered channel implementation -} newtype Ch a = Ch (MVar [a], MVar a)
newCh = liftM2 (,) (newMVar []) newEmptyMVar >>= return.Ch
readCh (Ch (w,r)) = takeMVar w >>= \lst -> case lst of (x:xs) -> putMVar w xs >> return x [] -> putMVar w [] >> takeMVar r
writeCh (Ch (w,r)) x = do ok <- tryPutMVar r x -- opportunistic, helps for this problem unless ok $ takeMVar w >>= \lst -> do ok <- tryPutMVar r x -- safe inside take/put putMVar w $ if ok then lst else (x:lst)
It could be used in general purpose code, note the parametric type "a" in "Ch a". It makes absolutely no guarantees about the order of values. That means that the order they are written is unlikely to be the order in which they are read. Writes to the channel are non-blocking and the "MVar [a]" holds some items waiting to be read (in LIFO order).

On Jan 2, 2006, at 9:20 PM, Chris Kuklewicz wrote:
This makes me ponder one of the things that Joel was trying to do: efficiently pass data to a logging thread. It may be that a custom channel would be helpful for that as well.
I have not taken the time to analyze the Chameneos code but need to point out that my problem was not with efficiently passing data to the logging thread. The issue was with data accumulating in the channel and the logger thread not reading it out fast enough. The TChan implementation is a single-linked list implemented on top of TVar's. That would seem pretty efficient to me. Thanks, Joel -- http://wagerlabs.com/

On Tue, Jan 03, 2006 at 12:07:43AM +0000, Joel Reymont wrote:
On Jan 2, 2006, at 9:20 PM, Chris Kuklewicz wrote:
This makes me ponder one of the things that Joel was trying to do: efficiently pass data to a logging thread. It may be that a custom channel would be helpful for that as well.
I have not taken the time to analyze the Chameneos code but need to point out that my problem was not with efficiently passing data to the logging thread. The issue was with data accumulating in the channel and the logger thread not reading it out fast enough.
The TChan implementation is a single-linked list implemented on top of TVar's. That would seem pretty efficient to me.
It's simple and efficient but does nothing to prevent the channel from growing out of control. A slightly modified (custom) channel based on TChan, but enforcing a maximum size (blocking on insert if the channel is too full), probably would have solved the problem. I assume that Erlang either does that or increases the priority of threads with large event queues, or both. Thanks, Matt Harden

[ Deeply nested replies are starting to look similar to runListT $ runStateT $ runWriter .... ] matth@mindspring.com wrote:
On Tue, Jan 03, 2006 at 12:07:43AM +0000, Joel Reymont wrote:
On Jan 2, 2006, at 9:20 PM, Chris Kuklewicz wrote:
This makes me ponder one of the things that Joel was trying to do: efficiently pass data to a logging thread. It may be that a custom channel would be helpful for that as well.
I have not taken the time to analyze the Chameneos code but need to point out that my problem was not with efficiently passing data to the logging thread. The issue was with data accumulating in the channel and the logger thread not reading it out fast enough.
The TChan implementation is a single-linked list implemented on top of TVar's. That would seem pretty efficient to me.
It's simple and efficient but does nothing to prevent the channel from growing out of control. A slightly modified (custom) channel based on TChan, but enforcing a maximum size (blocking on insert if the channel is too full), probably would have solved the problem.
I assume that Erlang either does that or increases the priority of threads with large event queues, or both.
Thanks, Matt Harden
Given that actually controlling priorities is not an option, adding blocking like that makes sense. One can make a ring buffer instead of a singly linked list very easily. In fact, I have that code lying around (now attached). It has not been speed optimized, but I did like being able to express:
type Node a = [TMVar a]
make :: (Integral k) => k -> STM (Node a) make k = liftM cycle $ sequence $ genericReplicate k newEmptyTMVar
It has the usual operations, but you need to pass a fixed size to new/newEmpty and you also have an isFull test. It has no operations to resize the ring buffer created by "make". module ProdCons (PC,new,newEmpty, put,ProdCons.take,ProdCons.read, tryPut,tryTake,tryRead, isEmpty,isFull) where {- Fixed bounded-buffer size solution of producer/consumer problem. Acts like a FIFO TMVar, blocking when capacity is reached. So a capacity of 1 behaves like a TMVar. For arbitrary capacity just use a TChan. -} import Control.Concurrent.STM import Control.Concurrent import Control.Monad.Fix import Control.Monad import Data.List(cycle,genericReplicate) type Node a = [TMVar a] newtype PC a = PC (TVar (Node a),TVar (Node a)) newEmpty :: (Integral k) => k -> IO (PC a) newEmpty k | k <=0 = error "Need capacity > 0" | otherwise = do node <- atomically $ make k atomically $ do tv1 <- newTVar node tv2 <- newTVar node return (PC (tv1,tv2)) new :: (Integral k) => k -> a -> IO (PC a) new k v | k <=0 = error "Need capacity > 0" v | otherwise = do pc <- newEmpty k atomically $ put pc v return pc put ::PC a -> a -> STM () put (PC (tvar,_)) value = do (tmvar:next) <- readTVar tvar putTMVar tmvar value writeTVar tvar next take :: PC a -> STM a take (PC (_,tvar)) = do (tmvar:next) <- readTVar tvar value <- takeTMVar tmvar writeTVar tvar next return value read :: PC a -> STM a read (PC (_,tvar)) = do (tmvar:_) <- readTVar tvar readTMVar tmvar tryTake :: PC a -> STM (Maybe a) tryTake pc = (ProdCons.take pc >>= return.Just) `orElse` (return Nothing) tryRead :: PC a -> STM (Maybe a) tryRead pc = (ProdCons.read pc >>= return.Just) `orElse` (return Nothing) tryPut :: PC a -> a -> STM Bool tryPut pc v = (put pc v >> return True) `orElse` (return False) isEmpty :: PC a -> STM Bool isEmpty (PC (_,tvar)) = do (tmvar:_) <- readTVar tvar isEmptyTMVar tmvar isFull (PC (tvar,_)) = do (tmvar:_) <- readTVar tvar empty <- isEmptyTMVar tmvar return (not empty) -- -- -- Internal -- -- -- make :: (Integral k) => k -> STM (Node a) make k = liftM cycle $ sequence $ genericReplicate k newEmptyTMVar
participants (5)
-
Bulat Ziganshin
-
Chris Kuklewicz
-
Isaac Gouy
-
Joel Reymont
-
matth@mindspring.com