
I was writing some parallel code (asynchronous database writes for an event logger, but that's beside the point), and it seemed like the parallelized version (i.e. compiled with -threaded -with-rtsopts=-N2) wasn't running fast enough. I boiled it down to a dead-simple test:

    import Control.Concurrent
    import Data.Time.Clock.POSIX
    import System.Environment

    main :: IO ()
    main = do
        n <- getArgs >>= return . read . head
        t1 <- getPOSIXTime
        work n
        t2 <- getPOSIXTime
        putStrLn $ show $ t2 - t1
        putStrLn $ show $ (fromIntegral n :: Double) / (fromRational . toRational $ t2 - t1)

    work :: Integer -> IO ()
    work n = do
        forkIO $ putStrLn $ seq (fact n) "Done"
        putStrLn $ seq (fact n) "Done"

    fact :: Integer -> Integer
    fact 1 = 1
    fact n = n * fact (n - 1)

(I know this is not the best way to time things, but I think it suffices for this test.)

Compiled with ghc --make -O3 test.hs, ./test 500000 runs for 74 seconds. Compiled with ghc --make -O3 -threaded -with-rtsopts=-N, ./test 500000 runs for 82 seconds (and seems to be using 2 CPU cores instead of just 1, on a 4-core machine). What gives?

Mike S Craig
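One detail worth noting about the test: the forkIO there is fire-and-forget, so nothing guarantees the forked thread has finished before main exits. A minimal joinable variant, using an MVar as a completion signal (the MVar plumbing is my addition, not part of the original test):

```haskell
import Control.Concurrent

fact :: Integer -> Integer
fact 1 = 1
fact n = n * fact (n - 1)

work :: Integer -> IO ()
work n = do
  done <- newEmptyMVar
  _ <- forkIO $ do
    putStrLn (seq (fact n) "Done")
    putMVar done ()        -- signal completion to the main thread
  putStrLn (seq (fact n) "Done")
  takeMVar done            -- wait for the forked thread to finish

main :: IO ()
main = work 1000
```

This way the measured interval covers both computations even if the forked thread is the slower one.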

Sounds like you're paying a 2x cost for loading up the threaded runtime (compare -threaded with -N1 against no flags). I'm not really sure why; it looks like you're getting killed by GC. Are you sure you want to be doing factorial on Integers?

Edward

Excerpts from Michael Craig's message of Thu Dec 01 00:50:15 -0500 2011:

With regards to your original concurrent code (asynchronous database writes): if the API given to you truly is asynchronous (i.e. it's a file descriptor that could be monitored with epoll/kqueue/etc.), consider integrating it with the IO manager, so that you don't need to tie up real OS threads on blocking FFI calls (though I'm not sure what database or access mechanism you're using). You really shouldn't need the threaded runtime for a task like this. Maybe if you give more details we can give more specific advice.

Edward

Excerpts from Michael Craig's message of Thu Dec 01 00:50:15 -0500 2011:
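Integrating with the IO manager is mostly a matter of blocking the lightweight thread, not an OS thread, until a descriptor is ready. A sketch, where readNow stands in for whatever (hypothetical) read call the driver actually exposes:

```haskell
import Control.Concurrent (threadWaitRead)
import System.Posix.Types (Fd)

-- Block only the *lightweight* Haskell thread until fd is readable;
-- the IO manager multiplexes many such waits onto one OS thread.
waitThenRead :: Fd -> (Fd -> IO a) -> IO a
waitThenRead fd readNow = do
  threadWaitRead fd   -- registers fd with the IO manager (epoll/kqueue)
  readNow fd          -- the descriptor is now readable
```

threadWaitRead returns as soon as the descriptor becomes readable, so the follow-up read should not block, and no OS thread was tied up in the meantime.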

Simon Marlow investigated, and we got this patch out:
commit 6d18141d880d55958c3392f6a7ae621dc33ee5c1
Author: Simon Marlow

Excellent! Glad this has been sorted out upstream.

Edward, to answer your question regarding blocking database calls: I'm using MongoDB to log events that come into a WAI webapp. The writes to Mongo are blocking, so I'd like to run them in parallel with the webapp. The webapp would push the data into a Chan and the Mongo writer would read from the Chan and make the writes sequentially (using the Chan as a FIFO between parallel threads). This would allow request rates to temporarily rise above Mongo's write rate (of course with an expanded memory footprint during those bursts).

Mike S Craig

On Thu, Dec 1, 2011 at 12:10 PM, Felipe Almeida Lessa <felipe.lessa@gmail.com> wrote:
On Thu, Dec 1, 2011 at 2:40 PM, Edward Z. Yang
wrote: Simon Marlow investigated, and we got this patch out:
Nice work, guys! Hope it gets included in the glorious GHC 7.4 =D.
Cheers,
-- Felipe.
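The Chan-as-FIFO design Mike describes (webapp handlers enqueue, one sequential writer drains) can be sketched roughly like this; writeToMongo is a hypothetical stand-in for the real blocking driver call:

```haskell
import Control.Concurrent
import Control.Monad (forever)

-- Hypothetical stand-in for the real (blocking) MongoDB write.
writeToMongo :: String -> IO ()
writeToMongo ev = putStrLn ("wrote: " ++ ev)

-- The single sequential writer: drain the FIFO forever.
logWriter :: Chan String -> IO ()
logWriter chan = forever (readChan chan >>= writeToMongo)

main :: IO ()
main = do
  chan <- newChan
  _ <- forkIO (logWriter chan)                  -- consumer thread
  mapM_ (writeChan chan) ["ev1", "ev2", "ev3"]  -- producers (webapp handlers)
  threadDelay 100000                            -- crude: let the writer drain
```

Because Chan is unbounded, bursts of requests simply grow the queue (and the heap); a real deployment would want some bound or back-pressure.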

OK. A common mistake when using channels is forgetting to make sure all of the information is fully evaluated before it goes into the channel, which then causes the writer thread to spend all its time evaluating thunks.

Edward

Excerpts from Michael Craig's message of Thu Dec 01 13:17:57 -0500 2011:
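Concretely, one way to make sure the payload is in normal form before it hits the channel is to force it on the producer side. A sketch using the deepseq package (a GHC boot library; String already has an NFData instance):

```haskell
import Control.Concurrent
import Control.DeepSeq (force)
import Control.Exception (evaluate)

-- Deep-evaluate the event in *this* thread before enqueueing it,
-- so the consumer never pays for the producer's thunks.
pushEvent :: Chan String -> String -> IO ()
pushEvent chan ev = do
  ev' <- evaluate (force ev)   -- evaluate to normal form, now
  writeChan chan ev'

main :: IO ()
main = do
  chan <- newChan
  pushEvent chan (replicate 3 'x')
  readChan chan >>= putStrLn   -- prints xxx
```

evaluate is what pins the forcing to this point in IO; force on its own only builds a thunk that, when demanded, evaluates its argument deeply.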

Thanks, I'll keep that in mind.

I wrote a better example of what I'm trying to do: https://gist.github.com/1420742 It runs worse with multiple threads than it does with one (both time-wise and memory-wise), I think due to the bug Simon described in that commit.

But that bug aside, is it possible to write a multithreaded application in Haskell that sends large amounts of data from one thread to the other without getting crushed by GC? I've looked into the garbage collector a bit, and this seems problematic.
Mike S Craig

On 02/12/2011 00:08, Michael Craig wrote:
Thanks, I'll keep that in mind.
I wrote a better example of what I'm trying to do: https://gist.github.com/1420742 It runs worse with multiple threads than it does with one (both time-wise and memory-wise), I think due to the bug Simon described in that commit.
To be clear, the fix was for the problem where your program ran much slower with -threaded than without. It probably won't have much effect on scaling. In the program you linked above, I'm not terribly surprised if it doesn't scale well: it is basically a communication benchmark, and those perform best when both threads run on a single core. In fact, I would go so far as to pin them both to the same core with forkOnIO. Perhaps you might expect that, since the channel is unbounded, the writer thread should be able to produce data in batches that is then slurped up by the reader in batches. Sure, but when running on two cores the data has to be moved from the cache of one core to the other, and that's a killer. Much better to keep it in the cache of one core.
But that bug aside, is it possible to write a multithreaded application in Haskell that sends large amounts of data from one thread to the other without getting crushed by GC? I've looked into the garbage collector a bit, and this seems problematic.
I don't think it's so much the GC as it is the cost of moving data across the memory bus. Communication should be nice and fast if you keep it all on one core. If I'm wrong, please send me the code and I'll look into it!

Cheers,
Simon
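Simon's pinning suggestion looks roughly like this with the current API (forkOn is the later name for the forkOnIO mentioned above; the 1000-element workload is just for illustration):

```haskell
import Control.Concurrent

main :: IO ()
main = do
  chan <- newChan
  done <- newEmptyMVar
  -- Pin consumer and producer to the same capability (0), so the
  -- channel traffic stays in one core's cache.
  _ <- forkOn 0 $ do
    xs <- mapM (const (readChan chan)) [1 .. 1000 :: Int]
    putMVar done (sum xs)
  _ <- forkOn 0 $
    mapM_ (writeChan chan) [1 .. 1000 :: Int]
  total <- takeMVar done
  print total   -- prints 500500
```

Compile with -threaded and run with +RTS -N to have multiple capabilities at all; forkOn then keeps both threads on capability 0 regardless of how many others exist.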
participants (4):
- Edward Z. Yang
- Felipe Almeida Lessa
- Michael Craig
- Simon Marlow