Strange parallel behaviour with Ubuntu Karmic / GHC 6.10.4

Hello,

I'm currently developing some applications with explicit threading using forkIO and have strange behaviour on my freshly installed Ubuntu Karmic 9.10 (Kernel 2.6.31-14 SMP).

Setup:
- Machine A: Quadcore, Ubuntu 9.04, Kernel 2.6.28-13 SMP
- Machine B: AMD Opteron 875, 8 cores, Kernel 2.6.18-164 SMP (some Red Hat)
- Machine C: Dual-core, Ubuntu 9.10, Kernel 2.6.31-14 SMP

Compiler on all machines: GHC 6.10.4 (downloaded from GHC's official website).

Program, compilation, execution: a simple task queue with independent tasks and explicit parallelization (hence it should deliver more or less perfect speedup). For one core, wall times around 16 seconds are OK; for two, a bit more than 8 seconds. Since I used the same sources and Makefiles on all machines, all files were compiled with -threaded and started with +RTS -N2 -RTS.

Testing:
- Machine A: OK (meaning it works and delivers the expected speedup)
- Machine B: OK
- Machine C: Not OK (with -N2, wall times around 14-15 seconds)

Looking at the core usage, for example with htop, I see that the second core is not really used on C. Executing OpenMP programs shows the expected speedup and usage of both cores, so I do not think it's a general Linux configuration problem.

So, after all the testing, I think it's either the Linux kernel or some other component of Ubuntu 9.10. But Ubuntu is widely used and I did not find any information regarding this problem. The simple solution of installing the old version of Ubuntu would probably help, but that should not be the way to go, should it?

I'd be glad for any hints or comments,
Michael

--
Dipl.-Inf. Michael C. Lesniak
University of Kassel
Programming Languages / Methodologies Research Group
Department of Computer Science and Electrical Engineering
Wilhelmshöher Allee 73
34121 Kassel
Phone: +49-(0)561-804-6269

Michael Lesniak wrote:
Hello,
I'm currently developing some applications with explicit threading using forkIO and have strange behaviour on my freshly installed Ubuntu Karmic 9.10 (Kernel 2.6.31-14 SMP).
Setup:
- Machine A: Quadcore, Ubuntu 9.04, Kernel 2.6.28-13 SMP
- Machine B: AMD Opteron 875, 8 cores, Kernel 2.6.18-164 SMP (some Red Hat)
- Machine C: Dual-core, Ubuntu 9.10, Kernel 2.6.31-14 SMP

Compiler on all machines: GHC 6.10.4 (downloaded from GHC's official website)
Hi, I have a dual-core Ubuntu 9.10 machine (running whatever GHC comes with the distro -- 6.10.x), so if you put your test code somewhere that I can get at, I can run it and see if I get the same effect. Thanks, Neil.

Hello,

I've written a smaller example which reproduces the unusual behaviour. Should I open a GHC ticket, too?

-- A small working example which describes the problems (I have) with GHC
-- 6.10.4, Ubuntu Karmic 9.10, explicit threading and core usage.
--
-- See http://www.haskell.org/pipermail/haskell-cafe/2009-November/069144.html
-- for the general description of the problem.
--
-- For comparison, compilation on both machines with
--
--   ghc --make -O2 -threaded Example.hs -o e -Wall
--
-- 1. Machine A: (Quadcore, Ubuntu 9.04)
--    a. With 1 thread:
--       time e +RTS -N1 -RTS 16
--       e +RTS -N1 -RTS 16  11,00s user 5,00s system 100% cpu 16,004 total
--
--    b. With 2 threads:
--       time e +RTS -N2 -RTS 16
--       e +RTS -N2 -RTS 16  11,44s user 4,58s system 197% cpu 8,102 total
--
-- 2. Machine C: (Dualcore, Ubuntu 9.10)
--    a. With 1 thread:
--       time e +RTS -N1 -RTS 16
--
--       real 0m16.414s
--       user 0m11.360s
--       sys  0m4.650s
--
--    b. With 2 threads:
--       time e +RTS -N2 -RTS 16
--
--       real 0m18.484s
--       user 0m14.320s
--       sys  0m5.940s
-------------------------------------------------------------------------------
module Main where

import GHC.Conc
import Control.Concurrent
import Control.Monad
import System.Posix.Clock
import System.Environment

-------------------------------------------------------------------------------
main :: IO ()
main = do
    -- Configuration
    args <- getArgs
    let threads = numCapabilities      -- number of threads determined by -N<...>
        taskDur = 1.0                  -- seconds each task takes
        taskNum = (read . head) args   -- number of tasks is the 1st parameter

    -- Generate a channel for the tasks to do and fill it with uniform and
    -- independent tasks. The other channel receives a message for each task
    -- which is finished.
    queue    <- newChan
    finished <- newChan
    writeList2Chan queue (replicate taskNum taskDur)

    -- Fork threads
    replicateM_ threads (forkIO (thread queue finished))

    -- Wait until the queue is empty
    replicateM_ taskNum (readChan finished)

-------------------------------------------------------------------------------
thread :: Chan Double -> Chan Int -> IO ()
thread queue finished = forever $ do
    task <- readChan queue
    workFor task
    writeChan finished 1

-------------------------------------------------------------------------------
-- | Generates work for @s@ seconds.
workFor :: Double -> IO ()
workFor s = do
    now <- getTime ThreadCPUTime
    loop (time2Double now + s)
  where
    -- Named 'loop' to avoid shadowing Prelude's 'repeat'.
    loop fs = do
        now <- nSqrt 10000 `pseq` getTime ThreadCPUTime
        let f = time2Double now
        unless (f >= fs) $ loop fs

    time2Double t = fromIntegral (sec t) + (fromIntegral (nsec t) / 1000000000)

    -- Calculates the sqrt of 2^1000. The parameter n is to ensure
    -- that GHC does not optimize it away.
    -- (In fact, I'm not sure this is needed...)
    nSqrt n = let sqs = map (\_ -> iterate sqrt (2^1000) !! 50) [1..n]
              in  foldr seq 1 sqs

Michael Lesniak wrote:
Hello,
I've written a smaller example which reproduces the unusual behaviour. Should I open a GHC ticket, too?
Hi,

I get these results:

$ time ./Temp +RTS -N1 -RTS 16

real 0m16.010s
user 0m10.869s
sys  0m5.144s

$ time ./Temp +RTS -N2 -RTS 16

real 0m12.794s
user 0m13.341s
sys  0m7.136s

Looking at top, the second version used ~160% CPU time (i.e. it was using both cores fairly well), so I don't think I get the same bad behaviour as you.

Those sys times look high, by the way -- I guess it's all the calls to getTime? I wonder if that might be causing the problem; can you replicate it with lower sys times?

Thanks, Neil.
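One way to get lower sys times, as a sketch: poll the clock only after every chunk of work instead of on every iteration. In this sketch, getCPUTime from base stands in for the clock package's getTime ThreadCPUTime (note it measures process CPU time, not per-thread time), and the chunk size of 1000 is a made-up tuning constant.

```haskell
import System.CPUTime (getCPUTime)

-- Burn n iterations of floating-point work, strictly, so GHC
-- cannot defer the computation as a thunk.
burn :: Int -> Double -> Double
burn 0 acc = acc
burn n acc = let acc' = sqrt (acc + 1) in acc' `seq` burn (n - 1) acc'

-- Busy-wait for roughly s seconds of CPU time, polling the clock
-- only once per 1000-iteration chunk instead of every iteration,
-- which cuts the number of timing syscalls by three orders of
-- magnitude.
workFor :: Double -> IO ()
workFor s = do
    start <- getCPUTime
    let deadline = start + round (s * 1e12)  -- getCPUTime is in picoseconds
        loop x = do
            let x' = burn 1000 x
            now <- x' `seq` getCPUTime
            if now >= deadline then return () else loop x'
    loop 2

main :: IO ()
main = do
    workFor 0.05
    putStrLn "done"
```

The trade-off is granularity: the loop may overshoot the deadline by up to one chunk's worth of work, which is harmless for a benchmark workload like this one.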

Hello,
getTime? I wonder if that number might be causing the problem; can you replicate it with lower sys times?

That was it! Thanks Neil!
When I'm using some number crunching without getTime, it works (with more or less the expected speedup and usage of two cores) on my Ubuntu 9.10, too.

Out of curiosity, the question is still open: why does the old example (using getTime) work so much better on an older version of Ubuntu/Red Hat than on the new ones?

Kind regards,
Michael
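A minimal sketch of such a getTime-free workload: each forked thread does pure, strict number crunching and reports on a channel, with no timing syscalls in the hot loop. The task count and iteration count here are made-up values.

```haskell
import Control.Concurrent
import Control.Monad

-- Pure, strict number crunching with no timing syscalls at all.
crunch :: Int -> Double
crunch = go 2
  where
    go acc 0 = acc
    go acc n = let acc' = sqrt (acc + 1) in acc' `seq` go acc' (n - 1)

main :: IO ()
main = do
    done <- newChan
    let tasks = 4 :: Int
    -- Fork one worker per task; compile with -threaded and run with
    -- +RTS -N2 -RTS to spread the workers over both cores.
    forM_ [1 .. tasks] $ \_ ->
        forkIO $ let r = crunch 2000000 in r `seq` writeChan done r
    -- Wait for every worker to finish.
    rs <- replicateM tasks (readChan done)
    print (length rs)
```

Since the workers never enter the kernel between channel operations, any failure to use both cores with this version would point at the RTS rather than at the kernel's timer path.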

Michael Lesniak wrote:
Hello,
getTime? I wonder if that number might be causing the problem; can you replicate it with lower sys times?
That was it! Thanks Neil!
When I'm using some number crunching without getTime it works (with more or less the expected speedup and usage of two cores) on my Ubuntu 9.10, too.
Out of curiosity, the question is still open: Why does the old example (using getTime) work so much better on an older version of Ubuntu/RedHat and not on the new ones?
Your kernels were:

Setup:
- Machine A: Quadcore, Ubuntu 9.04, Kernel 2.6.28-13 SMP
- Machine B: AMD Opteron 875, 8 cores, Kernel 2.6.18-164 SMP (some Red Hat)
- Machine C: Dual-core, Ubuntu 9.10, Kernel 2.6.31-14 SMP

Looking at the implementation of getTime ThreadCPUTime in the clock package, it calls clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...). According to this page (http://www.h-online.com/open/news/item/Kernel-Log-What-s-new-in-2-6-29-Part-...), the changes in 2.6.29 (changes which only your Ubuntu 9.10 machine has) included a patch (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=...) which altered the implementation of that function.

Perhaps on some multi-processor machines the new implementation effectively serialises the code? I know there used to be issues of whether some of the timers were synchronised across processors/cores (to stop them appearing to go backwards), so maybe something with the timers and their synchronisation effectively stops your program running in parallel.

If it helps, my machine is "Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz" according to /proc/cpuinfo.

Thanks, Neil.
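To compare the kernels directly, one could time the per-call cost of that syscall from Haskell. A sketch via the FFI -- the clock id value 3 (CLOCK_THREAD_CPUTIME_ID on Linux) and the 16-byte timespec buffer are Linux-specific assumptions, not portable constants:

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Control.Monad (replicateM_)
import Foreign (allocaBytes, Ptr)
import Foreign.C.Types (CInt(..))
import System.CPUTime (getCPUTime)

-- int clock_gettime(clockid_t clk_id, struct timespec *tp);
foreign import ccall unsafe "time.h clock_gettime"
    c_clock_gettime :: CInt -> Ptr a -> IO CInt

-- CLOCK_THREAD_CPUTIME_ID is 3 on Linux (an assumption; check
-- <bits/time.h> on the machine under test).
clockThreadCpuTimeId :: CInt
clockThreadCpuTimeId = 3

main :: IO ()
main =
    allocaBytes 16 $ \ts -> do   -- room for a struct timespec
        start <- getCPUTime
        replicateM_ 100000 (c_clock_gettime clockThreadCpuTimeId ts)
        end <- getCPUTime
        -- getCPUTime is in picoseconds; report nanoseconds per call.
        putStrLn $ "ns/call: " ++ show ((end - start) `div` (100000 * 1000))
```

Running this on the 2.6.28 and 2.6.31 machines would show whether the per-call cost (or its behaviour under contention from two threads) changed across the patch Neil found.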
participants (2)
- Michael Lesniak
- Neil Brown