SMP parallelism gains lower than expected

Hi,

I have a program that computes a matrix of Floats of m rows by n columns. Computing each Float is relatively expensive. Each row is completely independent of the others, so I thought I'd try some simple SMP parallelism on this code:

    myFun :: FilePath -> IO ()
    myFun fp = do
      fs <- readDataDir fp
      let process f = readFile' f >>= parse
          printLine = putStrLn . foldr (\a b -> show a ++ "\t" ++ b) ""
          runDiff l = [ [ diff x y | y <- l ]
                      | (x,i) <- zip l (map getId fs), myFilter i ]
      ps <- mapM process fs
      sequence_ [ printLine x | x <- runDiff ps `using` parList rdeepseq ]

So, I'm using parList to evaluate the rows in parallel, and fully evaluating each row. Here are the timings on a dual quad-core AMD 2378 @ 2.4 GHz, ghc-6.12.3, parallel-2.2.0.1:

    -N      time (m:ss)
    none    1:50
    2       1:33
    3       1:35
    4       1:22
    5       1:11
    6       1:06
    7       1:45

The increase at -N7 is explained by the fact that there were two other processes running. I don't know how to explain the small increase at -N3, though, but that doesn't matter too much. The problem is that I am not getting the gains I expected (half the time at -N2, a third at -N3, etc.). Is this the best one can achieve with this implicit parallelism, or am I doing something wrong? In particular, is the way I'm printing the results at the end destroying potential parallel gains?

Any insights on this are appreciated.

Thanks,
Pedro
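A minimal, self-contained version of the same parList rdeepseq idiom is sketched below. The expensive function and the input sizes are made up for illustration (the real diff/parse work isn't shown in the post); the point is that forcing the list of rows creates one spark per row, so rows can be evaluated in parallel while the main thread prints them in order.

    import Control.Parallel.Strategies (parList, rdeepseq, using)

    -- Hypothetical stand-in for the expensive per-cell computation.
    expensive :: Float -> Float -> Float
    expensive x y = sum [ sin (x * k) * cos (y / k) | k <- [1 .. 5000] ]

    -- Build (recent GHC): ghc -O2 -threaded -rtsopts Example.hs
    -- Run:                ./Example +RTS -N4
    main :: IO ()
    main = do
      let xs   = [1 .. 200] :: [Float]
          rows = [ [ expensive x y | y <- xs ] | x <- xs ]
                   `using` parList rdeepseq  -- one spark per row
      -- Demanding rows here forces the list spine and fires the sparks;
      -- later rows are typically evaluated on other capabilities while
      -- earlier ones are still being printed.
      mapM_ (putStrLn . unwords . map show) rows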


On 24/02/2011 13:26, José Pedro Magalhães wrote:
> [...]
I haven't investigated your program in detail, but here's how I would go about analysing it:

- First, compile with -eventlog, run with +RTS -ls, and analyse the result with ThreadScope. This will tell you whether the program is actually using more than one core, and for what portion of its runtime. Often you find that it is sequential for a large fraction of the execution time, such as while reading the input or writing the output.

- Run with +RTS -s, and check the SPARKS figure. Make sure you aren't generating too many (more than 4096 will be discarded) or too few (not enough parallelism).

- Generating large amounts of live data in the heap will tend to hurt parallelism, because GC doesn't parallelise well, especially when the heap data is linear (i.e. lists rather than trees). Look for a lot of time being spent in GC, and a figure for "balance" significantly less than the -N value.

Cheers,
Simon
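To make the spark-count point concrete: with one spark per row, a large m can overflow the spark pool, and parListChunk (also in Control.Parallel.Strategies) creates one coarser spark per chunk of rows instead. A small sketch; the chunk size of 16 is an arbitrary starting value to tune, not a figure from the thread:

    import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

    -- One spark per 16 rows keeps the spark count well below the
    -- pool limit even for large inputs, at coarser granularity.
    forceRowsInChunks :: [[Float]] -> [[Float]]
    forceRowsInChunks rows = rows `using` parListChunk 16 rdeepseq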
participants (2)

- José Pedro Magalhães
- Simon Marlow