Problem parallelising simple algorithm

The following code simply computes the Euler constant by binary splitting. I'm struggling to get any speed-up on multi-core. Why do I get a fair number of fizzled? Why don't I get any real speed-up even if it claims to run tasks in parallel? Could I have exhausted the integer arithmetic unit(s) on my chip (i7-4790)? How would I verify that? The following is the code and typical runs on 1 and 4 cores respectively. --------- import Control.Parallel.Strategies import Control.DeepSeq import Data.Ratio divConq :: (NFData b) => (a -> b) -> a -> (a -> Bool) -> (b -> b -> b) -> (a -> Maybe (a,a)) -> b divConq f arg threshold conquer divide = go arg where go arg = case divide arg of Nothing -> f arg Just (l0,r0) -> conquer l1 r1 `using` strat where l1 = go l0 r1 = go r0 strat x = do r l1; r r1; return x where r | threshold arg = rdeepseq | otherwise = rpar pqCombine :: (Integer, Integer) -> (Integer, Integer) -> (Integer, Integer) pqCombine (pl, ql) (pr, qr) = (pl*qr+pr, ql*qr) pq :: (Integer, Integer) -> (Integer, Integer) pq (a, b) = (\t -> (sum t, last t)) $ scanl1 (*) [b,b-1..a+1] euler :: Integer -> Rational euler n = let (p,q) = divConq pq (0,n) (\(a,b) -> b-a < 10000) pqCombine (\(a,b) -> if b-a > 5 then let m = (a+b+1) `div` 2 in Just ((a,m), (m, b)) else Nothing) in p%q main = print $ euler 320000 `seq` () ----------
./BinSplit +RTS -s -N1 () 178,375,880 bytes allocated in the heap 2,452,040 bytes copied during GC 3,222,696 bytes maximum residency (7 sample(s)) 883,040 bytes maximum slop 11 MB total memory in use (2 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 333 colls, 0 par 0.004s 0.002s 0.0000s 0.0000s Gen 1 7 colls, 0 par 0.001s 0.001s 0.0001s 0.0003s TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1) SPARKS: 126 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 126 fizzled) INIT time 0.001s ( 0.000s elapsed) MUT time 0.928s ( 0.936s elapsed) GC time 0.005s ( 0.003s elapsed) EXIT time 0.001s ( 0.000s elapsed) Total time 0.935s ( 0.939s elapsed) Alloc rate 192,215,387 bytes per MUT second Productivity 99.4% of total user, 98.9% of total elapsed gc_alloc_block_sync: 0 whitehole_spin: 0 gen[0].sync: 0 gen[1].sync: 0
./BinSplit +RTS -s -N4 () 178,727,480 bytes allocated in the heap 3,506,488 bytes copied during GC 3,650,032 bytes maximum residency (7 sample(s)) 934,976 bytes maximum slop 12 MB total memory in use (1 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 141 colls, 141 par 0.009s 0.002s 0.0000s 0.0001s Gen 1 7 colls, 6 par 0.003s 0.001s 0.0001s 0.0001s Parallel GC work balance: 38.80% (serial 0%, perfect 100%) TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4) SPARKS: 126 (12 converted, 0 overflowed, 0 dud, 0 GC'd, 114 fizzled) INIT time 0.002s ( 0.002s elapsed) MUT time 2.104s ( 0.946s elapsed) GC time 0.012s ( 0.003s elapsed) EXIT time 0.001s ( 0.000s elapsed) Total time 2.119s ( 0.951s elapsed) Alloc rate 84,946,520 bytes per MUT second Productivity 99.3% of total user, 221.3% of total elapsed gc_alloc_block_sync: 600 whitehole_spin: 0 gen[0].sync: 9 gen[1].sync: 1073

So, I tried to use the Par monad instead of the Eval monad (only splitting
the work into two parts, see below).
I now get (with 1 and 4 cores):
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 0.994s ( 0.998s elapsed)
GC time 0.048s ( 0.051s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 1.042s ( 1.050s elapsed)
------------
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 1.083s ( 0.922s elapsed)
GC time 0.116s ( 0.038s elapsed)
EXIT time 0.001s ( 0.000s elapsed)
Total time 1.200s ( 0.961s elapsed)
So, some slight gain, but not a lot. Admittedly, the call to fromRational
seems to take a good chunk of time.
Any comments?
Jens
---------------
import Control.Monad.Par hiding (parMap)
import Data.Ratio
pqCombine :: (Integer, Integer) -> (Integer, Integer) -> (Integer, Integer)
pqCombine (pl, ql) (pr, qr) = (pl*qr+pr, ql*qr)
pq :: (Integer, Integer) -> (Integer, Integer)
pq (a, b)
| d > 5 = let m = (a+b+1) `div` 2
pql = pq (a, m)
pqr = pq (m, b)
in pqCombine pql pqr
| otherwise = (sum $ scanl1 (*) [b,b-1..a+1], product [a+1..b])
where d = b - a
main = print . flip seq () . (\(p,q) -> fromRational (p%q)) . runPar $ do
i <- new
j <- new
fork (put i (pq (0,160000)))
fork (put j (pq (160000,320000)))
a <- get i
b <- get j
return (pqCombine a b)
On Fri, 14 Aug 2015 at 10:31 Jens Blanck
The following code simply computes the Euler constant by binary splitting. I'm struggling to get any speed-up on multi-core.
Why do I get a fair number of fizzled? Why don't I get any real speed-up even if it claims to run tasks in parallel?
Could I have exhausted the integer arithmetic unit(s) on my chip (i7-4790)? How would I verify that?
The following is the code and typical runs on 1 and 4 cores respectively.
---------
import Control.Parallel.Strategies import Control.DeepSeq
import Data.Ratio
divConq :: (NFData b) => (a -> b) -> a -> (a -> Bool) -> (b -> b -> b) -> (a -> Maybe (a,a)) -> b divConq f arg threshold conquer divide = go arg where go arg = case divide arg of Nothing -> f arg Just (l0,r0) -> conquer l1 r1 `using` strat where l1 = go l0 r1 = go r0 strat x = do r l1; r r1; return x where r | threshold arg = rdeepseq | otherwise = rpar
pqCombine :: (Integer, Integer) -> (Integer, Integer) -> (Integer, Integer) pqCombine (pl, ql) (pr, qr) = (pl*qr+pr, ql*qr)
pq :: (Integer, Integer) -> (Integer, Integer) pq (a, b) = (\t -> (sum t, last t)) $ scanl1 (*) [b,b-1..a+1]
euler :: Integer -> Rational euler n = let (p,q) = divConq pq (0,n) (\(a,b) -> b-a < 10000) pqCombine (\(a,b) -> if b-a > 5 then let m = (a+b+1) `div` 2 in Just ((a,m), (m, b)) else Nothing) in p%q
main = print $ euler 320000 `seq` ()
----------
./BinSplit +RTS -s -N1 () 178,375,880 bytes allocated in the heap 2,452,040 bytes copied during GC 3,222,696 bytes maximum residency (7 sample(s)) 883,040 bytes maximum slop 11 MB total memory in use (2 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 333 colls, 0 par 0.004s 0.002s 0.0000s 0.0000s Gen 1 7 colls, 0 par 0.001s 0.001s 0.0001s 0.0003s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 126 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 126 fizzled)
INIT time 0.001s ( 0.000s elapsed) MUT time 0.928s ( 0.936s elapsed) GC time 0.005s ( 0.003s elapsed) EXIT time 0.001s ( 0.000s elapsed) Total time 0.935s ( 0.939s elapsed)
Alloc rate 192,215,387 bytes per MUT second
Productivity 99.4% of total user, 98.9% of total elapsed
gc_alloc_block_sync: 0 whitehole_spin: 0 gen[0].sync: 0 gen[1].sync: 0
./BinSplit +RTS -s -N4 () 178,727,480 bytes allocated in the heap 3,506,488 bytes copied during GC 3,650,032 bytes maximum residency (7 sample(s)) 934,976 bytes maximum slop 12 MB total memory in use (1 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 141 colls, 141 par 0.009s 0.002s 0.0000s 0.0001s Gen 1 7 colls, 6 par 0.003s 0.001s 0.0001s 0.0001s
Parallel GC work balance: 38.80% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 126 (12 converted, 0 overflowed, 0 dud, 0 GC'd, 114 fizzled)
INIT time 0.002s ( 0.002s elapsed) MUT time 2.104s ( 0.946s elapsed) GC time 0.012s ( 0.003s elapsed) EXIT time 0.001s ( 0.000s elapsed) Total time 2.119s ( 0.951s elapsed)
Alloc rate 84,946,520 bytes per MUT second
Productivity 99.3% of total user, 221.3% of total elapsed
gc_alloc_block_sync: 600 whitehole_spin: 0 gen[0].sync: 9 gen[1].sync: 1073
participants (1)
-
Jens Blanck