So, I tried to use the Par monad instead of the Eval monad (only splitting the work into two parts, see below).
I now get (with 1 and 4 cores):

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.994s  (  0.998s elapsed)
  GC      time    0.048s  (  0.051s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.042s  (  1.050s elapsed)

------------

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    1.083s  (  0.922s elapsed)
  GC      time    0.116s  (  0.038s elapsed)
  EXIT    time    0.001s  (  0.000s elapsed)
  Total   time    1.200s  (  0.961s elapsed)


So, some slight gain, but not a lot. Admittedly, the call to fromRational seems to take a good chunk of time.

Any comments?

Jens
---------------
import Control.Monad.Par hiding (parMap)
import Data.Ratio

pqCombine :: (Integer, Integer) -> (Integer, Integer) -> (Integer, Integer)
pqCombine (pl, ql) (pr, qr) = (pl*qr+pr, ql*qr)

pq :: (Integer, Integer) -> (Integer, Integer)
pq (a, b)
    | d >   5 = let m = (a+b+1) `div` 2
                    pql = pq (a, m)
                    pqr = pq (m, b)
                in pqCombine pql pqr
    | otherwise = (sum $ scanl1 (*) [b,b-1..a+1], product [a+1..b])
  where d = b - a

main = print . flip seq () . (\(p,q) -> fromRational (p%q)) . runPar $ do
  i <- new
  j <- new
  fork (put i (pq (0,160000)))
  fork (put j (pq (160000,320000)))
  a <- get i
  b <- get j
  return (pqCombine a b)


On Fri, 14 Aug 2015 at 10:31 Jens Blanck <jens.blanck@gmail.com> wrote:
The following code simply computes the Euler constant by binary
splitting. I'm struggling to get any speed-up on multi-core.

Why do I get a fair number of fizzled?
Why don't I get any real speed-up even if it claims to run tasks in 
parallel?

Could I have exhausted the integer arithmetic unit(s) on my chip
(i7-4790)?
How would I verify that?

The following is the code and typical runs on 1 and 4 cores respectively.

---------

import Control.Parallel.Strategies
import Control.DeepSeq

import Data.Ratio

divConq :: (NFData b) => (a -> b)
        -> a
        -> (a -> Bool)
        -> (b -> b -> b)
        -> (a -> Maybe (a,a))
        -> b
divConq f arg threshold conquer divide = go arg
    where
      go arg = 
          case divide arg of
            Nothing -> f arg
            Just (l0,r0) -> conquer l1 r1 `using` strat
                where
                  l1 = go l0
                  r1 = go r0
                  strat x = do r l1; r r1; return x
                      where r | threshold arg = rdeepseq
                              | otherwise     = rpar

pqCombine :: (Integer, Integer) -> (Integer, Integer) -> (Integer, Integer)
pqCombine (pl, ql) (pr, qr) = (pl*qr+pr, ql*qr)

pq :: (Integer, Integer) -> (Integer, Integer)
pq (a, b)  = (\t -> (sum t, last t)) $ scanl1 (*) [b,b-1..a+1]

euler :: Integer -> Rational
euler n =
    let (p,q) = divConq pq
                        (0,n)
                        (\(a,b) -> b-a < 10000)
                        pqCombine
                        (\(a,b) -> if b-a > 5
                                   then let m = (a+b+1) `div` 2 in Just ((a,m), (m, b))
                                   else Nothing)
    in p%q

main = print $ euler 320000 `seq` ()

----------

> ./BinSplit +RTS -s -N1
()
     178,375,880 bytes allocated in the heap
       2,452,040 bytes copied during GC
       3,222,696 bytes maximum residency (7 sample(s))
         883,040 bytes maximum slop
              11 MB total memory in use (2 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       333 colls,     0 par    0.004s   0.002s     0.0000s    0.0000s
  Gen  1         7 colls,     0 par    0.001s   0.001s     0.0001s    0.0003s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 126 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 126 fizzled)

  INIT    time    0.001s  (  0.000s elapsed)
  MUT     time    0.928s  (  0.936s elapsed)
  GC      time    0.005s  (  0.003s elapsed)
  EXIT    time    0.001s  (  0.000s elapsed)
  Total   time    0.935s  (  0.939s elapsed)

  Alloc rate    192,215,387 bytes per MUT second

  Productivity  99.4% of total user, 98.9% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0



> ./BinSplit +RTS -s -N4
()
     178,727,480 bytes allocated in the heap
       3,506,488 bytes copied during GC
       3,650,032 bytes maximum residency (7 sample(s))
         934,976 bytes maximum slop
              12 MB total memory in use (1 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       141 colls,   141 par    0.009s   0.002s     0.0000s    0.0001s
  Gen  1         7 colls,     6 par    0.003s   0.001s     0.0001s    0.0001s

  Parallel GC work balance: 38.80% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 126 (12 converted, 0 overflowed, 0 dud, 0 GC'd, 114 fizzled)

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time    2.104s  (  0.946s elapsed)
  GC      time    0.012s  (  0.003s elapsed)
  EXIT    time    0.001s  (  0.000s elapsed)
  Total   time    2.119s  (  0.951s elapsed)

  Alloc rate    84,946,520 bytes per MUT second

  Productivity  99.3% of total user, 221.3% of total elapsed

gc_alloc_block_sync: 600
whitehole_spin: 0
gen[0].sync: 9
gen[1].sync: 1073