A problem with par and modules boundaries...

I'll cut to the chase. The short program below works perfectly: when I compile it with -O2 -threaded and run with +RTS -N2 command-line options, I get a nearly 50% real-time improvement: $ time ./primes-test +RTS -N2 5001 real 0m9.307s user 0m16.581s sys 0m0.200s However, if I move the `parallelize' definition into another module and import that module, the performance is completely lost: $ time ./primes-test +RTS -N2 5001 real 0m15.282s user 0m15.165s sys 0m0.080s I'm confused. I know that `par` must be able work across modules boundaries, because Control.Parallel.Strategies is a module and presumably it works. What am I doing wrong?
module Main where
import Control.Parallel import Data.List (find) import Data.Maybe (maybe)
--import Parallelizable parallelize a b = a `par` (b `pseq` (a, b))
test :: Integer -> Integer -> Integer test n1 n2 = let (p1, p2) = parallelize (product $ factors $ product [1..n1]) (product $ factors $ product [1..n2]) in p2 `div` p1
factors n = maybe [n] (\k-> (k : factors (n `div` k))) (find (\k-> n `mod` k == 0) [2 .. n - 1])
main = print (test 5000 5001)

Am Freitag 22 Mai 2009 04:59:51 schrieb Mario Blažević:
I'll cut to the chase. The short program below works perfectly: when I compile it with -O2 -threaded and run with +RTS -N2 command-line options, I get a nearly 50% real-time improvement:
$ time ./primes-test +RTS -N2 5001
real 0m9.307s user 0m16.581s sys 0m0.200s
However, if I move the `parallelize' definition into another module and import that module, the performance is completely lost:
$ time ./primes-test +RTS -N2 5001
real 0m15.282s user 0m15.165s sys 0m0.080s
I'm confused. I know that `par` must be able work across modules boundaries, because Control.Parallel.Strategies is a module and presumably it works. What am I doing wrong?
You forgot {-# INLINE parallelize #-} For me, that works.
module Main where
import Control.Parallel import Data.List (find) import Data.Maybe (maybe)
--import Parallelizable parallelize a b = a `par` (b `pseq` (a, b))
test :: Integer -> Integer -> Integer test n1 n2 = let (p1, p2) = parallelize (product $ factors $ product [1..n1]) (product $ factors $ product [1..n2]) in p2 `div` p1
factors n = maybe [n] (\k-> (k : factors (n `div` k))) (find (\k-> n `mod` k == 0) [2 .. n - 1])
main = print (test 5000 5001)

Answer recorded at: http://haskell.org/haskellwiki/Performance/Parallel daniel.is.fischer:
Am Freitag 22 Mai 2009 04:59:51 schrieb Mario Blažević:
I'll cut to the chase. The short program below works perfectly: when I compile it with -O2 -threaded and run with +RTS -N2 command-line options, I get a nearly 50% real-time improvement:
$ time ./primes-test +RTS -N2 5001
real 0m9.307s user 0m16.581s sys 0m0.200s
However, if I move the `parallelize' definition into another module and import that module, the performance is completely lost:
$ time ./primes-test +RTS -N2 5001
real 0m15.282s user 0m15.165s sys 0m0.080s
I'm confused. I know that `par` must be able work across modules boundaries, because Control.Parallel.Strategies is a module and presumably it works. What am I doing wrong?
You forgot
{-# INLINE parallelize #-}
For me, that works.
module Main where
import Control.Parallel import Data.List (find) import Data.Maybe (maybe)
--import Parallelizable parallelize a b = a `par` (b `pseq` (a, b))
test :: Integer -> Integer -> Integer test n1 n2 = let (p1, p2) = parallelize (product $ factors $ product [1..n1]) (product $ factors $ product [1..n2]) in p2 `div` p1
factors n = maybe [n] (\k-> (k : factors (n `div` k))) (find (\k-> n `mod` k == 0) [2 .. n - 1])
main = print (test 5000 5001)
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Fri, 2009-05-22 at 05:30 -0700, Don Stewart wrote:
Answer recorded at:
I have to complain, this answer doesn't explain anything. This isn't like straight-line performance, there's no reason as far as I can see that inlining should change the operational behaviour of parallel evaluation, unless there's some mistake in the original such as accidentally relying on an unspecified evaluation order. Now, I tried the example using two versions of ghc and I get different behaviour from what other people are seeing. With the original code, (ie parallelize function in the same module) with ghc-6.10.1 I get no speedup at all from -N2 and with 6.11 I get a very good speedup (though single threaded performance is slightly lower in 6.11) Original code ghc-6.10.1, -N1 -N2 real 0m9.435s 0m9.328s user 0m9.369s 0m9.249s ghc-6.11, -N1 -N2 real 0m10.262s 0m6.117s user 0m10.161s 0m11.093s With the parallelize function moved into another module I get no change whatsoever. Indeed even when I force it *not* to be inlined with {-# NOINLINE parallelize #-} then I still get no change in behaviour (as indeed I expected). So I view this advice to force inlining with great suspicion (at worst it encourages people not to think and to look at it as magic). That said, why it does not get any speedup with ghc-6.10 is also a mystery to me (there's very little GC going on). Don: can we change the advice on the wiki please? It currently makes it look like a known and understood issue. If anything we should suggest using a later ghc version. Duncan

duncan.coutts:
On Fri, 2009-05-22 at 05:30 -0700, Don Stewart wrote:
Answer recorded at:
I have to complain, this answer doesn't explain anything. This isn't like straight-line performance, there's no reason as far as I can see that inlining should change the operational behaviour of parallel evaluation, unless there's some mistake in the original such as accidentally relying on an unspecified evaluation order.
Now, I tried the example using two versions of ghc and I get different behaviour from what other people are seeing. With the original code, (ie parallelize function in the same module) with ghc-6.10.1 I get no speedup at all from -N2 and with 6.11 I get a very good speedup (though single threaded performance is slightly lower in 6.11)
Original code ghc-6.10.1, -N1 -N2 real 0m9.435s 0m9.328s user 0m9.369s 0m9.249s
ghc-6.11, -N1 -N2 real 0m10.262s 0m6.117s user 0m10.161s 0m11.093s
With the parallelize function moved into another module I get no change whatsoever. Indeed even when I force it *not* to be inlined with {-# NOINLINE parallelize #-} then I still get no change in behaviour (as indeed I expected).
So I view this advice to force inlining with great suspicion (at worst it encourages people not to think and to look at it as magic). That said, why it does not get any speedup with ghc-6.10 is also a mystery to me (there's very little GC going on).
Don: can we change the advice on the wiki please? It currently makes it look like a known and understood issue. If anything we should suggest using a later ghc version.
Please do so. Especially if GHC HEAD *does the right thing*. Then the advice should be first: upgrade to GHC HEAD.

Daniel Fischer wrote:
Am Freitag 22 Mai 2009 04:59:51 schrieb Mario Blažević:
... I'm confused. I know that `par` must be able work across modules boundaries, because Control.Parallel.Strategies is a module and presumably it works. What am I doing wrong?
You forgot
{-# INLINE parallelize #-}
For me, that works.
That's great, thank you. I am still baffled, though. Must every exported function that uses `par' be INLINEd? Does every exported caller of such a function need the same treatment? Is `par' really a macro, rather than a function?
module Main where
import Control.Parallel import Data.List (find) import Data.Maybe (maybe)
--import Parallelizable parallelize a b = a `par` (b `pseq` (a, b))
test :: Integer -> Integer -> Integer test n1 n2 = let (p1, p2) = parallelize (product $ factors $ product [1..n1]) (product $ factors $ product [1..n2]) in p2 `div` p1
factors n = maybe [n] (\k-> (k : factors (n `div` k))) (find (\k-> n `mod` k == 0) [2 .. n - 1])
main = print (test 5000 5001)
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Mario Blazevic mblazevic@stilo.com Stilo Corporation This message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure, copying, or distribution is strictly prohibited. If you are not the intended recipient(s) please contact the sender by reply email and destroy all copies of the original message and any attachments.

Am Freitag 22 Mai 2009 15:11:32 schrieb Mario Blazevic:
Daniel Fischer wrote:
Am Freitag 22 Mai 2009 04:59:51 schrieb Mario Blažević:
... I'm confused. I know that `par` must be able work across modules boundaries, because Control.Parallel.Strategies is a module and presumably it works. What am I doing wrong?
You forgot
{-# INLINE parallelize #-}
For me, that works.
That's great, thank you. I am still baffled, though. Must every exported function that uses `par' be INLINEd? Does every exported caller of such a function need the same treatment? Is `par' really a macro, rather than a function?
I'm not an expert in either parallelism/concurrency or GHC, so my interpretation may be wrong. The functions par and pseq are defined in GHC.Conc: ============================================================ -- Nota Bene: 'pseq' used to be 'seq' -- but 'seq' is now defined in PrelGHC -- -- "pseq" is defined a bit weirdly (see below) -- -- The reason for the strange "lazy" call is that -- it fools the compiler into thinking that pseq and par are non-strict in -- their second argument (even if it inlines pseq at the call site). -- If it thinks pseq is strict in "y", then it often evaluates -- "y" before "x", which is totally wrong. {-# INLINE pseq #-} pseq :: a -> b -> b pseq x y = x `seq` lazy y {-# INLINE par #-} par :: a -> b -> b par x y = case (par# x) of { _ -> lazy y } ============================================================ As far as I understand, par doesn't guarantee that both arguments are evaluated in parallel, it's just a suggestion to the compiler, and if whatever heuristics the compiler uses say it may be favourable to do it in parallel, it will produce code to calculate it in parallel (given appropriate compile- and run-time flags), otherwise it produces purely sequential code. With parallelize in a separate module, when compiling that, the compiler has no way to see whether parallelizing the computation may be beneficial, so doesn't produce (potentially) parallel code. At the use site, in the other module, it doesn't see the 'par', so has no reason to even consider producing parallel code. If parallelize is defined in the module where it's used, it will be inlined anyway since it is small, so the compiler sees the 'par' (actually par#) when compiling the use site and can employ the heuristics to decide whether to produce parallel code. If you place an INLINE pragma near the definition of parallelize, it will be inlined when compiling the importing module, so again the compiler sees the opportunity to parallelize. So, if I got it right (or nearly right), yes, every exported function that uses par should be INLINEd [1], and have a simple enough body that it will indeed be inlined. The same holds for callers of such functions, if the compiler can't see at the definition that parallelism is good, let the function be inlined so that it may be spotted at the call site. [1] Well, I suppose for function x = (expensive1 `par` expensive2) `seq` x and such, if expensive1/2 are defined in the same module, it may not be necessary.

On Fri, 2009-05-22 at 16:34 +0200, Daniel Fischer wrote:
That's great, thank you. I am still baffled, though.
I'm baffled too! I don't see the same behaviour at all (see the other email).
Must every exported function that uses `par' be INLINEd? Does every exported caller of such a function need the same treatment?
It really should not be necessary.
Is `par' really a macro, rather than a function?
It's a function.
As far as I understand, par doesn't guarantee that both arguments are evaluated in parallel, it's just a suggestion to the compiler, and if whatever heuristics the compiler uses say it may be favourable to do it in parallel, it will produce code to calculate it in parallel (given appropriate compile- and run-time flags), otherwise it produces purely sequential code.
With parallelize in a separate module, when compiling that, the compiler has no way to see whether parallelizing the computation may be beneficial, so doesn't produce (potentially) parallel code. At the use site, in the other module, it doesn't see the 'par', so has no reason to even consider producing parallel code.
I don't think this is right. As I understand it, par always creates a spark. It has nothing to do with heuristics. Whether the spark actually gets evaluated in parallel depends on the runtime system and whether the spark "fizzles" before it gets a chance to run. Of course when using the single threaded rts then the sparks are never evaluated in parallel. With the threaded rts and given enough CPUs, the rts will try to schedule the sparks onto idle CPUs. This business of getting sparks running on other CPUs has improved significantly since ghc-6.10. The current development version uses a better concurrent queue data structure to manage the spark pool. That's probably the underlying reason for why the example works well in ghc-6.11 but works badly in 6.10. I'm afraid I'm not sure of what exactly is going wrong that means it doesn't work well in 6.10. Generally I'd expect the effect of par to be pretty insensitive to inlining. I'm cc'ing the ghc users list so perhaps we'll get some expert commentary. Duncan

Am Samstag 23 Mai 2009 13:06:04 schrieb Duncan Coutts:
On Fri, 2009-05-22 at 16:34 +0200, Daniel Fischer wrote:
That's great, thank you. I am still baffled, though.
I'm baffled too! I don't see the same behaviour at all (see the other email).
Must every exported function that uses `par' be INLINEd? Does every exported caller of such a function need the same treatment?
It really should not be necessary.
Is `par' really a macro, rather than a function?
It's a function.
As far as I understand, par doesn't guarantee that both arguments are evaluated in parallel, it's just a suggestion to the compiler, and if whatever heuristics the compiler uses say it may be favourable to do it in parallel, it will produce code to calculate it in parallel (given appropriate compile- and run-time flags), otherwise it produces purely sequential code.
With parallelize in a separate module, when compiling that, the compiler has no way to see whether parallelizing the computation may be beneficial, so doesn't produce (potentially) parallel code. At the use site, in the other module, it doesn't see the 'par', so has no reason to even consider producing parallel code.
I don't think this is right. As I understand it, par always creates a spark. It has nothing to do with heuristics.
Quite possible. I was only guessing from the fact that sometimes par evaluates things in parallel and sometimes not, plus when thinking what might cause the described behaviour, cross-module inlining came to mind, I tried adding an INLINE pragma and it worked - or so it seemed. Then I threw together an explanation of the observed behaviour. That explanation must be wrong, though, see below.
Whether the spark actually gets evaluated in parallel depends on the runtime system and whether the spark "fizzles" before it gets a chance to run. Of course when using the single threaded rts then the sparks are never evaluated in parallel. With the threaded rts and given enough CPUs, the rts will try to schedule the sparks onto idle CPUs. This business of getting sparks running on other CPUs has improved significantly since ghc-6.10. The current development version uses a better concurrent queue data structure to manage the spark pool. That's probably the underlying reason for why the example works well in ghc-6.11 but works badly in 6.10. I'm afraid I'm not sure of what exactly is going wrong that means it doesn't work well in 6.10.
I have tried with 6.10.3 and 6.10.1, with parallelize in the same module and in a separate module - with no pragma - with an INLINE pragma - with a NOINLINE pragma 6.10.1 did not parallelize in any of these settings 6.10.3 parallelized in all these settings except "separate module, no pragma". Then I tried a few other settigns with 6.10.3, got parallel evaluation if there's an INLINE or a NOINLINE pragma on parallelize, or the module header of Main is module Main (main) where, not if Main exports all top level definitions and parallelize is neither INLINEd nor NOINLINEd. Weird.
Generally I'd expect the effect of par to be pretty insensitive to inlining. I'm cc'ing the ghc users list so perhaps we'll get some expert commentary.
That would be good.
Duncan
Daniel
participants (5)
-
Daniel Fischer
-
Don Stewart
-
Duncan Coutts
-
Mario Blazevic
-
Mario Blažević