Performance with do notation, mwc-random and unboxed vector

Hello everyone. I wonder why using do notation with `<-` can ruin the performance. In essence the problem is that, for some action `f :: m Double`, running the code (in my case, `standard` from mwc-random). f for million times is fast but the code do v <- f return v is slower about a hundred times. Consider this simple source where we generate an unboxed vector with million pseudo-random numbers: ---- 8< ----- import qualified Data.Vector.Unboxed as VU import System.Random.MWC import System.Random.MWC.Distributions (standard) count = 1000000 main = do g <- create e' <- VU.replicateM count $ standard g return () ---- >8 ----- Being compiled with -O2, this runs for 0.052 s on my machine. Changing the replicateM line to use do notation brings the runtime down to 11.257 s! See below: ---- 8< ----- import qualified Data.Vector.Unboxed as VU import System.Random.MWC import System.Random.MWC.Distributions (standard) count = 1000000 main = do g <- create e' <- VU.replicateM count $ do v <- standard g return v return () ---- >8 ----- I don't quite understand why this happens. I'm using GHC 7.4.1 on Linux x86_64 system. Compiling *both* versions with profiling enabled changes runtime to 5.673 sec, which is exactly half the runtime of slow version without profiling, and this is awkward (double calculations occuring in do block?). Does anybody have an idea if this is a problem with my do, or with mwc-random, or with vector (my notation disallowing efficient unboxing?).

Well, it's not "do" notation, since replacing "standard g" with "standard g >>= return" gives the same poor performance. I wonder if it has something to do with error checking. On 11 Jun 2012, at 13:38, Dmitry Dzhus wrote:
Hello everyone.
I wonder why using do notation with `<-` can ruin the performance.
In essence the problem is that, for some action `f :: m Double`, running the code (in my case, `standard` from mwc-random).
f
for million times is fast but the code
do v <- f return v
is slower about a hundred times.
Consider this simple source where we generate an unboxed vector with million pseudo-random numbers:
---- 8< ----- import qualified Data.Vector.Unboxed as VU
import System.Random.MWC import System.Random.MWC.Distributions (standard)
count = 1000000
main = do g <- create e' <- VU.replicateM count $ standard g return () ---- >8 -----
Being compiled with -O2, this runs for 0.052 s on my machine.
Changing the replicateM line to use do notation brings the runtime down to 11.257 s! See below:
---- 8< ----- import qualified Data.Vector.Unboxed as VU
import System.Random.MWC import System.Random.MWC.Distributions (standard)
count = 1000000
main = do g <- create e' <- VU.replicateM count $ do v <- standard g return v return () ---- >8 -----
I don't quite understand why this happens. I'm using GHC 7.4.1 on Linux x86_64 system.
Compiling *both* versions with profiling enabled changes runtime to 5.673 sec, which is exactly half the runtime of slow version without profiling, and this is awkward (double calculations occuring in do block?).
Does anybody have an idea if this is a problem with my do, or with mwc-random, or with vector (my notation disallowing efficient unboxing?).
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 11 Jun 2012, at 10:38, Dmitry Dzhus wrote:
main = do g <- create e' <- VU.replicateM count $ standard g return ()
In all likelhood, ghc is spotting that the value e' is not used, and that there are no side-effects, so it does not do anything at runtime. If you expand the action argument to replicateM, such that it uses do-notation instead, perhaps ghc can no longer prove the lack of side-effects, and so actually runs the computation before throwing away its result. When writing toy benchmarks in a lazy language, it is always important to understand to what extent your program _uses_ the data from a generator, or you are bound to get misleading performance measurements. Regards, Malcolm

11.06.2012, 14:17, "Malcolm Wallace"
that there are no side-effects
There are — PRNG state is updated for RealWorld, that's why monadic replicateM is used. You can add something like print $ (VU.!) e 500000 after e is bound and still get 0.057 sec with do-less version. This quite matches the performance claimed by mwc-random package and seems reasonable since modern hardware shouldn't have any problem with generating twenty million random variates in a second with one execution thread. Your note on laziness would be correct in case like ------ 8< ------ import qualified Data.Vector.Unboxed as VU import Data.Functor import System.Random.MWC import System.Random.MWC.Distributions (standard) count = 100000000 main = do g <- create e <- return $ VU.replicate count (212.8506 :: Double) return () ------ >8 ------- Where unused `e` is truly left unevaluated (you could force it by matching with `!e` for example). Profiling indicates that random number sampling really occurs for both of original versions with `replicateM`, expectedly taking most of time: Mon Jun 11 14:24 2012 Time and Allocation Profiling Report (Final) slow-mwc-vector +RTS -p -RTS total time = 5.45 secs (5453 ticks @ 1000 us, 1 processor) total alloc = 3,568,827,856 bytes (excludes profiling overheads) COST CENTRE MODULE %time %alloc uniform2 System.Random.MWC 45.0 53.7 uniformWord32 System.Random.MWC 31.3 31.5 standard.loop System.Random.MWC.Distributions 4.1 1.1 uniform1 System.Random.MWC 3.9 4.5 nextIndex System.Random.MWC 3.6 1.4 uniform System.Random.MWC 2.8 3.3 uniform System.Random.MWC 2.5 1.4 wordsToDouble System.Random.MWC 2.1 0.5 I could drop do notation and go with the simpler version if I wanted just a vector of variates. But in reality I want a vector of tuples with random components: ------ 8< ------ import qualified Data.Vector.Unboxed as VU import Control.Monad import System.Random.MWC import System.Random.MWC.Distributions (standard) count = 1000000 main = do g <- create e <- VU.replicateM count $ do v1 <- standard g v2 <- standard g v3 <- standard g return (v1, v2, v3) return () ------ >8 ------- which runs for the same 11.412 seconds. Since three times more variates are generated and run time stays the same, this implies that perhaps some optimizations of vector package interfere with mwc-random — can this be the case? This becomes quite a bottleneck in my application. On the other hand, mwc-random has `normal` function implemented as follows: ------ 8< ------ normal m s gen = do x <- standard gen return $! m + s * x ------ >8 ------- which again uses explicit `do`. Both standard and normal are marked with INLINE. Now if I try to write ------ 8< ------ e <- VU.replicateM count $ normal 0 1 g ------ >8 ------- in my test case, quite expectedly I get horrible performance of 11 seconds, even though I'm not using do myself.

On 11/06/2012, at 10:38, Dmitry Dzhus wrote:
Consider this simple source where we generate an unboxed vector with million pseudo-random numbers:
---- 8< ----- import qualified Data.Vector.Unboxed as VU
import System.Random.MWC import System.Random.MWC.Distributions (standard)
count = 1000000
main = do g <- create e' <- VU.replicateM count $ standard g return () ---- >8 -----
Being compiled with -O2, this runs for 0.052 s on my machine.
Changing the replicateM line to use do notation brings the runtime down to 11.257 s! See below:
---- 8< ----- import qualified Data.Vector.Unboxed as VU
import System.Random.MWC import System.Random.MWC.Distributions (standard)
count = 1000000
main = do g <- create e' <- VU.replicateM count $ do v <- standard g return v return () ---- >8 -----
The former essentially generates this: replicateM n ((letrec f = ... in f) `cast` ...) and the latter this: replicateM n (\(s :: State# RealWorld) -> (letrec f = ... in f s) `cast` ...) I'd look further into this but mwc-random just inlines too much stuff. Could you perhaps find a smaller example that doesn't use mwc-random? In any case, it looks like a GHC bug, perhaps the state hack is getting in the way. Roman

On 12 Jun 2012, at 12:52, Dmitry Dzhus
12.06.2012, 01:08, "Roman Leshchinskiy"
: perhaps the state hack is getting in the way.
I don't quite understand the internals of this yet, but `-fno-state-hack` leads to great performance in both cases! How safe is that?
It doesn't change the semantics of your program but it can make it significantly slower (or faster, as in this case). The various state hack related tickets on trac might give you an idea of what is happening here. We really need some proper arity analysis! Roman

On Wed, Jun 13, 2012 at 12:56 AM, Roman Leshchinskiy
It doesn't change the semantics of your program but it can make it significantly slower (or faster, as in this case). The various state hack related tickets on trac might give you an idea of what is happening here.
I filed a bug: http://hackage.haskell.org/trac/ghc/ticket/6166 (I'd CC myself on an existing bug, but trac's search feature gives me tons of irrelevant hits.)
participants (5)
-
Bryan O'Sullivan
-
Dmitry Dzhus
-
Malcolm Wallace
-
MigMit
-
Roman Leshchinskiy