Massive slowdown in mwc-random after switching to use of primitive package

Hi, Roman — Recently, I switched the mwc-random package ( http://hackage.haskell.org/package/mwc-random) over from running in the ST monad to using your primitive package. I didn't notice initially, but this caused a huge performance regression. mwc-random 0.4.1.1 uses ST internally, and runs the benchmarks/Quickie benchmark in 0.017 seconds. mwc-random 0.5.1.3 uses PrimMonad internally, and takes 1.328 seconds to run the same benchmark. That's about a factor of 80 slowdown. This causes a massive knock-on performance loss in packages such as criterion that need fast PRNGs. The problem is very easy to reproduce: cabal install mwc-random-0.4.1.1 cabal install mwc-random-0.5.1.3 darcs get http://darcs.serpentine.com/mwc-random/ cd mwc-random/benchmarks ghc -fforce-recomp -O -package mwc-random-0.4.1.1 --make Quickie -o quickie-411 ghc -fforce-recomp -O -package mwc-random-0.5.1.3 --make Quickie -o quickie-513 time ./quickie-411 time ./quickie-513 If you could shed any light, I'd be most grateful, as this has me a bit confounded.

On Sat, Jul 10, 2010 at 7:09 PM, Bryan O'Sullivan
If you could shed any light, I'd be most grateful, as this has me a bit confounded.
As a further data point, the attached patch replaces uses of PrimMonad with ST, and speeds performance back to 0.015 seconds, so it's clearly the use of primitive, and not the switch from uvector to vector, that causes the performance loss.

On Saturday 10 July 2010 2:09:48 pm Bryan O'Sullivan wrote:
Recently, I switched the mwc-random package ( http://hackage.haskell.org/package/mwc-random) over from running in the ST monad to using your primitive package. I didn't notice initially, but this caused a huge performance regression.
You're using GHC 6.12.x presumably? There are known performance problems with using abstract PrimMonads in that version (and, actually, just using IO as well). The problem, as far as I (and Roman, I believe) have been able to ascertain, is a lack of some sort of dictionary inlining/specialization. So, when you use PrimMonad or IO, You get repeated casts between the state types. In the IO case for instance, checking the core tends to reveal lots of casting between RealWorld and PrimState IO, despite the fact that those should be identical. Whether this is directly responsible, or just prevents other optimizations from firing, I'm not totally sure. For some reason, ST isn't affected, possibly because it holds its state thread type in a parameter. That means one workaround is to keep all your computations in ST, and as long as they aren't inlined into an stToIO or something, which would trigger the problem by specializing the code to ST RealWorld, you should have decent performance. The good news is that, last I checked, this isn't a problem in HEAD. The optimizer does better there somehow. The bad news is that 6.14 isn't out yet, so you probably have to wait for a real fix. -- Dan

On Sun, Jul 11, 2010 at 12:59 AM, Dan Doel
You're using GHC 6.12.x presumably?
That's right.
There are known performance problems with using abstract PrimMonads in that version (and, actually, just using IO as well).
Ah, that's a shame. I'm surprised it would be affecting IO too!
In the IO case for instance, checking the core tends to reveal lots of casting
between RealWorld and PrimState IO, despite the fact that those should be
identical.
I'd certainly noticed that the Core for the PrimMonad code was huge and almost impossible to follow for the enormous amount of casting that was taking place. The good news is that, last I checked, this isn't a problem in HEAD. The
optimizer does better there somehow. The bad news is that 6.14 isn't out yet, so you probably have to wait for a real fix.
I might revert both mwc-random and statistics back to using plain ST for now, then, as having them run slightly faster than 1% of their former speed is a bit painful :-(

On 11/07/2010, at 22:49, Bryan O'Sullivan wrote:
On Sun, Jul 11, 2010 at 12:59 AM, Dan Doel
wrote: You're using GHC 6.12.x presumably?
That's right.
There are known performance problems with using abstract PrimMonads in that version (and, actually, just using IO as well).
Ah, that's a shame. I'm surprised it would be affecting IO too!
FWIW, I tried a couple of different designs when writing primitive and the current one seems to works best with 6.12, at least for the things I'm using it for. Really, 6.12 just seems to be rather hopeless here.
In the IO case for instance, checking the core tends to reveal lots of casting between RealWorld and PrimState IO, despite the fact that those should be identical.
I'd certainly noticed that the Core for the PrimMonad code was huge and almost impossible to follow for the enormous amount of casting that was taking place.
The head has -dsuppress-coercions which omits coercion terms when pretty printing Core. It would be easy to backport that to 6.12.
I might revert both mwc-random and statistics back to using plain ST for now, then, as having them run slightly faster than 1% of their former speed is a bit painful :-(
I fear that's the only sensible solution. I wouldn't throw away your current code, though, as 6.14 won't have these problems (hopefully). Roman

| The head has -dsuppress-coercions which omits coercion terms when pretty | printing Core. It would be easy to backport that to 6.12. The HEAD also has a coercion optimiser that dramatically shrinks some large coercion terms. | > I might revert both mwc-random and statistics back to using plain ST for | now, then, as having them run slightly faster than 1% of their former speed | is a bit painful :-( | | I fear that's the only sensible solution. I wouldn't throw away your current | code, though, as 6.14 won't have these problems (hopefully). Yes, I hope that's possible because I really really don't want to have to dig into 6.12's inliner. Partly we'd then need to make yet another release, and partly it's just delicate -- that's why I redid it for 6.14. It's very helpful to have the test, though, so I can ensure 6.14 stays good. Thanks for that Simon

| > caused a huge performance regression. | | You're using GHC 6.12.x presumably? There are known performance problems with | using abstract PrimMonads in that version (and, actually, just using IO as | well). ... | The good news is that, last I checked, this isn't a problem in HEAD. The | optimizer does better there somehow. The bad news is that 6.14 isn't out yet, This is the first I've heard of this. Do you have a test case that shows up the problem? Then we can put it in the regression tests so it won't go wrong again. Simon

On Sunday 11 July 2010 1:31:23 pm Simon Peyton-Jones wrote:
This is the first I've heard of this. Do you have a test case that shows up the problem? Then we can put it in the regression tests so it won't go wrong again.
That depends on what dependencies you're willing to accept. I think all the mutable algorithms in vector-algorithms are affected (I first discovered the problem after porting my benchmarks), so if you don't mind depending on vector and vector-algorithms, it shouldn't be hard to write one. If something stand-alone is required, I (at least) don't have anything prepared at the moment. -- Dan

Well, fewer dependencies make it much easier to test and reproduce. But something concrete is certainly better than nothing. Please submit a Trac ticket. (Worth checking with Roman -- perhaps he already has?) Simon | -----Original Message----- | From: Dan Doel [mailto:dan.doel@gmail.com] | Sent: 12 July 2010 01:56 | To: Simon Peyton-Jones | Cc: glasgow-haskell-users@haskell.org | Subject: Re: Massive slowdown in mwc-random after switching to use of | primitive package | | On Sunday 11 July 2010 1:31:23 pm Simon Peyton-Jones wrote: | > This is the first I've heard of this. Do you have a test case that shows | > up the problem? Then we can put it in the regression tests so it won't go | > wrong again. | | That depends on what dependencies you're willing to accept. I think all the | mutable algorithms in vector-algorithms are affected (I first discovered the | problem after porting my benchmarks), so if you don't mind depending on | vector | and vector-algorithms, it shouldn't be hard to write one. | | If something stand-alone is required, I (at least) don't have anything | prepared at the moment. | | -- Dan

On Mon, Jul 12, 2010 at 9:12 AM, Simon Peyton-Jones
Well, fewer dependencies make it much easier to test and reproduce. But something concrete is certainly better than nothing. Please submit a Trac ticket. (Worth checking with Roman -- perhaps he already has?)
I can probably cook up a smaller repro than Dan, since I depend on neither package for a standalone test case. Let me get cracking on it promptly, and I'll file a ticket later.

On Mon, Jul 12, 2010 at 10:40 AM, Bryan O'Sullivan
I can probably cook up a smaller repro than Dan, since I depend on neither package for a standalone test case. Let me get cracking on it promptly, and I'll file a ticket later.
I filed http://hackage.haskell.org/trac/ghc/ticket/4184 just now, which includes two workarounds that I found for my particular version of the problem, the second of which I stumbled upon quite by accident while creating the standalone repro.

Thank you! I am utterly buried just now (POPL) but will get to this.
Simon
From: glasgow-haskell-users-bounces@haskell.org [mailto:glasgow-haskell-users-bounces@haskell.org] On Behalf Of Bryan O'Sullivan
Sent: 12 July 2010 13:58
To: Simon Peyton-Jones
Cc: glasgow-haskell-users@haskell.org
Subject: Re: Massive slowdown in mwc-random after switching to use of primitive package
On Mon, Jul 12, 2010 at 10:40 AM, Bryan O'Sullivan
participants (4)
-
Bryan O'Sullivan
-
Dan Doel
-
Roman Leshchinskiy
-
Simon Peyton-Jones