
I could pinpoint one part of the problem. Please see the ticket:
https://ghc.haskell.org/trac/ghc/ticket/14208. Here is the description that
I wrote in the ticket:
In this particular case -O2 is 2x slower than -O0, and -O0 is 2x slower than
runghc. Please see the GitHub repo
https://github.com/harendra-kumar/ghc-perf to reproduce the issue; the README
file in the repo has the reproduction instructions.
The issue seems to occur when the code is placed in a different module.
When all the code is in the same module the problem does not occur. In that
case -O2 is faster than -O0. However, when the code is split into two
modules the performance gets inverted.
Also, it does not always occur; when I tried to change the code to make it
simpler for the repro, the problem went away.
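
To give a feel for the kind of split being described, here is a made-up illustration (the actual repro code is in the ghc-perf repo above; these module names and the loop are invented for the example only):

-- Lib.hs (hypothetical): the hot code lives in its own module.
module Lib (countTo) where

countTo :: Int -> Int
countTo n = go 0 0
  where
    go acc i
      | i >= n    = acc
      | otherwise = go (acc + 1) (i + 1)

-- Main.hs (hypothetical): the driver calls into the other module.
module Main (main) where

import Lib (countTo)

main :: IO ()
main = print (countTo 1000000)

With everything in one module GHC sees the whole loop at once; after the split, cross-module optimization depends on the unfoldings the library module exposes, which is presumably the kind of difference at play here.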
-harendra
On 9 September 2017 at 14:08, Harendra Kumar wrote:
The code is at: https://github.com/harendra-kumar/asyncly. The benchmark code is in "benchmark/Main.hs". The relevant function is "asyncly_basic".
If you want to run it, you can use the following steps to reproduce the behavior I reported below:
1) Run "stack build" 2) Run "stack runghc benchmark/Main.hs" for runghc figures 3) Run "stack ghc benchmark/Main.hs && benchmark/Main" to compile and run normally 4) Run "stack ghc -- -O2 benchmark/Main.hs && benchmark/Main" to compile and run with -O2 flag
Just look at the first benchmark (asyncly-serial); you can comment out all the others if you want to. Note that the library gets compiled without any optimization flags (see the ghc-options in the cabal file), so what we are seeing here is just the effect of -O2 on compiling benchmark/Main.hs.
I am also trying to isolate the problem to a minimal case. I tried removing all the INLINE pragmas in the library to make sure that I am not screwing it up by asking the compiler to inline aggressively, but that does not seem to make any difference to the situation. Let me know if you need any information from me or help in running it.
There are three issues that I am trying to get answers for:
1) Why is runghc faster? It means there is a possibility for the program to run as fast as runghc runs it. How do I get that performance, or an explanation of the gap?
2) Why does -O1/-O2 degrade performance so much, by 4-5x?
3) The third is the original problem that I posted in this thread: the compiler is unable to match manual inlining. It is possible that this is an issue only when -O1/-O2 is used and not when -O0 is used.
Thanks for the help.
-harendra
On 9 September 2017 at 13:30, Matthew Pickering <matthewtpickering@gmail.com> wrote:
Do you have the code?
On Sat, Sep 9, 2017 at 6:05 AM, Harendra Kumar wrote:
While trying to come up with a minimal example I discovered one more puzzling thing: runghc is fastest, ghc is slower, and ghc with optimization is slowest. This is the complete reverse of the expected order.
ghc -O1 (-O2 is similar):
time 15.23 ms (14.72 ms .. 15.73 ms)
ghc -O0:
time 3.612 ms (3.548 ms .. 3.728 ms)
runghc:
time 2.250 ms (2.156 ms .. 2.348 ms)
I am grokking it further. Any pointers will be helpful. I understand
-O2 can sometimes be slower, e.g. aggressive inlining can sometimes be counterproductive. But a 4x variation is a lot, and it happens with -O1 as well, which should generally be safer than -O2. Worst of all, runghc is significantly faster than ghc. What's going on?
-harendra
On 8 September 2017 at 18:49, Harendra Kumar wrote:
I will try creating a minimal example and open a ticket for the inlining problem, the one I am sure about.
-harendra
On 8 September 2017 at 18:35, Simon Peyton Jones <simonpj@microsoft.com> wrote:
I know that this is not an easy request, but can either of you produce a small example that demonstrates your problem? If so, please open a ticket.
I don’t like hearing about people having to use trial and error with INLINE or SPECIALISE pragmas. But I can’t even begin to solve the problem unless I can reproduce it.
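
For readers who have not used these pragmas, here is a tiny generic illustration of INLINE and SPECIALISE (invented names, not code from this thread):

module PragmaDemo (sumSquares) where

-- INLINE: ask GHC to inline this definition at its call sites.
{-# INLINE step #-}
step :: Num a => a -> a -> a
step acc x = acc + x * x

-- SPECIALISE: ask GHC to compile a monomorphic copy for Int,
-- so no Num dictionary is passed at runtime on that path.
{-# SPECIALISE sumSquares :: [Int] -> Int #-}
sumSquares :: Num a => [a] -> a
sumSquares = foldl step 0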
Simon
From: ghc-devs [mailto:ghc-devs-bounces@haskell.org] On Behalf Of Harendra Kumar
Sent: 08 September 2017 13:50
To: Mikolaj Konarski
Cc: ghc-devs@haskell.org
Subject: Re: Performance degradation when factoring out common code

I should also point out that I saw performance improvements by manually factoring out and propagating some common expressions to outer loops in performance-sensitive paths. I have now made a habit of doing this manually. Not sure if something like this has also been fixed with that ticket or some other ticket.
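
As a purely generic sketch of that manual transformation (this shows the pattern, not the library code):

-- Before: the common expression is recomputed on every iteration.
sumShifted :: Int -> [Int] -> Int
sumShifted x = foldl (\acc y -> acc + expensive x + y) 0

-- After: the common expression is hoisted out of the loop by hand.
sumShifted' :: Int -> [Int] -> Int
sumShifted' x = let e = expensive x
                in foldl (\acc y -> acc + e + y) 0

-- Stand-in for some costly pure computation.
expensive :: Int -> Int
expensive n = sum [1 .. n]

(GHC's full-laziness pass is supposed to do this kind of floating itself, which is why it is surprising when doing it by hand makes a measurable difference.)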
-harendra
On 8 September 2017 at 17:34, Harendra Kumar <harendra.kumar@gmail.com> wrote:
Thanks Mikolaj! I have seen some surprising behavior quite a few times recently and I was wondering whether GHC should do better. In one case I had to use SPECIALIZE very aggressively; in another version of the same code it worked well without that. I have been doing a lot of trial and error with the INLINE/NOINLINE pragmas to figure out what the right combination is. Sometimes it just feels like black magic, because I cannot find a rationale to explain the behavior. I am not sure if there are any more such problems lurking in there; perhaps this is an area where some improvement looks possible.
-harendra
On 8 September 2017 at 17:10, Mikolaj Konarski wrote:
Hello,
I've had a similar problem that's been fixed in 8.2.1:
https://ghc.haskell.org/trac/ghc/ticket/12603
You can also use some extreme global flags, such as
ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively
to get most of the GHC subtlety and shyness out of the way when experimenting.
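
If it is more convenient while experimenting, the same flags can, I believe, also be scoped to a single file with an OPTIONS_GHC pragma instead of being set package-wide (assuming your GHC accepts them as per-module flags):

-- At the top of the library module whose code you want exposed/specialised
-- (the module name here is just a placeholder):
{-# OPTIONS_GHC -fexpose-all-unfoldings -fspecialise-aggressively #-}
module MyLibModule where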
Good luck,
Mikolaj
On Fri, Sep 8, 2017 at 11:21 AM, Harendra Kumar wrote:
Hi,
I have this code snippet for the bind implementation of a Monad:
AsyncT m >>= f = AsyncT $ \_ stp yld ->
    let run x = (runAsyncT x) Nothing stp yld
        yield a _ Nothing  = run $ f a
        yield a _ (Just r) = run $ f a <> (r >>= f)
    in m Nothing stp yield
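
(For anyone trying to follow the continuation-passing style: the following is only a rough guess at the shape of AsyncT, reconstructed from how the snippet uses it; the real definition is in the asyncly repo and differs in detail, e.g. in what the context argument carries.)

-- Guessed shape, not the actual asyncly definition:
newtype AsyncT m a = AsyncT
    { runAsyncT
        :: Maybe Context                                       -- scheduling context
        -> m ()                                                -- stop continuation (stp)
        -> (a -> Maybe Context -> Maybe (AsyncT m a) -> m ())  -- yield continuation (yld)
        -> m ()
    }

data Context = Context  -- placeholder; the real context carries scheduler state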
I want to have multiple versions of this implementation parameterized by a function, like this:

bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
    let run x = (runAsyncT x) Nothing stp yld
        yield a _ Nothing  = run $ f a
        yield a _ (Just r) = run $ f a `k` (bindWith k r f)
    in m Nothing stp yield
And then the bind function becomes:
(>>=) = bindWith (<>)
But this leads to a performance degradation of more than 10%. Inlining does not help; I tried the INLINE pragma as well as the "inline" GHC builtin. I thought this should be a more or less straightforward replacement, making the second version equivalent to the first one. But apparently there is something going on here that makes it perform worse.
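
To make those two mechanisms concrete, here is what they look like on a toy function (this is just syntax illustration, unrelated to bindWith):

module InlineDemo (quadruple) where

import GHC.Exts (inline)

-- Definition-site hint: the INLINE pragma.
{-# INLINE applyTwice #-}
applyTwice :: (a -> a) -> a -> a
applyTwice g = g . g

-- Call-site form: the 'inline' builtin forces inlining of one particular
-- occurrence, analogous to writing (>>=) = inline bindWith (<>).
quadruple :: Int -> Int
quadruple = inline applyTwice (* 2)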
I did not look at the core, stg or asm yet. Hoping someone can quickly comment on it. Any ideas why this is so? Can this be worked around somehow?
Thanks, Harendra