Optimization beyond the Module Border

Hi all,

I have noticed that there is a great difference between optimizing modules separately and all at once, e.g., with -fforce-recomp. I have seen examples with factors of up to 15 in run time (and even different behavior in connection with unsafePerformIO). Is there any option that makes GHC write out the intermediate optimization data it seems to use, so that module-wise compilation reaches the same efficiency?

Greetings
Bernd

| I have noticed that there is a great difference between optimizing
| modules separately and all at once, e.g., with -fforce-recomp. I have
| seen examples with factors of up to 15 in run time (and even different
| behavior in connection with unsafePerformIO).

GHC does a lot of cross-module inlining already, and *does* write stuff into interface files, provided you use -O.

I'm always interested in performance differences of a factor of 15 though! Can you supply an example (as small as poss) for us to look at?

Thanks
Simon
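(As a quick way to see what Simon means about interface files -- the module and function names below are invented for illustration -- you can compile a module with and without -O and dump its interface:)

-- Fast.hs: a toy module, just to have something to inspect.
module Fast (sq) where

sq :: Int -> Int
sq x = x * x

$ ghc -O2 -c Fast.hs
$ ghc --show-iface Fast.hi    -- with -O/-O2 the listing includes an unfolding for 'sq';
                              -- without -O it does not, so callers in other modules
                              -- have nothing to inline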

Simon Peyton-Jones wrote:
GHC does a lot of cross-module inlining already, and *does* write stuff into interface files, provided you use -O.
I used -O4. Is that the bad thing?
I'm always interested in performance differences of a factor of 15 though! Can you supply an example (as small as poss) for us to look at?
Yes certainly, although small will be a big problem, I guess. I admit that the factor 15 is a bit dubious since the fast run-time was so small (1.88 sec vs. 0.112 sec). I will see what I can do on the morrow.

bbr:
Simon Peyton-Jones wrote:
GHC does a lot of cross-module inlining already, and *does* write stuff into interface files, provided you use -O.
I used -O4. Is that the bad thing?
There's nothing above -O2. However, I think that's ok -- it clamps -ON for N > 2 down to -O2.
I'm always interested in performance differences of a factor of 15 though! Can you supply an example (as small as poss) for us to look at?
Yes certainly, although small will be a big problem, I guess. I admit that the factor 15 is a bit dubious since the fast run-time was so small (1.88 sec vs. 0.112 sec).
I will see what I can do on the morrow.
I'd be interested in any progress here -- we noticed issues with optimisations in the stream fusion package across module boundaries that we never tracked down. If there's some key things not firing, that would be good to know.

I'd be interested in any progress here -- we noticed issues with optimisations in the stream fusion package across module boundaries that we never tracked down. If there's some key things not firing, that would be good to know.
I suspect that if all modules are compiled -O0, then you recompile one module with -O2, high up in the dependency graph (i.e. it depends on many lower-level modules), plus all things that in turn depend on it (--make), you will not get the good performance you expect. None of the lower-level functions will have exported inlinings or fusion rules into the interface file. _All_ modules must be recompiled with -O2, especially the bottom of the dependency chain, to get the best benefit from optimisation. Regards, Malcolm
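(A minimal sketch of the situation Malcolm describes, with invented module names:)

-- Low.hs: the bottom of the dependency chain.  If this module was built
-- with -O0, its interface file Low.hi carries no unfolding for 'step',
-- so no caller can inline it, however the callers themselves are compiled.
module Low (step) where

step :: Int -> Int
step x = x + 1

-- High.hs: even when this module is recompiled with -O2, the call to
-- 'step' stays an out-of-line call unless Low was also built with -O/-O2.
module High (total) where

import Low (step)

total :: Int
total = sum (map step [1 .. 1000000])

(In that scenario, rebuilding only High with -O2, even via --make, does not regenerate Low.hi with the extra information; the whole tree has to be rebuilt with -O2.)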

| > I'd be interested in any progress here -- we noticed issues with
| > optimisations in the stream fusion package across module boundaries
| > that we never tracked down. If there's some key things not firing,
| > that would be good to know.
|
| I suspect that if all modules are compiled -O0, then you recompile one
| module with -O2, high up in the dependency graph (i.e. it depends on
| many lower-level modules), plus all things that in turn depend on it
| (--make), you will not get the good performance you expect. None of the
| lower-level functions will have exported inlinings or fusion rules into
| the interface file. _All_ modules must be recompiled with -O2,
| especially the bottom of the dependency chain, to get the best benefit
| from optimisation.

Absolutely correct. Should this be better documented? If so, would someone like to think where in GHC's user manual they would have looked (or did look), and send me some text that would have helped them, had it been there? As it were.

Simon

I suspect that if all modules are compiled -O0, then you recompile one module with -O2, high up in the dependency graph (i.e. it depends on many lower-level modules), plus all things that in turn depend on it (--make), you will not get the good performance you expect. None of the lower-level functions will have exported inlinings or fusion rules into the interface file. _All_ modules must be recompiled with -O2, especially the bottom of the dependency chain, to get the best benefit from optimisation.
Regards, Malcolm
I am very sorry, I think what Malcolm describes might be exactly what had happened. Now that I tried to blow up the example from 0.122 msec to get a more significant result, I can't reproduce the effect. Funny thing though, as I was pretty keen on doing a thorough job, since it was all about measuring the quality of the previous fortnight's work. Now I find that - after all - I did a much better job than it seemed yesterday :o)

So there may be two (minor) issues left, if you are interested.

Firstly, about profiling in connection with optimization. When I compiled things with -O2 AND -prof -auto-all, no profile would be written. Now you might think that having both at once is a silly idea, since the side effects of profiling might be the first to be "optimized" away. But I think it was not so silly after all, as I had introduced a lot of overhead into my programs which I was pretty sure could be optimized away. Hence, I was not at all interested in the unoptimized profile. And I think it is not so unusual to want to improve only those things that the compiler cannot improve by itself. Couldn't the profiling things be added AFTER all optimization was done?

And then, secondly, about the connection of optimization with side effects. I had programs behave differently when compiling them all in one go or module-wise. (And if I am not able to reproduce that effect as well, I will do a little merry dance!) Is this also interesting? Might it be connected with what Don mentioned about the stream fusion package? (Although I cannot remember any mention of side effects in Duncan's talk in Freiburg.)

Thanks for your time, and sorry once again for using the system all wrong!

Bernd

On Thu, Mar 20, 2008 at 09:47:28AM +0100, Bernd Brassel wrote:
When I compiled things with -O2 AND -prof -auto-all, no profile would be written.
This should work, for the reasons that you give. Did you use options like +RTS -p when running the program? If so, please give us an example to reproduce the problem.
And then, secondly, about the connection of optimization with side effects. I had programs behave differently when compiling them all in one go or module-wise.
Again, please tell us how to reproduce this. Thanks Ian
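(For reference, the usual sequence with the flags mentioned above -- Main.hs and the program name are just placeholders:)

$ ghc -O2 -prof -auto-all --make Main.hs -o prog
$ ./prog +RTS -p -RTS        # the cost-centre report is written to prog.prof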

bbr:
I suspect that if all modules are compiled -O0, then you recompile one module with -O2, high up in the dependency graph (i.e. it depends on many lower-level modules), plus all things that in turn depend on it (--make), you will not get the good performance you expect. None of the lower-level functions will have exported inlinings or fusion rules into the interface file. _All_ modules must be recompiled with -O2, especially the bottom of the dependency chain, to get the best benefit from optimisation.
Regards, Malcolm
I am very sorry, I think what Malcolm describes might be exactly what had happened. Now that I tried to blow up the example from 0.122 msec to get a more significant result, I can't reproduce the effect. Funny thing though, as I was pretty keen on doing a thorough job as it was all about measuring the quality of the work of the previous fortnight. Now I find that - after all - I did a much better job than it seemed yesterday :o)
So there may be two (minor) issues left if you would be interested. Firstly, about profiling in connection with optimization. When I compiled things with -O2 AND -prof -auto-all no profile would be written. Now you might think that having both at once is a silly idea, the side effects of profiling might be the first to be "optimized" away.
You almost always want to profile with full optimisations on. Otherwise it's not even close to measuring the kind of code you're actually running.
But I think it was not so silly after all as I had introduced a lot of overhead into my programs which I was pretty sure could be optimized away. Hence, I was not at all interested in the unoptimized profile. And I think it is not so unusual to want to improve only those things that the compiler cannot improve by itself. Couldn't the profiling things be added AFTER all optimization was done?
And then, secondly, about the connection of optimization with side effects. I had programs behave differently when compiling them all in one go or module-wise. (And if I am not able to reproduce that effect as well, I will do a little merry dance!) Is this also interesting? Might it be connected with what Don mentioned about the stream fusion package? (Although I cannot remember any mention of side effects in Duncan's talk in Freiburg.)
No, I can't think of any issue there. -- Don
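(One well-known way optimisation can change the behaviour of unsafePerformIO, independent of the fusion question: inlining may duplicate the hidden action, so a top-level value created with unsafePerformIO can be evaluated more than once depending on optimisation level and module structure. The usual guard is a NOINLINE pragma. A small sketch with invented names:)

import Data.IORef
import System.IO.Unsafe (unsafePerformIO)

-- A global mutable counter.  Without the NOINLINE pragma, -O is free to
-- inline the right-hand side at each use site, giving every site its own
-- IORef -- and whether that happens can depend on how the modules were
-- compiled.
counter :: IORef Int
counter = unsafePerformIO (newIORef 0)
{-# NOINLINE counter #-}

main :: IO ()
main = do
  modifyIORef counter (+ 1)
  readIORef counter >>= print   -- prints 1 as long as 'counter' is shared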

Don Stewart wrote:
You almost always want to profile with full optimisations on. Otherwise it's not even close to measuring the kind of code you're actually running.
Ian Lynagh wrote:
This should work, for the reasons that you give. Did you use options like +RTS -p when running the program?
Yes guys, you are sooo right... And I think it is really time for the Easter holidays! Not that I forgot to turn on the RTS option, but something nearly as stupid, if not worse: I looked into the wrong file. :o(( My gosh, before I stick around to collect the trophy for the most stupid mail on the list, I think I will go home now... Have a happy Easter! Bernd
participants (5): Bernd Brassel, Don Stewart, Ian Lynagh, Malcolm Wallace, Simon Peyton-Jones