inter module optimizations

I had posted some data on inter-module optimizations that I had calculated when splitting my program from one computational module into many different ones. Tim Chevalier suggested that my calculation could be interesting to the people here.

So I made the effort of preparing the various versions of my code and redoing the analysis more carefully. Unfortunately I had already begun renaming things without doing a darcs record, so in the split version some function names are different.

I have a tar.bz archive of 21KB. I did not know whether it is considered rude to send attachments, but if someone is interested I can send them the file.

Basically it mainly boils down to non-inlining of some important functions on a newtype

    type LatLocI = Word32
    newtype LatLoc = LatLoc LatLocI deriving (Eq,Ord)

because specialization should not be an issue, as I had already given specific signatures to my functions.

Also worth noting is that profiling with -O2 compilation makes one think that inlining (or using a single module) makes the program slower, whereas the opposite is true. I think that the profiling overheads are incorrectly estimated. I know that with -O2 one cannot expect profiling to be good, but it would be nice if it weren't so misleading.

Here is some data (obtained with a script that is also in the tar.bz archive):

******** allInOne: original program, monolithic main computational module
* timings of -O2 executable
7.67user 0.00system 0:07.69elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+894minor)pagefaults 0swaps
* timings of the executable with profiling
        total time  =       15.25 secs   (305 ticks @ 50 ms)
        total alloc = 5,888,786,120 bytes  (excludes profiling overheads)

******** splitModule NoReexport NoInline directives: split computational module, no export list for split modules
* timings of -O2 executable
10.14user 0.01system 0:10.17elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+901minor)pagefaults 0swaps
* timings of the executable with profiling
        total time  =       11.85 secs   (237 ticks @ 50 ms)
        total alloc = 5,888,780,912 bytes  (excludes profiling overheads)

******** splitModule Reexport NoInline directives: computational module, no export list for split modules, old module reexport using export list
* timings of -O2 executable
8.88user 0.00system 0:08.90elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+901minor)pagefaults 0swaps
* timings of the executable with profiling
        total time  =       12.20 secs   (244 ticks @ 50 ms)
        total alloc = 5,888,780,912 bytes  (excludes profiling overheads)

******** splitModule NoReexport Inline directives: split computational module, no export list for split modules, explicit inline directives
* timings of -O2 executable
6.44user 0.01system 0:06.46elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+895minor)pagefaults 0swaps
* timings of the executable with profiling
        total time  =       18.80 secs   (376 ticks @ 50 ms)
        total alloc = 5,374,883,312 bytes  (excludes profiling overheads)
*************

Fawzi
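For readers who have not used explicit inline directives, here is a minimal sketch of what they look like on small functions over the newtype from the message. The two types are copied from the post; the helpers `unLatLoc` and `stepLoc` are hypothetical and are not taken from the actual program. In a split layout these definitions would live in their own module. An INLINE pragma asks GHC to keep the function's unfolding in the interface file so that importing modules can inline it.

```haskell
import Data.Word (Word32)

-- Types as given in the message.
type LatLocI = Word32
newtype LatLoc = LatLoc LatLocI deriving (Eq, Ord)

-- Hypothetical helper (not from the original program): unwrap the newtype.
{-# INLINE unLatLoc #-}
unLatLoc :: LatLoc -> LatLocI
unLatLoc (LatLoc i) = i

-- Hypothetical helper (not from the original program): move to the next location.
{-# INLINE stepLoc #-}
stepLoc :: LatLoc -> LatLoc
stepLoc (LatLoc i) = LatLoc (i + 1)

main :: IO ()
main = print (unLatLoc (stepLoc (LatLoc 41)))  -- prints 42
```

Whether the pragma helps in a given program still has to be measured, as the timings above were.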

fmohamed:
[..]
To really understand what is going on, I suggest looking at the -ddump-simpl output as you change the inlining settings. Then you'll see how GHC is moving code about.
-- Don (who's spent the last 2 weeks playing the simplifier/inliner game)
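One way to try this suggestion on a single file is sketched below, assuming a reasonably recent GHC. The flags -ddump-simpl, -ddump-to-file and -dsuppress-all are real GHC flags, but the functions `bump` and `total` are hypothetical stand-ins and not code from the program discussed here. Compile once as written, then change INLINE to NOINLINE, recompile, and compare the two dump files to see whether the call to `bump` survives in the loop.

```haskell
{-# OPTIONS_GHC -O2 -ddump-simpl -ddump-to-file -dsuppress-all #-}
import Data.Word (Word32)

-- Hypothetical helper: switch INLINE to NOINLINE and compare the Core dumps.
{-# INLINE bump #-}
bump :: Word32 -> Word32
bump x = x * 2 + 1

-- Hypothetical loop that calls the helper on every iteration.
total :: Word32 -> Word32
total n = go 0 0
  where
    go acc i
      | i >= n    = acc
      | otherwise = go (acc + bump i) (i + 1)

main :: IO ()
main = print (total 10)  -- prints 100 (sum of the first ten odd numbers)
```

With -ddump-to-file the simplifier output goes to a file next to the source instead of the terminal, which makes it easier to diff the two variants.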

Donald Bruce Stewart wrote:
[..] To really understand what is going on, I suggest looking at the -ddump-simpl output as you change the inlining settings. Then you'll see how GHC is moving code about.
-- Don (who's spent the last 2 weeks playing the simplifier/inliner game)
Thanks, but actually (with Jeremy's and your suggestions on haskell-cafe about the INLINE directive) I have got back the performance that I had (actually even better than before), so I don't want to *really* understand it ;-). I was thinking that maybe someone else here would like to understand it.
What is especially annoying is that, with -O2, profiling gives exactly the opposite trend to the executable without profiling (the fastest program becomes the slowest and vice versa).
Fawzi
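The inverted trend can be checked directly from the figures posted in this thread. The sketch below only re-ranks those numbers (user time of the -O2 executable and the profiler's "total time"); the shortened labels and the function names are chosen here for illustration.

```haskell
import Data.List (sortOn)

-- (label, -O2 user seconds, profiled "total time" seconds),
-- copied from the figures posted in this thread.
runs :: [(String, Double, Double)]
runs =
  [ ("allInOne",                   7.67, 15.25)
  , ("split NoReexport NoInline", 10.14, 11.85)
  , ("split Reexport NoInline",    8.88, 12.20)
  , ("split NoReexport Inline",    6.44, 18.80)
  ]

-- Labels ordered from fastest to slowest according to the given timing.
rankBy :: ((String, Double, Double) -> Double) -> [String]
rankBy key = [l | (l, _, _) <- sortOn key runs]

byPlain, byProfiled :: [String]
byPlain    = rankBy (\(_, t, _) -> t)
byProfiled = rankBy (\(_, _, t) -> t)

main :: IO ()
main = do
  putStrLn "fastest to slowest, -O2 executable:"
  mapM_ (putStrLn . ("  " ++)) byPlain
  putStrLn "fastest to slowest, profiled executable:"
  mapM_ (putStrLn . ("  " ++)) byProfiled
  -- True when the profiled ranking is the exact reverse of the plain one.
  print (byPlain == reverse byProfiled)
```

For these four configurations the two rankings are exact mirror images, which is the effect described in the message.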