
fmohamed:
I had posted some data on inter-module optimizations that I had calculated when splitting my program from one computational module to many different ones.
Tim Chevalier suggested that my calculation could be interesting to the people here.
So I made the effort of preparing the various versions of my code and redoing the analysis more carefully. Unfortunately I had already begun renaming things without doing a darcs record, so in the split version some function names are different.
I have a tar.bz archive of 21KB. I did not know whether it is considered rude to send attachments, but if someone is interested I can send them the file.
Basically, it mainly boils down to the non-inlining of some important functions on a newtype:

    type LatLocI = Word32
    newtype LatLoc = LatLoc LatLocI deriving (Eq,Ord)

Specialization should not be an issue, as I had already given specific signatures to my functions.
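To make the point concrete, here is a minimal sketch of what explicit inline directives on such a newtype can look like. The type definitions are the ones above; the functions latX, latY and mkLatLoc and the 16-bit packing are hypothetical, invented only for illustration, and are not the functions from my program. In the real code these would sit in their own module, where the INLINE pragma asks GHC to expose the unfolding to importing modules; a single file cannot show that cross-module effect.

```haskell
module Main (main) where

import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word32)

type LatLocI = Word32

newtype LatLoc = LatLoc LatLocI deriving (Eq, Ord)

-- Hypothetical accessors (not from the original program): two 16-bit
-- coordinates packed into one Word32.

-- Low 16 bits.
{-# INLINE latX #-}
latX :: LatLoc -> LatLocI
latX (LatLoc w) = w .&. 0xFFFF

-- High 16 bits.
{-# INLINE latY #-}
latY :: LatLoc -> LatLocI
latY (LatLoc w) = w `shiftR` 16

-- Pack x into the low half and y into the high half.
{-# INLINE mkLatLoc #-}
mkLatLoc :: LatLocI -> LatLocI -> LatLoc
mkLatLoc x y = LatLoc ((y `shiftL` 16) .|. (x .&. 0xFFFF))

main :: IO ()
main = do
  let p = mkLatLoc 3 5
  print (latX p, latY p)
```

Without the pragmas GHC decides by itself whether such small functions are worth exposing across the module boundary, which is presumably where the split versions lose time.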
Also worth noting is that profiling with -O2 compilation makes one think that inlining (or using a single module) makes the program slower, whereas the opposite is true. I think that the profiling overhead is incorrectly evaluated. I know that with -O2 one cannot expect profiling to be good, but it would be nice if it weren't so misleading.
Here is some data (obtained with a script that is also in the tar.bz archive):
******** allInOne: original program, monolithic main computational module
* timings of -O2 executable
7.67user 0.00system 0:07.69elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+894minor)pagefaults 0swaps
* timings of the executable with profiling
    total time  = 15.25 secs (305 ticks @ 50 ms)
    total alloc = 5,888,786,120 bytes (excludes profiling overheads)

******** splitModule NoReexport NoInline directives: split computational module, no export list for split modules
* timings of -O2 executable
10.14user 0.01system 0:10.17elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+901minor)pagefaults 0swaps
* timings of the executable with profiling
    total time  = 11.85 secs (237 ticks @ 50 ms)
    total alloc = 5,888,780,912 bytes (excludes profiling overheads)

******** splitModule Reexport NoInline directives: computational module, no export list for split modules, old module reexport using export list
* timings of -O2 executable
8.88user 0.00system 0:08.90elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+901minor)pagefaults 0swaps
* timings of the executable with profiling
    total time  = 12.20 secs (244 ticks @ 50 ms)
    total alloc = 5,888,780,912 bytes (excludes profiling overheads)

******** splitModule NoReexport Inline directives: split computational module, no export list for split modules, explicit inline directives
* timings of -O2 executable
6.44user 0.01system 0:06.46elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+895minor)pagefaults 0swaps
* timings of the executable with profiling
    total time  = 18.80 secs (376 ticks @ 50 ms)
    total alloc = 5,374,883,312 bytes (excludes profiling overheads)
*************
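For easier comparison, the user times of the -O2 executables can be expressed relative to the monolithic version. The small sketch below only redoes that arithmetic on the numbers listed above; the names timings, baseline and relativeTo are made up for this illustration, and the short labels are my abbreviations of the configuration titles.

```haskell
module Main (main) where

import Text.Printf (printf)

-- User time (seconds) of the allInOne -O2 executable, copied from the data above.
baseline :: Double
baseline = 7.67

-- User times (seconds) of all four -O2 executables, copied from the data above.
timings :: [(String, Double)]
timings =
  [ ("allInOne", 7.67)
  , ("split NoReexport NoInline", 10.14)
  , ("split Reexport NoInline", 8.88)
  , ("split NoReexport Inline", 6.44)
  ]

-- Ratio of a timing to the reference timing (above 1 means slower).
relativeTo :: Double -> Double -> Double
relativeTo base t = t / base

-- Print one configuration with its time and its ratio to the baseline.
report :: (String, Double) -> IO ()
report (name, t) =
  printf "%-28s %6.2fs  x%.2f\n" name t (relativeTo baseline t)

main :: IO ()
main = mapM_ report timings
```

The ratios come out at about 1.32 and 1.16 for the two split versions without inline directives, and about 0.84 for the split version with explicit inline directives.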
Fawzi
To really understand what is going on, I suggest looking at the -ddump-simpl output as you change the inlining settings. Then you'll see how GHC is moving code about.

-- Don (who's spent the last 2 weeks playing the simplifier/inliner game)