
I decided to clean up my program by splitting it into different modules. As I was curious about the cost of splitting it, or, dually, the efficiency of the intermodule optimization, I timed it before and after the split. These are the results (ghc-6.6.20070129 on Linux AMD64):

Original: 3 modules, with the computational one having an export list
real 0m0.385s user 0m0.378s sys 0m0.003s time: 100%

Variant 1: split the computational module into 6 submodules without export lists, kept the computational module re-exporting the stuff with an export list
real 0m0.467s user 0m0.464s sys 0m0.003s time: 122%

Variant 2: like Variant 1 but removing the old computational module (no re-exporting module)
real 0m0.513s user 0m0.506s sys 0m0.003s time: 134%

All the functions in my modules have type-specialized signatures, so non-specialization should almost never be an issue. The module with the Main function was not the exported one.

So a 20% speed hit; I had hoped for 0, but it is not unbearable (even if my code has to be as fast as possible, this was a very short run), so probably I will keep it split to have cleaner code. Now I will have to write an export list for each module and see if things improve, and by how much. If someone has an idea on how else I can improve timings, please tell me.

Fawzi
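(A minimal sketch of the kind of re-exporting layout described in Variant 1; the module and function names below are hypothetical, not the actual program.)

-- Compute/Step.hs: one of the split submodules (hypothetical)
module Compute.Step (stepSystem) where

stepSystem :: Double -> Double
stepSystem x = x + 1

-- Compute/Energy.hs: another hypothetical submodule
module Compute.Energy (totalEnergy) where

totalEnergy :: [Double] -> Double
totalEnergy = sum

-- Compute.hs: facade module re-exporting the submodules,
-- keeping an explicit export list as in Variant 1
module Compute
    ( stepSystem
    , totalEnergy
    ) where

import Compute.Step   (stepSystem)
import Compute.Energy (totalEnergy)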

On 3/27/07, Fawzi Mohamed
I decided to clean up my program by splitting it into different modules. As I was curious about the cost of splitting it, or, dually, the efficiency of the intermodule optimization, I timed it before and after the split. These are the results (ghc-6.6.20070129 on Linux AMD64):
Original: 3 modules, with the computational one having an export list
real 0m0.385s user 0m0.378s sys 0m0.003s time: 100%
Variant 1: split the computational module into 6 submodules without export lists, kept the computational module re-exporting the stuff with an export list
real 0m0.467s user 0m0.464s sys 0m0.003s time: 122%
Variant 2: like Variant 1 but removing the old computational module (no re-exporting module)
real 0m0.513s user 0m0.506s sys 0m0.003s time: 134%
All the functions in my modules have type-specialized signatures, so non-specialization should almost never be an issue. The module with the Main function was not the exported one.
So a 20% speed hit; I had hoped for 0, but it is not unbearable (even if my code has to be as fast as possible, this was a very short run), so probably I will keep it split to have cleaner code. Now I will have to write an export list for each module and see if things improve, and by how much. If someone has an idea on how else I can improve timings, please tell me.
For starters, I'd question whether those results are statistically significant; your program doesn't run for very long. 20% of less than 0.5 seconds is short enough that the hit you're seeing could be affected by random noise. If there's a way to adjust the input to your program so that it runs for more than a few seconds, you may want to see what results you get that way.

Even so, intuitively I'd also expect to see a performance hit when splitting a program into multiple modules, as GHC's optimizer is designed with separate compilation as a consideration. As always, you probably need to do profiling in order to figure out whether it's worth bothering about.

Cheers,
Tim

-- Tim Chevalier * chevalier@alum.wellesley.edu * Often in error, never in doubt
Confused? See http://catamorphism.org/transition.html

At Tue, 27 Mar 2007 23:10:21 +0200, Fawzi Mohamed wrote:
If someone has an idea on how else I can improve timings please tell me.
I believe you are seeing a speed decrease, because GHC is not inlining functions as much when you split them into modules. If you add explicit inline statements, I think you should be able to get back to your original timings.

Below is an example of how to add INLINE statements.

someFunction :: (Word8 -> Maybe Word8 -> Bool) -> B.ByteString -> B.ByteString
someFunction = ...
{-# INLINE someFunction #-}

I am not sure what the downsides are. It probably makes the resulting binary bigger, and takes longer to compile. Though, probably not any worse than if you just put everything in one module?

It might be the case that you only need to INLINE one or two functions to get most of the speed back.

j.
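(A slightly more complete, self-contained sketch of the same idea, using a hypothetical module and function rather than the original poster's code:)

-- Numerics.hs (hypothetical module)
module Numerics (dot) where

-- Marking the exported worker INLINE lets callers in other modules
-- inline it, much as they could when everything lived in one module.
dot :: [Double] -> [Double] -> Double
dot xs ys = sum (zipWith (*) xs ys)
{-# INLINE dot #-}

-- Main.hs
module Main where

import Numerics (dot)

main :: IO ()
main = print (dot [1, 2, 3] [4, 5, 6])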

jeremy.shaw:
At Tue, 27 Mar 2007 23:10:21 +0200, Fawzi Mohamed wrote:
If someone has an idea on how else I can improve timings please tell me.
I believe you are seeing a speed decrease, because GHC is not inlining functions as much when you split them into modules. If you add explicit inline statements, I think you should be able to get back to your original timings.
Below is an example of how to add INLINE statements.
someFunction :: (Word8 -> Maybe Word8 -> Bool) -> B.ByteString -> B.ByteString
someFunction = ...
{-# INLINE someFunction #-}
I am not sure what the downsides are. It probably makes the resulting binary bigger, and takes longer to compile. Though, probably not any worse than if you just put everything in one module?
It might be the case that you only need to INLINE one or two functions to get most of the speed back.
Yes, INLINE and compile with -O2.

-- Don

On 3/27/07, Jeremy Shaw
At Tue, 27 Mar 2007 23:10:21 +0200, Fawzi Mohamed wrote:
If someone has an idea on how else I can improve timings please tell me.
I believe you are seeing a speed decrease, because GHC is not inlining functions as much when you split them into modules. If you add explicit inline statements, I think you should be able to get back to your original timings.
It could be inlining or it could be other optimizations. From the data the OP gives, I don't think it's possible to conclude which ones.
Below is an example of how to add INLINE statements.
someFunction :: (Word8 -> Maybe Word8 -> Bool) -> B.ByteString -> B.ByteString
someFunction = ...
{-# INLINE someFunction #-}
I am not sure what the downsides are. It probably makes the resulting binary bigger, and takes longer to compile. Though, probably not any worse than if you just put everything in one module?
That's not necessarily so; what if you end up adding an INLINE pragma for a function with several call sites that GHC wouldn't have inlined at all in the original single-module program?
It might be the case that you only need to INLINE one or two functions to get most of the speed back.
Yes, which is why it's a good idea to do profiling before sprinkling INLINE pragmas wantonly around your code.

Cheers,
Tim

-- Tim Chevalier * chevalier@alum.wellesley.edu * Often in error, never in doubt
Confused? See http://catamorphism.org/transition.html

Thanks!

On Mar 28, 2007, at 12:04 AM, Tim Chevalier wrote:
On 3/27/07, Jeremy Shaw wrote:
At Tue, 27 Mar 2007 23:10:21 +0200, Fawzi Mohamed wrote:
If someone has an idea on how else I can improve timings please tell me.
I believe you are seeing a speed decrease, because GHC is not inlining functions as much when you split them into modules. If you add explicit inline statements, I think you should be able to get back to your original timings.
It could be inlining or it could be other optimizations. From the data the OP gives, I don't think it's possible to conclude which ones.
I did longer runs (all compiled with -O2 as before) with the same results, and indeed with a couple of {-# INLINE function #-} pragmas I was able to recover the previous performance and actually even get better performance than before. Thanks!

An interesting thing is that the profiler actually was saying that the non-inlined version (and for that matter also the split version) was faster than the inlined or single-module versions. It would seem that the profiling overhead for the inlined functions is not correctly accounted for, so that they appear more expensive than the plain version when profiling.

Fawzi

On 3/27/07, Fawzi Mohamed
I did longer runs (all compiled with -O2 as before) with the same results, and indeed with a couple of {-# INLINE function #-} pragmas I was able to recover the previous performance and actually even get better performance than before.
If you had time to come up with a small test case that shows significantly better performance with the {-# INLINE #-} pragmas, I'm guessing that people on the glasgow-haskell-users mailing list might be interested in seeing it. (Of course, inlining is a black art, so it's also possible that with your example, you'll always know better than the compiler does.)
Thanks!
An interesting thing is that the profiler actually was saying that the non-inlined version (and for that matter also the split version) was faster than the inlined or single-module versions. It would seem that the profiling overhead for the inlined functions is not correctly accounted for, so that they appear more expensive than the plain version when profiling.
Profiling and optimization interact in ways that can result in profiling information being misleading, optimizations being effectively disabled, or both -- but, the overall numbers for time and space usage should always be accurate regardless of inlining or other optimizations, AFAIK. If you can explain what's happening in more detail (i.e., by showing the profiler output in the different cases), that might also be a topic for a ghc-users post.

Cheers,
Tim

-- Tim Chevalier * chevalier@alum.wellesley.edu * Often in error, never in doubt
Confused? See http://catamorphism.org/transition.html

| I believe you are seeing a speed decrease, because GHC is not inlining
| functions as much when you split them into modules. If you add
| explicit inline statements, I think you should be able to get back to
| your original timings.

Generally speaking GHC will inline *across* modules just as much as it does *within* modules, with a single large exception. If GHC sees that a function 'f' is called just once, it inlines it regardless of how big 'f' is. But once 'f' is exported, GHC can never see that it's called exactly once, even if that later turns out to be the case. This inline-once optimisation is pretty important in practice.

So: do not export functions that are not used outside the module (i.e. use an explicit export list, and keep it as small as possible).

Simon
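(A sketch of the pattern Simon describes, with hypothetical names: the export list is kept minimal so the helper stays local to the module.)

-- Solver.hs: only 'solve' appears in the export list; the helper 'go'
-- stays private, so GHC can see that it is called exactly once and may
-- inline it regardless of its size.
module Solver (solve) where

solve :: Double -> Double
solve x = go (x * x)

go :: Double -> Double
go y = y + 1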

Fawzi Mohamed wrote:
I decided to clean up my program by splitting it into different modules. As I was curious about the cost of splitting it, or, dually, the efficiency of the intermodule optimization, I timed it before and after the split. These are the results (ghc-6.6.20070129 on Linux AMD64):
A long long time ago, Hal Daume III made Haskell All-in-One, which takes a Haskell program and puts all the modules into one file. The difference in efficiency was discussed on one of these mailing lists then. Google should be able to turn up something (though it seems to no longer index the haskell.org mailing lists directly).

http://www.cs.utah.edu/~hal/HAllInOne/index.html
On 3/27/07, Derek Elkins
Fawzi Mohamed wrote:
I decided to clean up my program by splitting it into different modules. As I was curious about the cost of splitting it, or, dually, the efficiency of the intermodule optimization, I timed it before and after the split. These are the results (ghc-6.6.20070129 on Linux AMD64):
A long long time ago, Hal Daume III made Haskell All-in-One, which takes a Haskell program and puts all the modules into one file. The difference in efficiency was discussed on one of these mailing lists then. Google should be able to turn up something (though it seems to no longer index the haskell.org mailing lists directly).
-- Ricardo Guimarães Herrmann "Any sufficiently complicated C or Fortran program contains an ad hoc, informally specified, bug-ridden, slow implementation of half of Common Lisp" "Any sufficiently complicated Lisp or Ruby program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Haskell"
participants (7)
- Derek Elkins
- dons@cse.unsw.edu.au
- Fawzi Mohamed
- Jeremy Shaw
- Ricardo Herrmann
- Simon Peyton-Jones
- Tim Chevalier