Library-vs-local performance

Hi,

I've got a library that I'm in the process of uploading to Hackage (waiting for an account), but the darcs repo is here: http://graphics.cs.ucdavis.edu/~sdillard/Vec

I notice a slight drop in performance when I install the library using cabal. Maybe 10-20%, on one particular function. This is in comparison to when the library is 'local', as in, the source files are in the same directory as the client application. What could be causing the performance drop? The function in question requires impractical amounts of inlining (this is something of an experiment), but I don't see how installing it as a library affects that. The functions to be inlined are small, surely available in the .hi files. It's only when they are applied that they agglomerate into a big mess: 80-200K lines of Core.

The function in question is invertMany in examples/Examples.hs.

Scott

sedillard:
> Hi,
> I've got a library that I'm in the process of uploading to Hackage (waiting for an account), but the darcs repo is here:
> http://graphics.cs.ucdavis.edu/~sdillard/Vec
> I notice a slight drop in performance when I install the library using cabal. Maybe 10-20%, on one particular function. This is in comparison to when the library is 'local', as in, the source files are in the same directory as the client application.

Lack of unfolding and inlining when compiled in a package? Try compiling with -O2, for maximum unfolding.

> What could be causing the performance drop? The function in question requires impractical amounts of inlining (this is something of an experiment), but I don't see how installing it as a library affects that. The functions to be inlined are small, surely available in the .hi files.

You can check this via --show-iface.

> It's only when they are applied that they agglomerate into a big mess: 80-200K lines of Core.
> The function in question is invertMany in examples/Examples.hs.

-- Don

dons:
> sedillard:
> > I've got a library that I'm in the process of uploading to Hackage (waiting for an account), but the darcs repo is here:
> > http://graphics.cs.ucdavis.edu/~sdillard/Vec
> > I notice a slight drop in performance when I install the library using cabal. Maybe 10-20%, on one particular function. This is in comparison to when the library is 'local', as in, the source files are in the same directory as the client application.
> Lack of unfolding and inlining when compiled in a package? Try compiling with -O2, for maximum unfolding.
> > What could be causing the performance drop? The function in question requires impractical amounts of inlining (this is something of an experiment), but I don't see how installing it as a library affects that. The functions to be inlined are small, surely available in the .hi files.
> You can check this via --show-iface.
> > It's only when they are applied that they agglomerate into a big mess: 80-200K lines of Core.
> > The function in question is invertMany in examples/Examples.hs.

Some general remarks: GHC isn't (yet) a whole-program compiler by default, so it doesn't, by default, inline entire packages across package boundaries. So you can lose some specialisation/inlining, sometimes, by breaking things across module boundaries.

That said, it's entirely possible to write libraries in a way that specifically allows full inlining. The Data.Binary and Data.Array.Vector libraries are written in this style, for example, which means lots of {-# INLINE #-} pragmas, maximum unfolding and high optimisation levels.

-- Don
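The "fully inlinable library" style Don describes can be sketched roughly as follows. The module and function names here are illustrative only, not the actual Vec or Data.Array.Vector API: the point is simply that every exported definition is tiny and carries an INLINE pragma.

```haskell
-- Sketch of a library written for cross-package inlining: each
-- exported function is small and has an INLINE pragma, so its
-- unfolding is recorded in the .hi file, and a client compiled
-- with -O can inline it even across the package boundary.
-- (Hypothetical names, not the real Vec API.)
module Main where

dot3 :: (Double, Double, Double) -> (Double, Double, Double) -> Double
{-# INLINE dot3 #-}
dot3 (a, b, c) (x, y, z) = a * x + b * y + c * z

scale3 :: Double -> (Double, Double, Double) -> (Double, Double, Double)
{-# INLINE scale3 #-}
scale3 k (x, y, z) = (k * x, k * y, k * z)

main :: IO ()
main = print (dot3 (1, 2, 3) (scale3 2 (1, 1, 1)))  -- prints 12.0
```

Note that the pragmas only pay off when the client application itself is built with -O or -O2; without optimisation GHC ignores the recorded unfoldings.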

> That said, it's entirely possible to write libraries in a way that specifically allows full inlining. The Data.Binary and Data.Array.Vector libraries are written in this style, for example, which means lots of {-# INLINE #-} pragmas, maximum unfolding and high optimisation levels.
> -- Don

Every function has an inline pragma. Adding -O2 -funfolding-use-threshold999 -funfolding-creation-threshold999 does not significantly change the produced .hi files (--show-iface produces roughly the same files, just different headers). This makes sense to me because the library doesn't actually _do_ anything: there are no significant compiled functions; everything should be inlined. And since the .hi files are the same, I don't see why they wouldn't be.

The two scenarios are these:

1) Library is installed via cabal.
2) Library source lives in the same directory as the application, so that ghc --make Examples.hs also builds the library.

When compiling the application I set all knobs to 11. In case 1, ./Examples 3000000 runs in 6.9s; in case 2, 5.2s. The module structure is the same in both cases, so I don't know what inlining across module boundaries has to do with it.

By the way, the library is now on Hackage, http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Vec but the documentation does not show up. What do I have to do to make this happen?

Scott

sedillard:
> > That said, it's entirely possible to write libraries in a way that specifically allows full inlining. The Data.Binary and Data.Array.Vector libraries are written in this style, for example, which means lots of {-# INLINE #-} pragmas, maximum unfolding and high optimisation levels. -- Don
> Every function has an inline pragma. Adding -O2 -funfolding-use-threshold999 -funfolding-creation-threshold999 does not significantly change the produced .hi files (--show-iface produces roughly the same files, just different headers). This makes sense to me because the library doesn't actually _do_ anything. There are no significant compiled functions, everything should be inlined. And since the .hi files are the same, I don't see why they wouldn't be. The two scenarios are these:
> 1) Library is installed via cabal.
> 2) Library source lives in the same directory as the application, so that ghc --make Examples.hs also builds the library.

That's compiling Examples with full access to the source, though! So ghc has the entire source available. Once you've installed the library, however, only what is exposed via the .hi files can be used for optimisation purposes. So *something* is not being inlined fully (or some other optimisation is interfering). Boiling this down to the smallest test case you can would be *really* useful!

> When compiling the application I set all knobs to 11. In case 1, ./Examples 3000000 runs in 6.9s, case 2 in 5.2s. The module structure is the same in both cases, so I don't know what inlining across module boundaries has to do with it.
> By the way, the library is now on Hackage, http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Vec but the documentation does not show up. What do I have to do to make this happen?

Oh, assuming Haddock can process it, it'll appear in a few hours. Haddock is run periodically.

On Tue, Jun 24, 2008 at 02:01:58PM -0700, Donald Bruce Stewart wrote:
> > 1) Library is installed via cabal.
> > 2) Library source lives in the same directory as the application, so that ghc --make Examples.hs also builds the library.
> That's compiling Examples with full access to the source though! So ghc has the entire source available.

That shouldn't make any difference. I suspect a flag difference is to blame; giving "cabal build" the -v flag will show which flags it is using.

Thanks
Ian

I can't reproduce the behavior on any of the less egregiously inlined functions. For everything else the running times are the same using either local packages or installed libraries.

On Tue, Jun 24, 2008 at 3:16 PM, Ian Lynagh <igloo@earth.li> wrote:
> On Tue, Jun 24, 2008 at 02:01:58PM -0700, Donald Bruce Stewart wrote:
> > > 1) Library is installed via cabal.
> > > 2) Library source lives in the same directory as the application, so that ghc --make Examples.hs also builds the library.
> > That's compiling Examples with full access to the source though! So ghc has the entire source available.
> That shouldn't make any difference. I suspect a flag difference is to blame; giving "cabal build" the -v flag will show which flags it is using.

I've taken all optimization flags out of the .cabal file. They don't have any effect. My understanding of things is this (please correct if wrong): all functions have inline pragmas, and all are small (1 or 2 lines), so their definitions are all spewed into the .hi files. So in both scenarios (library vs local) GHC can "see" the whole library. Since every function is inlined, it doesn't matter what flags the library is compiled with; that compiled code will never be used so long as the application is compiled with optimization on.

Now the particulars of the situation are this: the function in question is inlined very deeply, it has many instance constraints, and during simplification the Core blows up to _ridiculous_ sizes. (Compilation with -ddump-simpl is taking about 5-10 min.) I think I'm pushing the compiler to unreasonable limits, and I think maybe something non-obvious is going on inside.

On the other hand, pushing the compiler in this way gets me a 3x speedup, which is nothing to sneeze at. In the meantime I'll see what I can do to make this function (Gaussian elimination) more amenable to simplification. The rest of the library works great.

Scott
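Scott's claims about what ends up in the .hi files are checkable directly. A sketch of the relevant GHC invocations (the interface-file path here is illustrative; it depends on where cabal put the build output):

```
# Show what actually landed in an interface file, including the
# unfoldings recorded for INLINE'd definitions:
ghc --show-iface dist/build/Data/Vec.hi

# Dump the simplified Core for the client program, to see how far
# the inlining really went (this is the output that blows up to
# 80-200K lines here):
ghc -O2 --make Examples.hs -ddump-simpl > Examples.simpl
```

Comparing the -ddump-simpl output of the "local" and "installed" builds is the most direct way to see which call is failing to inline.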

sedillard:
> I can't reproduce the behavior on any of the less egregiously inlined functions. For everything else the running times are the same using either local packages or installed libraries.
> On Tue, Jun 24, 2008 at 3:16 PM, Ian Lynagh <igloo@earth.li> wrote:
> > On Tue, Jun 24, 2008 at 02:01:58PM -0700, Donald Bruce Stewart wrote:
> > > > 1) Library is installed via cabal.
> > > > 2) Library source lives in the same directory as the application, so that ghc --make Examples.hs also builds the library.
> > > That's compiling Examples with full access to the source though! So ghc has the entire source available.
> > That shouldn't make any difference. I suspect a flag difference is to blame; giving "cabal build" the -v flag will show which flags it is using.
> I've taken all optimization flags out of the .cabal file. They don't have any effect. My understanding of things is this (please correct if wrong): all functions have inline pragmas, and all are small (1 or 2 lines), so their definitions are all spewed into the .hi files. So in both scenarios (library vs local) GHC can "see" the whole library. Since every function is inlined, it doesn't matter what flags the library is compiled with; that compiled code will never be used so long as the application is compiled with optimization on.
> Now the particulars of the situation are this: the function in question is inlined very deeply, it has many instance constraints, and during simplification the Core blows up to _ridiculous_ sizes. (Compilation with -ddump-simpl is taking about 5-10 min.) I think I'm pushing the compiler to unreasonable limits, and I think maybe something non-obvious is going on inside.
> On the other hand, pushing the compiler in this way gets me a 3x speedup, which is nothing to sneeze at. In the meantime I'll see what I can do to make this function (Gaussian elimination) more amenable to simplification. The rest of the library works great.

You might want to give the simplifier enough time to unwind things. I use, e.g.

    -O2 -fvia-C -optc-O2 -fdicts-cheap -fno-method-sharing -fmax-simplifier-iterations10 -fliberate-case-threshold100

in my ghc-options for 'whole program' libraries. Raise these limits if you find they're having an effect.

-- Don
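For reference, flags like these would be carried in the ghc-options field of the package description. This is an illustrative .cabal fragment, not the actual Vec or uvector cabal file (and note the spellings are GHC 6.8-era; -fvia-C, for instance, was later removed from GHC):

```
library
  exposed-modules: Data.Vec
  ghc-options:     -O2 -fvia-C -optc-O2 -fdicts-cheap -fno-method-sharing
                   -fmax-simplifier-iterations10 -fliberate-case-threshold100
```

As the thread goes on to establish, ghc-options in the library section only affects how the library's own object code is compiled; a client that wants the cross-package inlining still has to pass -O2 (and any raised thresholds) when compiling itself.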

On Tue, Jun 24, 2008 at 3:51 PM, Don Stewart <dons@galois.com> wrote:
> > I've taken all optimization flags out of the .cabal file. They don't have any effect. My understanding of things is this (please correct if wrong): all functions have inline pragmas, and all are small (1 or 2 lines), so their definitions are all spewed into the .hi files. So in both scenarios (library vs local) GHC can "see" the whole library. Since every function is inlined, it doesn't matter what flags the library is compiled with; that compiled code will never be used so long as the application is compiled with optimization on.
> You might want to give the simplifier enough time to unwind things. I use, e.g.
>     -O2 -fvia-C -optc-O2 -fdicts-cheap -fno-method-sharing -fmax-simplifier-iterations10 -fliberate-case-threshold100
> in my ghc-options for 'whole program' libraries.
> Raise these limits if you find they're having an effect
> -- Don

Yeah, I saw those in your uvector library and was going to ask: what do they do? Are they documented anywhere? I can't find any info on them. Specifically, what is the case liberation threshold? (Can't even find that on Google.) That sounds germane, because the function in question is one of the few with branches.

And what effect does -fvia-C -optc-O2 have? Those refer to the generation of machine code, do they not? If the library is essentially a core-only library, why use them? As far as I can tell, even -O2 is ineffectual when compiling the library. 'Compiling' here is even a misnomer; we're just transliterating from Haskell to Core.

Scott

sedillard:
> On Tue, Jun 24, 2008 at 3:51 PM, Don Stewart <dons@galois.com> wrote:
> > > I've taken all optimization flags out of the .cabal file. They don't have any effect. My understanding of things is this (please correct if wrong): all functions have inline pragmas, and all are small (1 or 2 lines), so their definitions are all spewed into the .hi files. So in both scenarios (library vs local) GHC can "see" the whole library. Since every function is inlined, it doesn't matter what flags the library is compiled with; that compiled code will never be used so long as the application is compiled with optimization on.
> > You might want to give the simplifier enough time to unwind things. I use, e.g.
> >     -O2 -fvia-C -optc-O2 -fdicts-cheap -fno-method-sharing -fmax-simplifier-iterations10 -fliberate-case-threshold100
> > in my ghc-options for 'whole program' libraries.
> > Raise these limits if you find they're having an effect -- Don
> Yeah, I saw those in your uvector library and was going to ask: what do they do? Are they documented anywhere? I can't find any info on them. Specifically, what is the case liberation threshold? (Can't even find that on Google.) That sounds germane, because the function in question is one of the few with branches.
> And what effect does -fvia-C -optc-O2 have? Those refer to the generation of machine code, do they not? If the library is essentially a core-only library, why use them? As far as I can tell, even -O2 is ineffectual when compiling the library. 'Compiling' here is even a misnomer; we're just transliterating from Haskell to Core.

Nope: there's a lot of optimisation taking place in the core-to-core phase, to ensure the Core that gets unfolded into your .hi files is as nice as possible. And then there are still things that actually stay as calls into your compiled library; for those, you'll want direct jumps and so forth, which you get with -fvia-C -optc-O2 and above.

See my recent post on micro-optimisations.

-- Don

On Tue, Jun 24, 2008 at 4:15 PM, Don Stewart <dons@galois.com> wrote:
> Nope: there's a lot of optimisation taking place in the core-to-core phase, to ensure the Core that gets unfolded into your .hi files is as nice as possible. And then there are still things that actually stay as calls into your compiled library; for those, you'll want direct jumps and so forth, which you get with -fvia-C -optc-O2 and above.
> See my recent post on micro-optimisations.
> -- Don

Fair enough, but I don't think that's what's going on here specifically. I can't get ghc-options to effect any change, one way or the other. I guess it's a mystery for now. Thanks for the replies.

Scott
participants (3):
- Don Stewart
- Ian Lynagh
- Scott Dillard