vector stream fusion, inlining and compilation time

hi, two questions in one post: i've been hunting down some performance problems in DSP code using vector and the single most important transformation seems to be throwing in INLINE pragmas for any function that uses vector combinators and is to be called from higher-level code. failing to do so seems to prevent vector operations from being fused and results in big performance hits (the good news is that the optimized code is quite competitive). does anybody have some more info about the do's and don'ts when programming with vector? the downside after adding the INLINE pragmas is that now some of my modules take _really_ long to compile (up to a couple of minutes); any ideas where i can start looking to bring the compilation times down again? i'm compiling with -O2 -funbox-strict-fields instead of -Odph (with ghc 6.10.4 on OSX 10.4), because it's faster for some of my code, but -O2 vs. -Odph doesn't make a noticable difference in compilation time. many thanks, <sk>

sk:
hi,
two questions in one post:
i've been hunting down some performance problems in DSP code using vector and the single most important transformation seems to be throwing in INLINE pragmas for any function that uses vector combinators and is to be called from higher-level code. failing to do so seems to prevent vector operations from being fused and results in big performance hits (the good news is that the optimized code is quite competitive). does anybody have some more info about the do's and don'ts when programming with vector?
Always inline any combination of things that are expressed in terms of vector combinators, so that the combination of your code can hope to fuse as well.
the downside after adding the INLINE pragmas is that now some of my modules take _really_ long to compile (up to a couple of minutes); any ideas where i can start looking to bring the compilation times down again?
I'm not sure there's much we can do there.
i'm compiling with -O2 -funbox-strict-fields instead of -Odph (with ghc 6.10.4 on OSX 10.4), because it's faster for some of my code, but -O2 vs. -Odph doesn't make a noticable difference in compilation time.
-Odph should make it easier for some things to fuse -- and get better code. But Roman can say more. -- Don

On 05/03/2010, at 04:34, stefan kersten wrote:
i've been hunting down some performance problems in DSP code using vector and the single most important transformation seems to be throwing in INLINE pragmas for any function that uses vector combinators and is to be called from higher-level code. failing to do so seems to prevent vector operations from being fused and results in big performance hits (the good news is that the optimized code is quite competitive). does anybody have some more info about the do's and don'ts when programming with vector?
This is a general problem when working with RULES-based optimisations. Here is an example of what happens: suppose we have foo :: Vector Int -> Vector Int foo xs = map (+1) xs Now, GHC will generate a nice tight loop for this but if in a different module, we have something like this: bar xs = foo (foo xs) then this won't fuse because (a) foo won't be inlined and (b) even if GHC did inline here, it would inline the nice tight loop which can't possibly fuse instead of the original map which can. By slapping an INLINE pragma on foo, you're telling GHC to (almost) always inline the function and to use the original definition for inlining, thus giving it a chance to fuse. GHC could be a bit cleverer here (perhaps by noticing that the original definition is small enough to inline and keeping it) but in general, you have to add INLINE pragmas in such cases if you want to be sure your code fuses. A general-purpose mechanism for handling situations like this automatically would be great but we haven't found a good one so far.
the downside after adding the INLINE pragmas is that now some of my modules take _really_ long to compile (up to a couple of minutes); any ideas where i can start looking to bring the compilation times down again?
Alas, stream fusion (and fusion in general, I guess) requires what I would call whole loop compilation - you need to inline everything into loops. That tends to be slow. I don't know what your code looks like but you could try to control inlining a bit more. For instance, if you have something like this: foo ... = ... map f xs ... where f x = ... you could tell GHC not to inline f until fairly late in the game by adding {-# INLINE [0] f #-} to the where clause. This helps sometimes.
i'm compiling with -O2 -funbox-strict-fields instead of -Odph (with ghc 6.10.4 on OSX 10.4), because it's faster for some of my code, but -O2 vs. -Odph doesn't make a noticable difference in compilation time.
If you're *really* interested in performance, I would suggest using GHC head. It really is much better for this kind of code (although not necessarily faster wrt to compilation times). This is what -Odph does: -- -Odph is equivalent to -- -- -O2 optimise as much as possible -- -fno-method-sharing sharing specialisation defeats fusion -- sometimes -- -fdicts-cheap always inline dictionaries -- -fmax-simplifier-iterations20 this is necessary sometimes -- -fsimplifier-phases=3 we use an additional simplifier phase -- for fusion -- -fno-spec-constr-threshold run SpecConstr even for big loops -- -fno-spec-constr-count SpecConstr as much as possible I'm surprised -Odph doesn't produce faster code than -O2. In any case, you could try turning these flags on individually (esp. -fno-method-sharing and the spec-constr flags) to see how they affect performance and compilation times. Roman

This is a general problem when working with RULES-based optimisations. Here is an example of what happens: suppose we have
foo :: Vector Int -> Vector Int foo xs = map (+1) xs
Now, GHC will generate a nice tight loop for this but if in a different module, we have something like this:
bar xs = foo (foo xs)
then this won't fuse because (a) foo won't be inlined and (b) even if GHC did inline here, it would inline the nice tight loop which can't possibly fuse instead of the original map which can. By slapping an INLINE pragma on foo, you're telling GHC to (almost) always inline the function and to use the original definition for inlining, thus giving it a chance to fuse.
thanks for the insight, roman!
the downside after adding the INLINE pragmas is that now some of my modules take _really_ long to compile (up to a couple of minutes); any ideas where i can start looking to bring the compilation times down again?
Alas, stream fusion (and fusion in general, I guess) requires what I would call whole loop compilation - you need to inline everything into loops. That tends to be slow. I don't know what your code looks like but you could try to control inlining a bit more. For instance, if you have something like this:
foo ... = ... map f xs ... where f x = ...
you could tell GHC not to inline f until fairly late in the game by adding
{-# INLINE [0] f #-}
to the where clause. This helps sometimes.
thanks, i'll check it out.
I'm surprised -Odph doesn't produce faster code than -O2. In any case, you could try turning these flags on individually (esp. -fno-method-sharing and the spec-constr flags) to see how they affect performance and compilation times.
in the end it turned out that i had forgotten another INLINE pragma and in my crude benchmarks -O2 and -Odph give basically the same results, -O2 being a little faster. i hope i'll have time next week to do proper benchmarks, and i also want to try ghc HEAD with the llvm patches. conv_1 conv_2 conv_3 -Odph 1.004 2.715 1.096 -O2 1.000 2.710 1.097 i'm still curious, though, why my three versions of direct convolution perform so differently (see attached file). in particular, i somehow expected conv_3 to be the slowest and conv_2 to perform similar to conv_1. any ideas? i haven't had a look at the core yet, mainly because i'm lacking the expertise ... <sk>

On 06/03/2010, at 03:10, stefan kersten wrote:
i'm still curious, though, why my three versions of direct convolution perform so differently (see attached file). in particular, i somehow expected conv_3 to be the slowest and conv_2 to perform similar to conv_1. any ideas? i haven't had a look at the core yet, mainly because i'm lacking the expertise ...
Hmm, one problem is that the current definition of reverse is suboptimal to say the least. I'll fix that. Could you perhaps send me your complete benchmark? Roman

I've run into an issue with inlining that I'm not sure how to work around. I am instantiating some pre-existing type classes with Vector-based types. There already exist generic functions in modules I do not control that use this type class, and they are not tagged with the INLINE pragma. I am doubtful, but I figure it is worth at least asking: is there some practical workaround for this kind of situation that anybody knows about? - Jake

jake.mcarthur:
I've run into an issue with inlining that I'm not sure how to work around. I am instantiating some pre-existing type classes with Vector-based types. There already exist generic functions in modules I do not control that use this type class, and they are not tagged with the INLINE pragma. I am doubtful, but I figure it is worth at least asking: is there some practical workaround for this kind of situation that anybody knows about?
I don't know of a way, other than patching the library code. If it makes a difference to you, it probably makes a difference to lots of people. -- Don

I think Jake is referring to my vector-space package. He did the work of
writing 171 INLINE pragmas, covering lots of methods and standalone function
defs. I'm simultaneously grateful for the effort and repelled by the added
syntactic noise. Also concerned about the impact of all these directives on
other uses of vector-space. If all this inlining is a uniform win, I'd
rather ghc did it for me. If the ghc implementers have reasons not to do so
in general, then I'd expect that some of those reasons apply to vector-space
(and many other libraries) as well. It's not like I'm applying some kind of
domain-specific understanding that ghc doesn't have access to.
I'm raising my concerns here in the hopes of stimulating some creative
thinking and suggestions about addressing this sort of situation more
generally.
- Conal
On Sun, Mar 7, 2010 at 9:23 PM, Don Stewart
jake.mcarthur:
I've run into an issue with inlining that I'm not sure how to work around. I am instantiating some pre-existing type classes with Vector-based types. There already exist generic functions in modules I do not control that use this type class, and they are not tagged with the INLINE pragma. I am doubtful, but I figure it is worth at least asking: is there some practical workaround for this kind of situation that anybody knows about?
I don't know of a way, other than patching the library code. If it makes a difference to you, it probably makes a difference to lots of people.
-- Don _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Tue, Mar 9, 2010 at 11:53 AM, Conal Elliott
I think Jake is referring to my vector-space package. He did the work of writing 171 INLINE pragmas, covering lots of methods and standalone function defs. I'm simultaneously grateful for the effort and repelled by the added syntactic noise. Also concerned about the impact of all these directives on other uses of vector-space. If all this inlining is a uniform win, I'd rather ghc did it for me.
Alas, it very much is not easy to predict. The unfortunate thing about inline directives is that each individual one really can have a substantial, but not necessarily predictable, effect on the performance of an application. I have seen large improvements in performance, large drops in performance, nothing at all, and everything in between, and I have yet to develop a consistently successful intuition about what will work well, and when.

Here's a transcript from a conversation I had with Conal on IRC. tl;dr: <conal> cross-module inlining is only possible because ghc stashes a definition in a .hi, iuuc. i'm suggesting that the stashed definition either (a) never include further inlinings, or (b) be augmented by such a definition. Full transcript: 11:23:10 <conal> jmcarthur: i'm wondering what to do about INLINE pragmas for vector-space and other libraries. i love optimizability and clean/elegant/terse code. and i don't know how to resolve that tension. 11:23:21 <jmcarthur> conal: yeah, me either. it's annoying 11:23:41 <jmcarthur> conal: a compiler feature to do it more succinctly would be nice, if we can come up with one 11:23:52 <conal> jmcarthur: i'm thinking exactly the same 11:24:04 <conal> jmcarthur: a ghc flag that does what you did manually 11:24:41 <jmcarthur> conal: yeah, but we still need to do better than inlining *all* functions. we need to be able to tell it we want it to inline all functions satisfying some predicate or something 11:25:07 <jmcarthur> like, there's no point in forcing to inline functions having absolutely nothing to do with vector, for example 11:25:18 <conal> jmcarthur: i wonder. ghc already has some heuristics. do we really want anything different/unusual? 11:25:26 <jmcarthur> then again, combinators that don't inline and get used in a vector function later might still be annoying 11:26:08 <conal> jmcarthur: maybe some kind of demand-driven mechanism 11:26:21 <jmcarthur> conal: that's what i was thinking would be best 11:26:28 <conal> jmcarthur: ie pull inlining rather than push them. or some combo. 11:27:21 <conal> jmcarthur: i don't think this issue is specific to either vector fusion or to the vector-space package. 11:27:28 <jmcarthur> conal: actually, this is about rewrite rules more than inlining 11:27:40 <jmcarthur> conal: maybe if we focus on the rewrite rules we can think of something nicer 11:27:46 <conal> jmcarthur: ah, yeah. 11:28:32 <conal> jmcarthur: have you talked with they ghc guys about this issue? i wonder what practice they'd advise for use with the current ghc 11:28:54 <jmcarthur> i have not 11:29:47 <conal> jmcarthur: how did the inlining/rewriting concern arise for vector fusion and the vector-space package? 11:30:16 <jmcarthur> conal: i assume you read the email i linked to? 11:30:27 <jmcarthur> this one: http://www.haskell.org/pipermail/haskell-cafe/2010-March/074153.html 11:30:34 <conal> jmcarthur: yes. i'l reread now. 11:31:03 <jmcarthur> conal: "in general, you have to add INLINE pragmas in such cases if you want to be sure your code fuses. A general-purpose mechanism for handling situations like this automatically would be great but we haven't found a good one so far." 11:31:14 <jmcarthur> i think the most relevant line 11:31:49 <conal> jmcarthur: thx. and the difficulty (with current ghc) is specifically cross-module, right? 11:32:00 <jmcarthur> conal: that is my understanding 11:32:10 <jmcarthur> but perhaps it is more complex 11:32:49 <conal> jmcarthur: if so, i wonder if ghc could be fixed to inline between modules according to the same heuristics as within a module. 11:34:36 <jmcarthur> conal: maybe. 11:35:34 <conal> jmcarthur: part of my discomfort is that i don't know whether the INLINE directives are more helpful or more harmful under all uses. if they were generally helpful, i imagine ghc would do it. 11:56:56 <jmcarthur> me too 11:57:46 <conal> jmcarthur: i just found that haskell-cafe thread and added a reply. 11:58:31 <conal> jmcarthur: hoping that don, roman, etc will have some ideas in addressing that discomfort. 12:09:58 <jmcarthur> conal: apparently the real trick is that GHC will not inline functions in a function that is annotated INLINE, meaning that rewrite rules can fire on the outermost rule before firing on inner ones 12:10:30 <jmcarthur> conal: i think it would be nice if we could come up with a way for rewrite rules to affect GHC's inliner 12:10:44 <conal> jmcarthur: yeah. maybe INLINE ought to be decomposed into two sub-meanings. 12:10:45 <jmcarthur> then it would only happen when necessary 12:11:17 <jmcarthur> well, the fact that it forces that function to be inlined is also good though 12:11:33 <jmcarthur> which is apparently important across module boundaries 12:11:45 <conal> jmcarthur: maybe ghc could *never* inline functions into an inline body. and then do some caching to avoid redundant work. 12:12:00 <jmcarthur> perhaps. still leaves the cross-module inlining issue though 12:12:32 <jmcarthur> i suspect this is also architectural 12:12:44 <jmcarthur> ghc doesn't know if it will inline a function across a module boundary in advance 12:12:52 <jmcarthur> therefore it goes ahead and inlines into it 12:13:04 <jmcarthur> *inlines other functions into it 12:13:31 <conal> jmcarthur: i don't understand how module boundaries come into play 12:14:40 <jmcarthur> conal: my suspicion is that because ghc builds modules separately it can't know whether a function will be inlined in another module, so if it's not marked INLINE it feels free to inline other functions into it as it pleases 12:15:17 <jmcarthur> conal: but a function marked INLINE is clearly going to be inlined everywhere it's used, so ghc can go ahead and avoid inlining into it 12:17:07 <jmcarthur> conal: perhaps a way around it would be to avoid making inlining decisions about exported functions at all until link time 12:17:20 <jmcarthur> but that would stink a little 12:17:24 <conal> jmcarthur: maybe so. sounds like a place for improving ghc. it makes an invalid assumption: that a def not marked INLINE won't be inlined. 12:17:25 <conal> jmcarthur: instead, it could be prepared to inline elsewhere. 12:17:25 <conal> jmcarthur: i'm confused though. ghc *is* prepared to inline elsewhere. 12:18:23 <jmcarthur> conal: well, it's not that ghc makes an invalid assumption, i think. it just doesn't take certain steps unless you tell it to explicitly 12:18:58 <conal> jmcarthur: the invalid assumption i mean is that it's not worth saving an inline-and-optimze-friendly def. 12:19:43 <conal> jmcarthur: cross-module inlining is only possible because ghc stashes a definition in a .hi, iuuc. i'm suggesting that the stashed definition either (a) never include further inlinings, or (b) be augmented by such a definition. 12:20:07 <jmcarthur> conal: i agree with that suggestion, if our understanding of how it works is correct 12:20:54 <conal> jmcarthur: okay. let's take this understanding into the haskell-cafe discussion and see what evolves. okay? - Jake

This is my understanding:
Old story (GHC 6.12.1 (?) and below):
1) Function bodies are only optimised if they are not marked INLINE.
The assumption is that INLINE bodies will be inlined and then
optimised at the point where they are inlined.
2) Unfoldings are exported in .hi files for functions if they are
nonrecursive and "small enough" or marked INLINE
New story (GHC HEAD):
1) Function bodies are optimised regardless of whether they are marked
INLINE or not.
2) Unfoldings are exported in .hi files for functions if they are
nonrecursive and "small enough" or marked INLINE, but the unfolding
that is exported is the _unoptimised_ version!
- It is important to export the unoptimised version because the
optimisation may destroy information we would have relied upon in rule
matching. E.g. the optimised version of a list producer will not use
"build", but we need to be able to spot the "build" in rules to do
foldr/build fusion.
The new story is much more robust - adding INLINE pragmas should be a
much safer thing to do now, because even if the function is not
inlined you will still get an optimised definition for that function.
Other relevant points:
a) There is a new flag, -fexpose-all-unfoldings, in GHC HEAD. This
does what you want - compile vector-space with this flag on and
unfoldings for everything will show up in the .hi file, with no INLINE
annotations necessary. This was intended for supercompilers (such as
the one I'm working on :-) but you can probably use for this as well.
b) GHC already treats RULES differently for the purposes of inlining.
It is very eager to inline into the arguments of a function you have
RULES for, because doing so may let it spot extra chances for the rule
to fire.
Hope that helps,
Max
On 10 March 2010 16:23, Jake McArthur
Here's a transcript from a conversation I had with Conal on IRC.
tl;dr: <conal> cross-module inlining is only possible because ghc stashes a definition in a .hi, iuuc. i'm suggesting that the stashed definition either (a) never include further inlinings, or (b) be augmented by such a definition.
Full transcript:
11:23:10 <conal> jmcarthur: i'm wondering what to do about INLINE pragmas for vector-space and other libraries. i love optimizability and clean/elegant/terse code. and i don't know how to resolve that tension. 11:23:21 <jmcarthur> conal: yeah, me either. it's annoying 11:23:41 <jmcarthur> conal: a compiler feature to do it more succinctly would be nice, if we can come up with one 11:23:52 <conal> jmcarthur: i'm thinking exactly the same 11:24:04 <conal> jmcarthur: a ghc flag that does what you did manually 11:24:41 <jmcarthur> conal: yeah, but we still need to do better than inlining *all* functions. we need to be able to tell it we want it to inline all functions satisfying some predicate or something 11:25:07 <jmcarthur> like, there's no point in forcing to inline functions having absolutely nothing to do with vector, for example 11:25:18 <conal> jmcarthur: i wonder. ghc already has some heuristics. do we really want anything different/unusual? 11:25:26 <jmcarthur> then again, combinators that don't inline and get used in a vector function later might still be annoying 11:26:08 <conal> jmcarthur: maybe some kind of demand-driven mechanism 11:26:21 <jmcarthur> conal: that's what i was thinking would be best 11:26:28 <conal> jmcarthur: ie pull inlining rather than push them. or some combo. 11:27:21 <conal> jmcarthur: i don't think this issue is specific to either vector fusion or to the vector-space package. 11:27:28 <jmcarthur> conal: actually, this is about rewrite rules more than inlining 11:27:40 <jmcarthur> conal: maybe if we focus on the rewrite rules we can think of something nicer 11:27:46 <conal> jmcarthur: ah, yeah. 11:28:32 <conal> jmcarthur: have you talked with they ghc guys about this issue? i wonder what practice they'd advise for use with the current ghc 11:28:54 <jmcarthur> i have not 11:29:47 <conal> jmcarthur: how did the inlining/rewriting concern arise for vector fusion and the vector-space package? 11:30:16 <jmcarthur> conal: i assume you read the email i linked to? 11:30:27 <jmcarthur> this one: http://www.haskell.org/pipermail/haskell-cafe/2010-March/074153.html 11:30:34 <conal> jmcarthur: yes. i'l reread now. 11:31:03 <jmcarthur> conal: "in general, you have to add INLINE pragmas in such cases if you want to be sure your code fuses. A general-purpose mechanism for handling situations like this automatically would be great but we haven't found a good one so far." 11:31:14 <jmcarthur> i think the most relevant line 11:31:49 <conal> jmcarthur: thx. and the difficulty (with current ghc) is specifically cross-module, right? 11:32:00 <jmcarthur> conal: that is my understanding 11:32:10 <jmcarthur> but perhaps it is more complex 11:32:49 <conal> jmcarthur: if so, i wonder if ghc could be fixed to inline between modules according to the same heuristics as within a module. 11:34:36 <jmcarthur> conal: maybe. 11:35:34 <conal> jmcarthur: part of my discomfort is that i don't know whether the INLINE directives are more helpful or more harmful under all uses. if they were generally helpful, i imagine ghc would do it. 11:56:56 <jmcarthur> me too 11:57:46 <conal> jmcarthur: i just found that haskell-cafe thread and added a reply. 11:58:31 <conal> jmcarthur: hoping that don, roman, etc will have some ideas in addressing that discomfort. 12:09:58 <jmcarthur> conal: apparently the real trick is that GHC will not inline functions in a function that is annotated INLINE, meaning that rewrite rules can fire on the outermost rule before firing on inner ones 12:10:30 <jmcarthur> conal: i think it would be nice if we could come up with a way for rewrite rules to affect GHC's inliner 12:10:44 <conal> jmcarthur: yeah. maybe INLINE ought to be decomposed into two sub-meanings. 12:10:45 <jmcarthur> then it would only happen when necessary 12:11:17 <jmcarthur> well, the fact that it forces that function to be inlined is also good though 12:11:33 <jmcarthur> which is apparently important across module boundaries 12:11:45 <conal> jmcarthur: maybe ghc could *never* inline functions into an inline body. and then do some caching to avoid redundant work. 12:12:00 <jmcarthur> perhaps. still leaves the cross-module inlining issue though 12:12:32 <jmcarthur> i suspect this is also architectural 12:12:44 <jmcarthur> ghc doesn't know if it will inline a function across a module boundary in advance 12:12:52 <jmcarthur> therefore it goes ahead and inlines into it 12:13:04 <jmcarthur> *inlines other functions into it 12:13:31 <conal> jmcarthur: i don't understand how module boundaries come into play 12:14:40 <jmcarthur> conal: my suspicion is that because ghc builds modules separately it can't know whether a function will be inlined in another module, so if it's not marked INLINE it feels free to inline other functions into it as it pleases 12:15:17 <jmcarthur> conal: but a function marked INLINE is clearly going to be inlined everywhere it's used, so ghc can go ahead and avoid inlining into it 12:17:07 <jmcarthur> conal: perhaps a way around it would be to avoid making inlining decisions about exported functions at all until link time 12:17:20 <jmcarthur> but that would stink a little 12:17:24 <conal> jmcarthur: maybe so. sounds like a place for improving ghc. it makes an invalid assumption: that a def not marked INLINE won't be inlined. 12:17:25 <conal> jmcarthur: instead, it could be prepared to inline elsewhere. 12:17:25 <conal> jmcarthur: i'm confused though. ghc *is* prepared to inline elsewhere. 12:18:23 <jmcarthur> conal: well, it's not that ghc makes an invalid assumption, i think. it just doesn't take certain steps unless you tell it to explicitly 12:18:58 <conal> jmcarthur: the invalid assumption i mean is that it's not worth saving an inline-and-optimze-friendly def. 12:19:43 <conal> jmcarthur: cross-module inlining is only possible because ghc stashes a definition in a .hi, iuuc. i'm suggesting that the stashed definition either (a) never include further inlinings, or (b) be augmented by such a definition. 12:20:07 <jmcarthur> conal: i agree with that suggestion, if our understanding of how it works is correct 12:20:54 <conal> jmcarthur: okay. let's take this understanding into the haskell-cafe discussion and see what evolves. okay?
- Jake _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Hi Max. Thanks much for passing on this info. Very encouraging news! -
Conal
On Wed, Mar 10, 2010 at 8:41 AM, Max Bolingbroke wrote: This is my understanding: Old story (GHC 6.12.1 (?) and below):
1) Function bodies are only optimised if they are not marked INLINE.
The assumption is that INLINE bodies will be inlined and then
optimised at the point where they are inlined.
2) Unfoldings are exported in .hi files for functions if they are
nonrecursive and "small enough" or marked INLINE New story (GHC HEAD):
1) Function bodies are optimised regardless of whether they are marked
INLINE or not.
2) Unfoldings are exported in .hi files for functions if they are
nonrecursive and "small enough" or marked INLINE, but the unfolding
that is exported is the _unoptimised_ version!
- It is important to export the unoptimised version because the
optimisation may destroy information we would have relied upon in rule
matching. E.g. the optimised version of a list producer will not use
"build", but we need to be able to spot the "build" in rules to do
foldr/build fusion. The new story is much more robust - adding INLINE pragmas should be a
much safer thing to do now, because even if the function is not
inlined you will still get an optimised definition for that function. Other relevant points:
a) There is a new flag, -fexpose-all-unfoldings, in GHC HEAD. This
does what you want - compile vector-space with this flag on and
unfoldings for everything will show up in the .hi file, with no INLINE
annotations necessary. This was intended for supercompilers (such as
the one I'm working on :-) but you can probably use for this as well.
b) GHC already treats RULES differently for the purposes of inlining.
It is very eager to inline into the arguments of a function you have
RULES for, because doing so may let it spot extra chances for the rule
to fire. Hope that helps,
Max On 10 March 2010 16:23, Jake McArthur Here's a transcript from a conversation I had with Conal on IRC. tl;dr: <conal> cross-module inlining is only possible because ghc
stashes a definition in a .hi, iuuc. i'm suggesting that the stashed
definition either (a) never include further inlinings, or (b) be
augmented by such a definition. Full transcript: 11:23:10 <conal> jmcarthur: i'm wondering what to do about INLINE
pragmas for vector-space and other libraries. i love optimizability
and clean/elegant/terse code. and i don't know how to resolve that
tension.
11:23:21 <jmcarthur> conal: yeah, me either. it's annoying
11:23:41 <jmcarthur> conal: a compiler feature to do it more
succinctly would be nice, if we can come up with one
11:23:52 <conal> jmcarthur: i'm thinking exactly the same
11:24:04 <conal> jmcarthur: a ghc flag that does what you did manually
11:24:41 <jmcarthur> conal: yeah, but we still need to do better than
inlining *all* functions. we need to be able to tell it we want it to
inline all functions satisfying some predicate or something
11:25:07 <jmcarthur> like, there's no point in forcing to inline
functions having absolutely nothing to do with vector, for example
11:25:18 <conal> jmcarthur: i wonder. ghc already has some
heuristics. do we really want anything different/unusual?
11:25:26 <jmcarthur> then again, combinators that don't inline and get
used in a vector function later might still be annoying
11:26:08 <conal> jmcarthur: maybe some kind of demand-driven mechanism
11:26:21 <jmcarthur> conal: that's what i was thinking would be best
11:26:28 <conal> jmcarthur: ie pull inlining rather than push them.
or some combo.
11:27:21 <conal> jmcarthur: i don't think this issue is specific to
either vector fusion or to the vector-space package.
11:27:28 <jmcarthur> conal: actually, this is about rewrite rules more
than inlining
11:27:40 <jmcarthur> conal: maybe if we focus on the rewrite rules we
can think of something nicer
11:27:46 <conal> jmcarthur: ah, yeah.
11:28:32 <conal> jmcarthur: have you talked with they ghc guys about
this issue? i wonder what practice they'd advise for use with the
current ghc
11:28:54 <jmcarthur> i have not
11:29:47 <conal> jmcarthur: how did the inlining/rewriting concern
arise for vector fusion and the vector-space package?
11:30:16 <jmcarthur> conal: i assume you read the email i linked to?
11:30:27 <jmcarthur> this one:
http://www.haskell.org/pipermail/haskell-cafe/2010-March/074153.html
11:30:34 <conal> jmcarthur: yes. i'l reread now.
11:31:03 <jmcarthur> conal: "in general, you have to add INLINE
pragmas in such cases if you want to be sure your code fuses. A
general-purpose mechanism for handling situations like this
automatically would be great but we haven't found a good one so far."
11:31:14 <jmcarthur> i think the most relevant line
11:31:49 <conal> jmcarthur: thx. and the difficulty (with current
ghc) is specifically cross-module, right?
11:32:00 <jmcarthur> conal: that is my understanding
11:32:10 <jmcarthur> but perhaps it is more complex
11:32:49 <conal> jmcarthur: if so, i wonder if ghc could be fixed to
inline between modules according to the same heuristics as within a
module.
11:34:36 <jmcarthur> conal: maybe.
11:35:34 <conal> jmcarthur: part of my discomfort is that i don't know
whether the INLINE directives are more helpful or more harmful under
all uses. if they were generally helpful, i imagine ghc would do it.
11:56:56 <jmcarthur> me too
11:57:46 <conal> jmcarthur: i just found that haskell-cafe thread and
added a reply.
11:58:31 <conal> jmcarthur: hoping that don, roman, etc will have some
ideas in addressing that discomfort.
12:09:58 <jmcarthur> conal: apparently the real trick is that GHC will
not inline functions in a function that is annotated INLINE, meaning
that rewrite rules can fire on the outermost rule before firing on
inner ones
12:10:30 <jmcarthur> conal: i think it would be nice if we could come
up with a way for rewrite rules to affect GHC's inliner
12:10:44 <conal> jmcarthur: yeah. maybe INLINE ought to be decomposed
into two sub-meanings.
12:10:45 <jmcarthur> then it would only happen when necessary
12:11:17 <jmcarthur> well, the fact that it forces that function to be
inlined is also good though
12:11:33 <jmcarthur> which is apparently important across module
boundaries
12:11:45 <conal> jmcarthur: maybe ghc could *never* inline functions
into an inline body. and then do some caching to avoid redundant
work.
12:12:00 <jmcarthur> perhaps. still leaves the cross-module inlining
issue though
12:12:32 <jmcarthur> i suspect this is also architectural
12:12:44 <jmcarthur> ghc doesn't know if it will inline a function
across a module boundary in advance
12:12:52 <jmcarthur> therefore it goes ahead and inlines into it
12:13:04 <jmcarthur> *inlines other functions into it
12:13:31 <conal> jmcarthur: i don't understand how module boundaries
come into play
12:14:40 <jmcarthur> conal: my suspicion is that because ghc builds
modules separately it can't know whether a function will be inlined in
another module, so if it's not marked INLINE it feels free to inline
other functions into it as it pleases
12:15:17 <jmcarthur> conal: but a function marked INLINE is clearly
going to be inlined everywhere it's used, so ghc can go ahead and
avoid inlining into it
12:17:07 <jmcarthur> conal: perhaps a way around it would be to avoid
making inlining decisions about exported functions at all until link
time
12:17:20 <jmcarthur> but that would stink a little
12:17:24 <conal> jmcarthur: maybe so. sounds like a place for
improving ghc. it makes an invalid assumption: that a def not marked
INLINE won't be inlined.
12:17:25 <conal> jmcarthur: instead, it could be prepared to inline
elsewhere.
12:17:25 <conal> jmcarthur: i'm confused though. ghc *is* prepared to
inline elsewhere.
12:18:23 <jmcarthur> conal: well, it's not that ghc makes an invalid
assumption, i think. it just doesn't take certain steps unless you
tell it to explicitly
12:18:58 <conal> jmcarthur: the invalid assumption i mean is that it's
not worth saving an inline-and-optimze-friendly def.
12:19:43 <conal> jmcarthur: cross-module inlining is only possible
because ghc stashes a definition in a .hi, iuuc. i'm suggesting that
the stashed definition either (a) never include further inlinings, or
(b) be augmented by such a definition.
12:20:07 <jmcarthur> conal: i agree with that suggestion, if our
understanding of how it works is correct
12:20:54 <conal> jmcarthur: okay. let's take this understanding into
the haskell-cafe discussion and see what evolves. okay? - Jake
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe _______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

On Mar 4, 2010, at 9:05 PM, Roman Leshchinskiy wrote:
On 05/03/2010, at 04:34, stefan kersten wrote:
i've been hunting down some performance problems in DSP code using vector and the single most important transformation seems to be throwing in INLINE pragmas for any function that uses vector combinators and is to be called from higher-level code. failing to do so seems to prevent vector operations from being fused and results in big performance hits (the good news is that the optimized code is quite competitive)....
the downside after adding the INLINE pragmas is that now some of my modules take _really_ long to compile (up to a couple of minutes); any ideas where i can start looking to bring the compilation times down again?
Alas, stream fusion (and fusion in general, I guess) requires what I would call whole loop compilation - you need to inline everything into loops. That tends to be slow. I don't know what your code looks like but you could try to control inlining a bit more. For instance, if you have something like this:
foo ... = ... map f xs ... where f x = ...
I can confirm that this is a general problem with libraries based on fusion / deforestation (having done the independent implementation of fusion in pH back in the day). No INLINE pragma? No fusion for you! That said, as Roman points out it'd be nice if when GHC discovers something is inlinable, it would inline the original definition (or perhaps the inlined, simplified, no-rules-firing version of same). The problem is that this duplicates a lot of the work of the optimizer a lot of the time.
you could tell GHC not to inline f until fairly late in the game by adding
{-# INLINE [0] f #-}
to the where clause. This helps sometimes.
Hands up if you can remember what things are legal in the braces, and what they mean. :-) I suspect I'm not the only one for whom remembering this stuff is fairly hard, in part because it doesn't come up too often. -Jan-Willem Maessen
participants (8)
-
Bryan O'Sullivan
-
Conal Elliott
-
Don Stewart
-
Jake McArthur
-
Jan-Willem Maessen
-
Max Bolingbroke
-
Roman Leshchinskiy
-
stefan kersten