Should folds in containers package remain INLINE

Hi all, this came up when discussing increasing size of GHC binary. Currently all folds in containers package are marked as INLINE. This has following effects: 1) the code using folds can be (much) more efficient -- for example, when calling statically known function. If the unfolding of that function is available, GHC can spot strictness and unbox values -- so `foldl (+) (0::Int)` evaluated the sum strictly and without allocating space for intermediate Ints. 2) the resulting binary size increases. If the folds in containers package are not INLINEd, the stripped GHC binary shrinks by 303kB, which is 0.8% of its size. Therefore we have speed vs. code size trade-off. FYI, Prelude.foldr is always inlined, Prelude.foldl is inlined only when GHC thinks it is worth it. Simon Marlow suggested that folds could be marked INLINABLE. Then they would probably not be inlined automatically, but one could say inline foldr to inline the fold on the call sites she chooses. Personally I am a bit in favor of keeping the folds INLINE. That allows the users of containers to get best performance without any change to the code (i.e., adding explicit `inline`). The price to pay is code size increase, which I consider minor (0.8% for GHC binary). Any other thoughts? Cheers, Milan

It doesn't seem like enough of a code size reduction to justify the change in this case.
Is there any opportunity to attack this problem later in the compiler? Perhaps a CSE for similar blocks of code? I've noticed enormous reductions in size using UPX, so I know these binaries themselves are quite compressible.
Sent from my cell phone
On Apr 26, 2012, at 7:10 AM, Milan Straka
Hi all,
this came up when discussing increasing size of GHC binary.
Currently all folds in containers package are marked as INLINE. This has following effects:
1) the code using folds can be (much) more efficient -- for example, when calling statically known function. If the unfolding of that function is available, GHC can spot strictness and unbox values -- so `foldl (+) (0::Int)` evaluated the sum strictly and without allocating space for intermediate Ints.
2) the resulting binary size increases. If the folds in containers package are not INLINEd, the stripped GHC binary shrinks by 303kB, which is 0.8% of its size.
Therefore we have speed vs. code size trade-off. FYI, Prelude.foldr is always inlined, Prelude.foldl is inlined only when GHC thinks it is worth it.
Simon Marlow suggested that folds could be marked INLINABLE. Then they would probably not be inlined automatically, but one could say inline foldr to inline the fold on the call sites she chooses.
Personally I am a bit in favor of keeping the folds INLINE. That allows the users of containers to get best performance without any change to the code (i.e., adding explicit `inline`). The price to pay is code size increase, which I consider minor (0.8% for GHC binary).
Any other thoughts?
Cheers, Milan
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

I agree with Ryan, I think we should keep the INLINEs on folds. It's a
nice property of the current implementation that folds will be as fast
as handrolled recursive traversals over the data type.
On Thu, Apr 26, 2012 at 7:22 AM, Ryan Newton
It doesn't seem like enough of a code size reduction to justify the change in this case.
Is there any opportunity to attack this problem later in the compiler? Perhaps a CSE for similar blocks of code? I've noticed enormous reductions in size using UPX, so I know these binaries themselves are quite compressible.
Sent from my cell phone
On Apr 26, 2012, at 7:10 AM, Milan Straka
wrote: Hi all,
this came up when discussing increasing size of GHC binary.
Currently all folds in containers package are marked as INLINE. This has following effects:
1) the code using folds can be (much) more efficient -- for example, when calling statically known function. If the unfolding of that function is available, GHC can spot strictness and unbox values -- so `foldl (+) (0::Int)` evaluated the sum strictly and without allocating space for intermediate Ints.
2) the resulting binary size increases. If the folds in containers package are not INLINEd, the stripped GHC binary shrinks by 303kB, which is 0.8% of its size.
Therefore we have speed vs. code size trade-off. FYI, Prelude.foldr is always inlined, Prelude.foldl is inlined only when GHC thinks it is worth it.
Simon Marlow suggested that folds could be marked INLINABLE. Then they would probably not be inlined automatically, but one could say inline foldr to inline the fold on the call sites she chooses.
Personally I am a bit in favor of keeping the folds INLINE. That allows the users of containers to get best performance without any change to the code (i.e., adding explicit `inline`). The price to pay is code size increase, which I consider minor (0.8% for GHC binary).
Any other thoughts?
Cheers, Milan
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

I would like the folds to be INLINABLE. Rationale: - The 0.8% figure is really plucked out of thin air, it depends entirely on how many times we call fold. - Getting a complete copy of fold each time you call it is overkill. - INLINABLE puts the user in control of the performance/size tradeoff; with INLINE the user has no control, they always get a copy (well, they could write a NOINLINE wrapper, but that's horrible). just MHO... Cheers, Simon On 26/04/2012 18:35, Johan Tibell wrote:
I agree with Ryan, I think we should keep the INLINEs on folds. It's a nice property of the current implementation that folds will be as fast as handrolled recursive traversals over the data type.
On Thu, Apr 26, 2012 at 7:22 AM, Ryan Newton
wrote: It doesn't seem like enough of a code size reduction to justify the change in this case.
Is there any opportunity to attack this problem later in the compiler? Perhaps a CSE for similar blocks of code? I've noticed enormous reductions in size using UPX, so I know these binaries themselves are quite compressible.
Sent from my cell phone
On Apr 26, 2012, at 7:10 AM, Milan Straka
wrote: Hi all,
this came up when discussing increasing size of GHC binary.
Currently all folds in containers package are marked as INLINE. This has following effects:
1) the code using folds can be (much) more efficient -- for example, when calling statically known function. If the unfolding of that function is available, GHC can spot strictness and unbox values -- so `foldl (+) (0::Int)` evaluated the sum strictly and without allocating space for intermediate Ints.
2) the resulting binary size increases. If the folds in containers package are not INLINEd, the stripped GHC binary shrinks by 303kB, which is 0.8% of its size.
Therefore we have speed vs. code size trade-off. FYI, Prelude.foldr is always inlined, Prelude.foldl is inlined only when GHC thinks it is worth it.
Simon Marlow suggested that folds could be marked INLINABLE. Then they would probably not be inlined automatically, but one could say inline foldr to inline the fold on the call sites she chooses.
Personally I am a bit in favor of keeping the folds INLINE. That allows the users of containers to get best performance without any change to the code (i.e., adding explicit `inline`). The price to pay is code size increase, which I consider minor (0.8% for GHC binary).
Any other thoughts?
Cheers, Milan
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

On 5/2/12 7:16 AM, Simon Marlow wrote:
I would like the folds to be INLINABLE. Rationale:
- The 0.8% figure is really plucked out of thin air, it depends entirely on how many times we call fold.
- Getting a complete copy of fold each time you call it is overkill.
- INLINABLE puts the user in control of the performance/size tradeoff; with INLINE the user has no control, they always get a copy (well, they could write a NOINLINE wrapper, but that's horrible).
What exactly are the semantics of INLINABLE again? I know it means that it's like a constructor in that it's considered cheap to reevaluate, but how does that play into the whole inlining/rules/fusion/cse infrastructure? I know one thing I run into a lot, especially with folds but also with other generic functions, is that what I really want is type class specialization. That is, at compile time inline the function enough to make the type class parameters static so that the methods can be inlined back into the generic function; and then at runtime do type-based dispatch to the specialized versions just like if you were looking up the instance record or doing SPECIALIZE lookup. The goal being to remove the indirect calls in inner loops, but in a particularly restricted way since instances are known at compile time and therefore code bloat will be limited (unlike, say, function or record arguments). Since that's one of the big goals that people use the INLINE hammer for, I wonder if it would be sensible to have a pragma which specifically offers those semantics. It doesn't quite solve the fold issue since many folds use ad-hoc arguments, but it may help reduce code bloat elsewhere. -- Live well, ~wren

On Thu, May 3, 2012 at 4:31 PM, wren ng thornton
On 5/2/12 7:16 AM, Simon Marlow wrote:
I would like the folds to be INLINABLE. Rationale:
- The 0.8% figure is really plucked out of thin air, it depends entirely on how many times we call fold.
- Getting a complete copy of fold each time you call it is overkill.
- INLINABLE puts the user in control of the performance/size tradeoff; with INLINE the user has no control, they always get a copy (well, they could write a NOINLINE wrapper, but that's horrible).
What exactly are the semantics of INLINABLE again? I know it means that it's like a constructor in that it's considered cheap to reevaluate, but how does that play into the whole inlining/rules/fusion/cse infrastructure?
INLINABLE: make the core available in the .hi file so this function *may* be inlined. INLINE:: make the core available in the .hi file so this function *may* be inlined, also try to inline the function. So INLINABLE does make the function look any cheaper (i.e. not like a constructor.) It just makes sure that the functions definition is available when compiling other modules. However, GHC also specializes (at the call site) functions with type class arguments that are marked as INLINABLE (it might also specialize such functions even without the INLINABLE.) The containers package makes frequent use of INLINABLE for this reason. -- Johan

On 5/3/12 7:46 PM, Johan Tibell wrote:
On Thu, May 3, 2012 at 4:31 PM, wren ng thornton
wrote: What exactly are the semantics of INLINABLE again? I know it means that it's like a constructor in that it's considered cheap to reevaluate, but how does that play into the whole inlining/rules/fusion/cse infrastructure?
INLINABLE: make the core available in the .hi file so this function *may* be inlined. INLINE:: make the core available in the .hi file so this function *may* be inlined, also try to inline the function.
So INLINABLE does make the function look any cheaper (i.e. not like a constructor.) It just makes sure that the functions definition is available when compiling other modules.
Ah, ISTR that the goal was to make things look cheap ---to remove certain conflicts that had been necessitating INLINE--- without actually 'forcing' the inlining to happen. (Yes, I know INLINE doesn't actually guarantee inlining will happen.) So it's just making it possible to inline, rather than actually affecting any decisions about inlining then?
However, GHC also specializes (at the call site) functions with type class arguments that are marked as INLINABLE (it might also specialize such functions even without the INLINABLE.) The containers package makes frequent use of INLINABLE for this reason.
So INLINABLE does do type class specialization. Excellent. -- Live well, ~wren

| I know one thing I run into a lot, especially with folds but also with | other generic functions, is that what I really want is type class | specialization. That is, at compile time inline the function enough to | make the type class parameters static so that the methods can be inlined | back into the generic function; and then at runtime do type-based | dispatch to the specialized versions just like if you were looking up | the instance record or doing SPECIALIZE lookup. The goal being to remove | the indirect calls in inner loops, but in a particularly restricted way | since instances are known at compile time and therefore code bloat will | be limited (unlike, say, function or record arguments). Can you give a concrete example. It's hard to be certain but what you describe sounds like exactly what INLINABLE does. Simon

On 5/4/12 5:01 AM, Simon Peyton-Jones wrote:
| I know one thing I run into a lot, especially with folds but also with | other generic functions, is that what I really want is type class | specialization. That is, at compile time inline the function enough to | make the type class parameters static so that the methods can be inlined | back into the generic function; and then at runtime do type-based | dispatch to the specialized versions just like if you were looking up | the instance record or doing SPECIALIZE lookup. The goal being to remove | the indirect calls in inner loops, but in a particularly restricted way | since instances are known at compile time and therefore code bloat will | be limited (unlike, say, function or record arguments).
Can you give a concrete example. It's hard to be certain but what you describe sounds like exactly what INLINABLE does.
For a (semi)concrete example, say we have an implementation of the forward--backward algorithm. The algorithm is O(n^3) but those are some nice tight loops. The algorithm can be parametrized by your choice of semiring. So we make a type class for semirings in order to make the algorithm generic, but we want the algorithm to be specialized on the type class rather than performing indirect calls in the middle of O(n^3). The (first-order) forward algorithm is defined by the recurrence: alpha :: Position -> State -> Probability alpha 0 s0 = prior s0 alpha j sj = let i = j-1 in sum [ alpha i si * p si sj | si <- states i ] where prior and p are probability models, states gives us all the possible states at that position, the sum and (*) are actually for the semiring of choice, and where we actually store everything in a table in order to do dynamic programming. The backward algorithm is defined similarly: beta :: Position -> State -> Probability beta T sT = coprior sT beta i si = let j = i+1 in sum [ p si sj * beta j sj | sj <- states j ] where T is some known constant, namely the maximum position. For efficiency reasons, the actual code is a good deal more complicated than the above recurrences, but the basic idea is the same. We don't particularly want to specialize on the parameters (prior, p, states,...) since those are defined dynamically, but we do want to specialize on the semiring since the set of semirings we'll care about is fixed at compile time. Given what Johan Tibell said, I think INLINABLE will in fact do this; but his description of the pragma differs from what I recalled of back when we came up with it. Hence the question. -- Live well, ~wren

| > Can you give a concrete example. It's hard to be certain but what you | describe sounds like exactly what INLINABLE does. Your example didn't use any type classes, so GHC won't specialise it. (Specialising on *value* parameters is another story.) But your words mention type classes, so maybe it will. It's hard to tell. Maybe you can try it and see. Simon | -----Original Message----- | From: libraries-bounces@haskell.org [mailto:libraries- | bounces@haskell.org] On Behalf Of wren ng thornton | Sent: 06 May 2012 01:31 | To: libraries@haskell.org | Subject: Re: Should folds in containers package remain INLINE | | On 5/4/12 5:01 AM, Simon Peyton-Jones wrote: | > | > | I know one thing I run into a lot, especially with folds but also | > | with other generic functions, is that what I really want is type | > | class specialization. That is, at compile time inline the function | > | enough to make the type class parameters static so that the methods | > | can be inlined back into the generic function; and then at runtime | > | do type-based dispatch to the specialized versions just like if you | > | were looking up the instance record or doing SPECIALIZE lookup. The | > | goal being to remove the indirect calls in inner loops, but in a | > | particularly restricted way since instances are known at compile | > | time and therefore code bloat will be limited (unlike, say, function | or record arguments). | > | > Can you give a concrete example. It's hard to be certain but what you | describe sounds like exactly what INLINABLE does. | | For a (semi)concrete example, say we have an implementation of the | forward--backward algorithm. The algorithm is O(n^3) but those are some | nice tight loops. The algorithm can be parametrized by your choice of | semiring. So we make a type class for semirings in order to make the | algorithm generic, but we want the algorithm to be specialized on the | type class rather than performing indirect calls in the middle of | O(n^3). | | The (first-order) forward algorithm is defined by the recurrence: | | alpha :: Position -> State -> Probability | alpha 0 s0 = prior s0 | alpha j sj = | let i = j-1 in | sum [ alpha i si * p si sj | si <- states i ] | | where prior and p are probability models, states gives us all the | possible states at that position, the sum and (*) are actually for the | semiring of choice, and where we actually store everything in a table in | order to do dynamic programming. | | The backward algorithm is defined similarly: | | beta :: Position -> State -> Probability | beta T sT = coprior sT | beta i si = | let j = i+1 in | sum [ p si sj * beta j sj | sj <- states j ] | | where T is some known constant, namely the maximum position. | | For efficiency reasons, the actual code is a good deal more complicated | than the above recurrences, but the basic idea is the same. We don't | particularly want to specialize on the parameters (prior, p, states,...) | since those are defined dynamically, but we do want to specialize on the | semiring since the set of semirings we'll care about is fixed at compile | time. | | Given what Johan Tibell said, I think INLINABLE will in fact do this; | but his description of the pragma differs from what I recalled of back | when we came up with it. Hence the question. | | -- | Live well, | ~wren | | _______________________________________________ | Libraries mailing list | Libraries@haskell.org | http://www.haskell.org/mailman/listinfo/libraries

On 5/7/12 4:02 AM, Simon Peyton-Jones wrote:
|> Can you give a concrete example. It's hard to be certain but what you |> describe sounds like exactly what INLINABLE does.
Your example didn't use any type classes, so GHC won't specialise it.
In order to give a sense of why it matters I only presented the general recurrence rather than the actual Haskell code; in the recurrence the type class is left implicit, but covers (Probability,(*),sum). The actual type of the forward algorithm in the Haskell code is: forward :: ( Enum i , MapLike i map_i , MapLike ts map_ts , SlidingWindow t ts , ExtendedSemiring t sr sr0 ) => map_ts Prob -- ^ Prior probabilities: @Pr( t_{1-k}^0 )@ -> (t -> ts -> Prob) -- ^ Transition probabilities: @Pr( t_j | t_{j-k}^{j-1} )@ -> (w -> t -> Prob) -- ^ Emission probabilities: @Pr( w_j | t_j )@ -> (w -> [t]) -- ^ A tag dictionary for all words -> [w] -- ^ The sentence to be tagged: @w_1^N@ -> (i, map_i (map_ts sr)) -- ^ The final index and table: @(N, alpha)@ The first four arguments are passed in together, but are dynamically defined; and the resulting function will be called on multiple [w]. We will almost certainly satisfy: i ~ Nat -- a newtype of Int map_i ~ NatMap -- a newtype of IntMap; fundep defines i t ~ ID Tag -- a newtype of Int w ~ ID Word -- a newtype of Int and so should definitely specialize on them. That part can be handled by SPECIALIZE since it's well-known. But the important things are the types which are left abstract but which we want to specialize on: ts -- some n-tuple of @t@ for unknown @n@ map_ts -- fundep defines ts sr -- fundep defined by sr0 sr0 -- the semiring-like structure -- Live well, ~wren

On 4/26/12 10:22 AM, Ryan Newton wrote:
It doesn't seem like enough of a code size reduction to justify the change in this case.
Is there any opportunity to attack this problem later in the compiler? Perhaps a CSE for similar blocks of code? I've noticed enormous reductions in size using UPX, so I know these binaries themselves are quite compressible.
I agree. The reduction doesn't seem large enough to justify a change. Though it definitely seems like we should keep an eye out for ways of doing CSE, or similar, in order to remove redundancies in inlining. -- Live well, ~wren

On Thu, Apr 26, 2012 at 8:10 AM, Milan Straka
2) the resulting binary size increases. If the folds in containers package are not INLINEd, the stripped GHC binary shrinks by 303kB, which is 0.8% of its size.
At first I thought you said that GHC binary shrinked to 303 KiB, which would be quite a feat! =) Cheers, -- Felipe.
participants (7)
-
Felipe Almeida Lessa
-
Johan Tibell
-
Milan Straka
-
Ryan Newton
-
Simon Marlow
-
Simon Peyton-Jones
-
wren ng thornton