Containers INLINE issue continues

Hi, we still need to find consensus on containers INLINE issue (sorry about me already deciding and pushing :(). To recapitulate, one part of the containers performance improvements was to put INLINE on nearly all Data.Map methods. That has positive impact on performance (especially for Map Int, Map Word, Map Char, etc, where the comparison is just an instruction; for fold method the speedup works for almost any type). The reason for this is that inlining allows specialization to happen. The negative effect is that the code for insert, lookup, etc. is repeated on every call site. That is not very reasonable (code bloat, I-cache misses). In the long run we could have something like 'SPECIALISABLE' pragma, which would work like c++ template specialization -- every specialization of a method would be only once in the resulting binary. That would have both speed and reasonably small code bloat. GHC 7.0 supports INLINABLE -- the method unfolding is present in the .hi file and GHC decides whether to inline or not. In the short run we need to choose some kind of compromise. What are the possibilities: a) we remove all INLINE and will mark every method INLINABLE. This keeps smallest code, but if someone needs great performance, they can persuade GHC-7.0 inliner to inline the methods from containers. b) we leave INLINE on small amount of small methods where there is a big gain, and mark the rest as INLINABLE. This is what is currently present in the repo (what I pushed without consensus). This is a compromise, some code bloat, small performance losses. The performance gains are present also on ghc-6.* family (not true with a). c) leave all methods marked INLINE. Biggest code bloat (numbers in some of my previous mails), theoretically biggest performance (theoretically I-cache misses could be larger). I think b) is good in the short run, but it is only my opinion. But a) is good idea too and I would also support it. I am flying to ICFP in an hour. I will investigate a) [what switches to give to GHC-7.0 to inline] and post the results, but it will take some time. Thank you, Milan

fox:
Hi,
we still need to find consensus on containers INLINE issue (sorry about me already deciding and pushing :().
To recapitulate, one part of the containers performance improvements was to put INLINE on nearly all Data.Map methods. That has positive impact on performance (especially for Map Int, Map Word, Map Char, etc, where the comparison is just an instruction; for fold method the speedup works for almost any type). The reason for this is that inlining allows specialization to happen.
The negative effect is that the code for insert, lookup, etc. is repeated on every call site. That is not very reasonable (code bloat, I-cache misses).
In the long run we could have something like 'SPECIALISABLE' pragma, which would work like c++ template specialization -- every specialization of a method would be only once in the resulting binary. That would have both speed and reasonably small code bloat.
GHC 7.0 supports INLINABLE -- the method unfolding is present in the .hi file and GHC decides whether to inline or not.
In the short run we need to choose some kind of compromise. What are the possibilities:
a) we remove all INLINE and will mark every method INLINABLE. This keeps smallest code, but if someone needs great performance, they can persuade GHC-7.0 inliner to inline the methods from containers.
b) we leave INLINE on small amount of small methods where there is a big gain, and mark the rest as INLINABLE.
This is what is currently present in the repo (what I pushed without consensus).
This is a compromise, some code bloat, small performance losses. The performance gains are present also on ghc-6.* family (not true with a).
c) leave all methods marked INLINE. Biggest code bloat (numbers in some of my previous mails), theoretically biggest performance (theoretically I-cache misses could be larger).
I think b) is good in the short run, but it is only my opinion. But a) is good idea too and I would also support it.
I am flying to ICFP in an hour. I will investigate a) [what switches to give to GHC-7.0 to inline] and post the results, but it will take some time.
There's no rush, and we can discuss it all at ICFP :-) -- Don

On Sat, Sep 25, 2010 at 7:17 AM, Milan Straka
To recapitulate, one part of the containers performance improvements was to put INLINE on nearly all Data.Map methods. That has positive impact on performance (especially for Map Int, Map Word, Map Char, etc, where the comparison is just an instruction; for fold method the speedup works for almost any type). The reason for this is that inlining allows specialization to happen.
Lets add benchmarks for Map Integer, Map String, and Map ByteString so we can quantify the difference in performance gains from * cheap vs expensive comparison, and * boxed vs unboxed keys.
In the long run we could have something like 'SPECIALISABLE' pragma, which would work like c++ template specialization -- every specialization of a method would be only once in the resulting binary. That would have both speed and reasonably small code bloat.
Once we have this pragma we could play some tricks with the linker to share specializations across modules e.g. if insert is specialized at Int in two modules, they could share the same code. I believe that the higher order functions, like fold, will always have to be INLINE (or INLINABLE) as the performance gain there comes from making the higher order parameter known, something specialization won't do. -- Johan

On 25/09/10 01:17, Milan Straka wrote:
Hi,
we still need to find consensus on containers INLINE issue (sorry about me already deciding and pushing :().
To recapitulate, one part of the containers performance improvements was to put INLINE on nearly all Data.Map methods. That has positive impact on performance (especially for Map Int, Map Word, Map Char, etc, where the comparison is just an instruction; for fold method the speedup works for almost any type). The reason for this is that inlining allows specialization to happen.
The negative effect is that the code for insert, lookup, etc. is repeated on every call site. That is not very reasonable (code bloat, I-cache misses).
In the long run we could have something like 'SPECIALISABLE' pragma, which would work like c++ template specialization -- every specialization of a method would be only once in the resulting binary. That would have both speed and reasonably small code bloat.
GHC 7.0 supports INLINABLE -- the method unfolding is present in the .hi file and GHC decides whether to inline or not.
In the short run we need to choose some kind of compromise. What are the possibilities:
a) we remove all INLINE and will mark every method INLINABLE. This keeps smallest code, but if someone needs great performance, they can persuade GHC-7.0 inliner to inline the methods from containers.
b) we leave INLINE on small amount of small methods where there is a big gain, and mark the rest as INLINABLE.
This is what is currently present in the repo (what I pushed without consensus).
This is a compromise, some code bloat, small performance losses. The performance gains are present also on ghc-6.* family (not true with a).
c) leave all methods marked INLINE. Biggest code bloat (numbers in some of my previous mails), theoretically biggest performance (theoretically I-cache misses could be larger).
I think b) is good in the short run, but it is only my opinion. But a) is good idea too and I would also support it.
I am flying to ICFP in an hour. I will investigate a) [what switches to give to GHC-7.0 to inline] and post the results, but it will take some time.
There's something else we need to resolve. Earlier I reported that GHC itself allocates more with the dons/tibbe containers patches - the issue seems to be worse with the current patches, to the extent that two of the GHC performance regression tests fail (T1969 and T3294). in T1969: containers-0.3, allocations: 246,402,492 containers + dons/tibbe/milan, allocations: 283,370,180 (+15%) in T3294: containers-0.3, allocations: 832,307,880 containers + dons/tibbe/milan, allocations: 950,107,000 (+14%) We'll need to isolate exactly which change(s) causes the problem and track down the cause. Cheers, Simon

On Sun, Sep 26, 2010 at 09:16:12AM -0400, Simon Marlow wrote:
There's something else we need to resolve. Earlier I reported that GHC itself allocates more with the dons/tibbe containers patches - the issue seems to be worse with the current patches, to the extent that two of the GHC performance regression tests fail (T1969 and T3294).
in T1969: containers-0.3, allocations: 246,402,492 containers + dons/tibbe/milan, allocations: 283,370,180 (+15%)
in T3294: containers-0.3, allocations: 832,307,880 containers + dons/tibbe/milan, allocations: 950,107,000 (+14%)
We'll need to isolate exactly which change(s) causes the problem and track down the cause.
I've filed http://hackage.haskell.org/trac/ghc/ticket/4342 Thanks Ian
participants (5)
-
Don Stewart
-
Ian Lynagh
-
Johan Tibell
-
Milan Straka
-
Simon Marlow