Summary of containers patches

Hi Simon,

as you probably know, there are several containers patches dealing with performance, a separate testsuite and benchmarking. Specifically, I mean tickets:
http://hackage.haskell.org/trac/ghc/ticket/4277
http://hackage.haskell.org/trac/ghc/ticket/4279
http://hackage.haskell.org/trac/ghc/ticket/4280
http://hackage.haskell.org/trac/ghc/ticket/4311
http://hackage.haskell.org/trac/ghc/ticket/4312
http://hackage.haskell.org/trac/ghc/ticket/4333

I have all of them in the following repository: http://fox.auryn.cz/darcs/containers/

There are no user-visible changes, just performance improvements and the split testsuite. There have been no comments for some time, so I would like to push them to containers.

There is just the code-bloat issue to solve: the excessive use of the INLINE pragma causes large code bloat. I am aware of it; it happened even when I was benchmarking the containers. Personally I vote for:
- changing INLINE to INLINABLE,
- leaving INLINE just on
  - the member/lookup functions,
  - the insert* functions,
as these benefit a lot from specialization and do not cause such bloat (measured during the internship; I do not recall the numbers now). I suspect performance will drop a bit, but I think it is better to have slightly slower containers without such code bloat.

Simon, I will be flying to ICFP soon, but if you want, I can prepare the INLINE->INLINABLE patches and put them in the repo.

Cheers,
Milan
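The difference between the two pragmas being debated can be sketched on a cut-down map type (a hypothetical stand-in, not the real Data.Map, which is more involved):

```haskell
-- A cut-down binary-search-tree map, standing in for Data.Map.
data Map k a = Tip | Bin !k a !(Map k a) !(Map k a)

-- The patches under discussion use INLINE, which copies the whole body
-- into every call site (specializing away the Ord dictionary, but
-- duplicating code per use):
--   {-# INLINE member #-}
-- The proposal is to use INLINABLE instead, which only records the body
-- in the interface file so GHC may specialize it at call sites without
-- unconditionally duplicating it:
{-# INLINABLE member #-}
member :: Ord k => k -> Map k a -> Bool
member _ Tip = False
member k (Bin kx _ l r) = case compare k kx of
  LT -> member k l
  GT -> member k r
  EQ -> True
```

Either pragma leaves the semantics unchanged; only the generated code (and hence binary size and speed) differs.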

On 23/09/2010 14:25, Milan Straka wrote:
Simon, I will be flying to ICFP soon, but if you want, I can prepare the INLINE->INLINABLE patches and put them in the repo.
That would help a lot, thanks. I started on this today but I'm a bit overloaded, and getting criterion installed wasn't very smooth (a couple of package breakages, but nothing serious and patches are upstream).

Cheers,
Simon

That would help a lot, thanks. I started on this today but I'm a bit overloaded, and getting criterion installed wasn't very smooth (a couple of package breakages, but nothing serious and patches are upstream).
Great, I will look into it. Is it alright if I push the rest of the performance patches first? (The tickets I listed here.) That would make it much easier.

Cheers,
Milan

PS: I have ssh keys, so I am able to push.

On 23/09/2010 17:54, Bryan O'Sullivan wrote:
On Thu, Sep 23, 2010 at 7:09 AM, Simon Marlow <marlowsd@gmail.com> wrote:
I started on this today but I'm a bit overloaded, and getting criterion installed wasn't very smooth (a couple of package breakages, but nothing serious and patches are upstream).
Anything pertinent to moi?
No, syb and vector-algorithms broke. The former already has a patch upstream (it needs an updated dependency on base), and the latter had a type error related to the new typechecker, I think because it relied on impredicativity.

Cheers,
Simon

marlowsd:
On 23/09/2010 14:25, Milan Straka wrote:
I suspect the performance to drop a bit, but I think it is better to have a bit slower containers without such a code bloat.
I'm not sure I agree with this. Any slowdown is not welcome. Can you quantify the impact of the code size? What does it harm?

-- Don

dons:
I'm not sure I agree with this. Any slowdown is not welcome.
Can you quantify the impact of code size? What does it harm?
No worries, I saw the INLINEABLE thread. Looks good. I'm not happy with slowdowns, though.

-- Don

Hi Simon, others,

here come the results. I consider 4 variants of containers:
- inline: the current fox.auryn.cz/darcs/containers with many INLINE pragmas
- nothing: all INLINE pragmas deleted
- inlinable: all INLINE pragmas replaced with INLINABLE
- mix: all INLINE pragmas replaced with INLINABLE, but the lookup*, member* and insert* functions are kept INLINE

First, code bloat. I give the sizes of the binaries in bytes; all binaries are stripped, i386.

* map-properties (binary running the tests on Data.Map; uses many functions from Data.Map):
  - inline:    5266740
  - nothing:   2325940
  - inlinable: 2348116
  - mix:       2571156
  That is a 2.9MB increase in size from nothing to inline.

* Map.hs (benchmark of Data.Map, using only several functions):
  - inline:    24465452
  - nothing:   24302348
  - inlinable: 24326924
  - mix:       24417100
  Only 160kB from nothing to inline.

More code bloat. This time I compiled GHC itself:
- ghc-7.0 (containers with INLINE pragmas): 20723280B
- ghc-7.0-noinlinemap (containers without INLINE pragmas): 19965464B
That is a 740kB difference between the ghc-7.0 binaries. (There is a 1082kB difference between the libHSghc-7.0.0.a libraries.)

More code bloat, this time the Map.hs benchmark again. All packages needed for the benchmark (most notably criterion) are compiled:
- against a containers package with INLINE pragmas:    24465452B
- against a containers package without INLINE pragmas: 23753808B
That is a 700kB difference _on one binary_.

So the code bloat is considerable.

Now the timing results. The results are the change in runtime in %.
[1] speedup between inline and nothing (negative numbers mean a slowdown)
[2] speedup between inline and inlinable
[3] speedup between inline and mix
(the results are +- several %, I did not have time to do it better)

                                  [1]     [2]     [3]
  alter                        -50.33  -49.17  -47.51
  delete                       -46.67  -50.93  -48.79
  difference                     1.93    0.18    6.33
  insert                       -44.11  -44.25   -3.09
  insertLookupWithKey empty    -14.62  -15.70    1.22
  insertLookupWithKey update   -21.54  -19.23    6.20
  insertLookupWithKey' empty   -16.79  -23.67   -2.82
  insertLookupWithKey' update  -26.62  -26.99    0.47
  insertWith empty             -44.42  -48.29   -5.88
  insertWith update            -43.28  -45.03    2.50
  insertWith' empty            -45.30  -46.66   -3.45
  insertWith' update           -38.15  -38.27    3.94
  insertWithKey empty          -42.60  -45.58   -6.99
  insertWithKey update         -40.13  -36.62   -1.71
  insertWithKey' empty         -46.54  -46.53   -4.22
  insertWithKey' update        -43.10  -42.16    0.98
  intersection                   1.42   -1.75    0.37
  lookup                       -94.03  -94.62    6.14
  lookupIndex                  -86.25  -86.92    4.91
  union                         -0.04   -0.87    2.12
  update                       -50.64  -55.27  -54.17
  updateLookupWithKey          -43.90  -43.31  -49.22

So INLINABLE does not help at all. I in fact expected this outcome: it is not the inlining itself that is beneficial, it is the _specialization_ (which happens during inlining if possible).

Please bear in mind that the speedup caused by specialization is significant only when the Ord compare method is trivial (if it compiles to an instruction like <=, the difference between an instruction and a dynamic dictionary call is very big). That means things like Map Int, Map Word, Map Int8 and so on. But with these types one usually uses IntMap. What I am saying is that the usages of Map which would benefit from specialization may not be very common.

At this point I suggest the following:
- As an immediate solution, remove all INLINE pragmas from the containers package, leaving INLINE just on the lookup*, member* and insert* functions. This causes small code bloat and quite big wins on these functions, which I consider to be very frequent.
  (I could also imagine removing INLINE from the insert* functions, or adding INLINE to the delete*, update* and alter* functions.)
- In the long term, find a way to get back the speedup without the code bloat. The problem is that _every_ call site of lookup gets its own copy of the code in the binary. We need to share the code of specialized functions (like C++ templates do). I mean the following: before linking, find out which dictionaries are passed to the Data.Map.lookup method, specialize Data.Map.lookup for these types (only once per type), and redirect the calls to lookup with specific dictionaries to the specialized versions. The effect should be as if the user wrote SPECIALIZE for each method, only for the types at which it is non-generically used.
- I would not bother with INLINABLE currently.

Simon, others, do you agree with my short-term solution? If you do, I will push the patches (I need to do it before I fly to ICFP, which will be in 30 hours).

Cheers,
Milan
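The link-time scheme Milan sketches is not something GHC implements, but a library author can already get one shared specialised copy per key type with a SPECIALIZE pragma on an INLINABLE definition. A minimal sketch, using a hypothetical lookup' on a cut-down map type rather than the real Data.Map.lookup:

```haskell
-- A cut-down binary-search-tree map, standing in for Data.Map.
data Map k a = Tip | Bin !k a !(Map k a) !(Map k a)

{-# INLINABLE lookup' #-}
-- One shared monomorphic copy for Int keys, emitted once in this
-- module, instead of a fresh copy at every call site as INLINE would
-- produce:
{-# SPECIALIZE lookup' :: Int -> Map Int a -> Maybe a #-}
lookup' :: Ord k => k -> Map k a -> Maybe a
lookup' _ Tip = Nothing
lookup' k (Bin kx x l r) = case compare k kx of
  LT -> lookup' k l
  GT -> lookup' k r
  EQ -> Just x
```

The difference from Milan's proposal is that here the library author must list the types up front, whereas his scheme would discover the dictionaries actually used before linking.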

If I am reading this right, the numbers for alter/delete/update are rather unsettling under your proposal.
Is there a compromise point that does not cause those to suffer so much?
Because otherwise, doubling the object code size doesn't seem like such a terrible burden for the significant speed benefits we currently see in those areas.
-Edward
On Sep 23, 2010, at 8:22 PM, Milan Straka wrote:
_______________________________________________
Libraries mailing list
Libraries@haskell.org
http://www.haskell.org/mailman/listinfo/libraries

On Thu, Sep 23, 2010 at 09:12:21PM -0400, Edward Kmett wrote:
If I am reading this right, the numbers for alter/delete/update are rather unsettling in your proposal.
Is there a compromise point that does not cause those to suffer so much?
It's really late to be doing research for the GHC 7.0 release. We're just trying to get the builds to go through so we can get the RC out.

Simon Marlow wrote:
So what to do for GHC 7.0. The options are:
- make the INLINABLE changes, re-do the measurements to make sure performance hasn't got worse, and push the patches.
- roll back the first set of patches.
Milan Straka wrote:
So, the INLINABLE does not help at all.
The safest option is probably to roll back. There weren't any INLINE pragmas in the containers that came with 6.12, so I believe this is unlikely to be a regression, even with the changes to the inliner. I don't think we need to roll back the other parts of the patches, though; just remove the pragmas. Then we can do the research and make a less rushed decision for the HEAD (and perhaps 7.0.2).

Thanks
Ian

fox:
So, the INLINABLE does not help at all.
I in fact expected this outcome -- it is not inlining itself that is beneficial, it is the _specialization_ (which happens during inlining if possible).
Please bear in mind that the speedup caused by specialization is significant only in the case when the Ord.compare function is trivial (if it is an instruction like <=, the result between instruction call and a dynamic dictionary call is very big). That means things like Map Int, Map Word, Map Int8 and so on. But with these types one usually uses IntMap. What I am saying is that the usage of Map which would benefit from specialization may not be very common.
50%. Huh. Do you have enough time to look on Hackage to see at which types Map is used most frequently? I suspect Map Int is actually pretty common.
Simon, others, do you agree with my short-term solution? If you do, I will push the patches (I need to do it before I fly to ICFP, which will be in 30 hours).
Well, I still don't understand exactly what the penalty of the code bloat is. Making things a lot slower is clearly bad; making the on-disk size a bit bigger, when my hard drive is already 8x bigger than 3 years ago, doesn't seem so bad.

-- Don

On Thu, Sep 23, 2010 at 6:17 PM, Don Stewart wrote:
Well, I still don't understand what penalty the code bloat is exactly.
It's going to push I-cache misses through the roof when there are umpteen identical versions of the same function over Map ByteString Foo. Also, some of us would like to ship applications written in Haskell, and network bandwidth is a big concern there, having gone nowhere in years.

bos:
On Thu, Sep 23, 2010 at 6:17 PM, Don Stewart wrote:
Well, I still don't understand what penalty the code bloat is exactly.
It's going to push I-cache misses through the roof when there are umpteen identical versions of the same function over Map ByteString Foo. Also, some of us would like to ship applications written in Haskell, and network bandwidth is a big concern there, having gone nowhere in years.
Reducing binary sizes via dynamic linking is going to make a lot more of an impact than additional unrolling. However, we have *no* other techniques for making containers faster. Perhaps a containers-inline fork is needed for those who still need the speed.

-- Don

On Fri, Sep 24, 2010 at 6:26 AM, Don Stewart wrote:
Perhaps a containers-inline fork is needed, for those who still need the speed.
Would it be possible to use CPP to turn the INLINE flags into a compile-time argument, i.e.:

#ifdef INLINE
{-# INLINE #-}
#endif

I'd hate to start seeing incompatible Data.Map.Maps floating around.

Michael
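Michael's suggestion could look roughly like this; the USE_INLINE macro and the surrounding definitions are hypothetical, not anything containers actually provides:

```haskell
{-# LANGUAGE CPP #-}

-- A cut-down map type standing in for Data.Map.
data Map k a = Tip | Bin !k a !(Map k a) !(Map k a)

-- Hypothetical compile-time switch: build with -DUSE_INLINE
-- (e.g. via a Cabal flag setting cpp-options: -DUSE_INLINE)
-- to get the aggressive inlining, otherwise fall back to INLINABLE.
#ifdef USE_INLINE
{-# INLINE member #-}
#else
{-# INLINABLE member #-}
#endif
member :: Ord k => k -> Map k a -> Bool
member _ Tip = False
member k (Bin kx _ l r) = case compare k kx of
  LT -> member k l
  GT -> member k r
  EQ -> True
```

Since both branches compile the same definitions, this avoids the incompatible-Map problem a true fork would have; only code size and speed differ between the two settings.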

On 24 September 2010 14:36, Michael Snoyman wrote:
On Fri, Sep 24, 2010 at 6:26 AM, Don Stewart wrote:
Perhaps a containers-inline fork is needed, for those who still need the speed.
Would it be possible to use CPP to turn the INLINE flags into a compile-time argument, ie:
#ifdef INLINE
{-# INLINE #-}
#endif
Since containers ships with GHC, wouldn't this then require an extra flag being used when building GHC to enable this? And then to use it, you'd have to build your own GHC rather than using a pre-built binary like just about everyone does...
I'd hate to start seeing incompatible Data.Map.Maps floating around.
Agreed. At the very least, if there was a fork it would presumably have to be in a different module namespace to avoid name collisions, which would make the incompatibility obvious.

-- Ivan Lazar Miljenovic
Ivan.Miljenovic@gmail.com
IvanMiljenovic.wordpress.com

ivan.miljenovic:
Agreed. At the very least if there was a fork it would presumably have to be in a different module namespace to avoid namespace collisions, which would make the incompatability obvious.
We're talking about a 3% increase in the size of the Map code, and a 2% increase in the size of the Map.hs benchmark binary, right? For a 50% increase in Map function performance.

-- Don

On 24 September 2010 14:58, Don Stewart wrote:
We're talking about a 3% increase in the size of the Map, a 2% size in the Map.hs benchmark binary, right?
For a 50% increase in Map function performance.
I'm not arguing against it; I'm just questioning whether or not a fork would be beneficial.

-- Ivan Lazar Miljenovic
Ivan.Miljenovic@gmail.com
IvanMiljenovic.wordpress.com

ivan.miljenovic:
I'm not arguing against it; I'm just arguing against whether or not a fork would be beneficial.
Indeed. I'm surprised we're actually considering rolling back Map specialization, and the big improvements it gives us, for what seems to be a small percentage increase in binary size. Recall Johan's recent user survey: people are asking for more performance, and more reliable performance, not for shaving binary sizes.

-- Don

On 24 September 2010 06:19, Don Stewart wrote:
ivan.miljenovic:
On 24 September 2010 14:58, Don Stewart
wrote: ivan.miljenovic:
On 24 September 2010 14:36, Michael Snoyman
wrote: On Fri, Sep 24, 2010 at 6:26 AM, Don Stewart
wrote: Perhaps a containers-inline fork is needed, for those who still need the speed.
Would it be possible to use CPP to turn the INLINE flags into a compile-time argument, ie:
#ifdef INLINE
{-# INLINE #-}
#endif
Since containers ships with GHC, wouldn't this then require an extra flag being used when building GHC to enable this?
And then to use it, you'd have to build your own GHC rather than using a pre-built binary like just about everyone does...
I'd hate to start seeing incompatible Data.Map.Maps floating around.
Agreed. At the very least, if there was a fork it would presumably have to be in a different module namespace to avoid namespace collisions, which would make the incompatibility obvious.
We're talking about a 3% increase in the size of the Map, a 2% increase in the size of the Map.hs benchmark binary, right?
For a 50% increase in Map function performance.
I'm not arguing against it; I'm just questioning whether a fork would be beneficial.
Indeed. I'm surprised we're actually considering rolling back Map specialization -- and the big improvements that gives us -- for what seems to be a small percent in binary size.
Recall Johan's recent user survey: people are asking for more performance, and more reliable performance, not shaving binary sizes.
That would be true if we actually had numbers on how the changes affect programs in the wild. Your performance claims are based on microbenchmarks. Who knows whether code bloat will offset the performance gains on any "real" benchmark. One GHC object file *tripled* in size, for example. This is probably because it's simply wrapping Map, but it's still not clear how it may affect other programs. All I'm saying, be careful with these numbers.

/ Thomas
-- Push the envelope. Watch it bend.

On 24/09/2010 09:41, Thomas Schilling wrote:
On 24 September 2010 06:19, Don Stewart
wrote: ivan.miljenovic:
On 24 September 2010 14:58, Don Stewart
wrote: ivan.miljenovic:
On 24 September 2010 14:36, Michael Snoyman
wrote: On Fri, Sep 24, 2010 at 6:26 AM, Don Stewart
wrote: Perhaps a containers-inline fork is needed, for those who still need the speed. Would it be possible to use CPP to turn the INLINE flags into a compile-time argument, ie:
#ifdef INLINE
{-# INLINE #-}
#endif
Since containers ships with GHC, wouldn't this then require an extra flag being used when building GHC to enable this?
And then to use it, you'd have to build your own GHC rather than using a pre-built binary like just about everyone does...
I'd hate to start seeing incompatible Data.Map.Maps floating around.
Agreed. At the very least, if there was a fork it would presumably have to be in a different module namespace to avoid namespace collisions, which would make the incompatibility obvious.
We're talking about a 3% increase in the size of the Map, a 2% increase in the size of the Map.hs benchmark binary, right?
For a 50% increase in Map function performance.
I'm not arguing against it; I'm just questioning whether a fork would be beneficial.
Indeed. I'm surprised we're actually considering rolling back Map specialization -- and the big improvements that gives us -- for what seems to be a small percent in binary size.
Recall Johan's recent user survey: people are asking for more performance, and more reliable performance, not shaving binary sizes.
That would be true if we actually had numbers on how the changes affect programs in the wild. Your performance claims are based on micro benchmarks. Who knows whether code bloat will offset the performance gains on any "real" benchmark. One GHC object file *tripled* in size, for example. This is probably because it's simply wrapping Map, but it's still not clear how it may affect other programs.
It's not a wrapper for Map, in fact. The module is here: http://darcs.haskell.org/ghc/compiler/cmm/CmmStackLayout.hs it's just a client module that happens to call a lot of Map functions. The keys aren't Ints (or at least, most of them aren't). We didn't see any difference in GHC's performance with the first round of Map changes; in fact GHC allocated a bit more afterward. We didn't investigate any further, and of course that's just one data point. The difference wasn't significant enough to warrant backing off.
All I'm saying, be careful with these numbers.
+1 Cheers, Simon

Thanks to Milan for gathering numbers. Several things to say.

Size and performance

| We're talking about a 3% increase in the size of the Map, a 2% increase in
| the size of the Map.hs benchmark binary, right?
| For a 50% increase in Map function performance.

No, I don't think so. With lots of INLINE, you get a complete copy of the lookup code at *every call site*. I don't care about the size of Map.o. But I believe that we saw a major jump in some GHC modules' binary size with the new containers thing (like 300). I don't think we saw anything like a comparable increase in speed. I think Ian will have the numbers.

Specialisation

One of the things that INLINABLE gives is that client modules can now see the full code for a function, even if it's recursive. So it is now (at last) relatively straightforward to generate specialisations for INLINABLE functions in client modules. Here's how it'll go.

Currently the specialiser (it focuses only on *overloaded* functions):
- Looks for all calls to overloaded functions f, say (f Int d), where d is a dictionary.
- Makes a copy of f, specialises it to the dictionary arguments passed to that call, and turns that into a new definition f_spec = <specialised rhs>.
- Adds a RULE f Int d = f_spec.
- The RHS of f might call other overloaded functions, say g, so when specialising f we might generate new calls to g. So we specialise them too, in a cascade.

At the moment, this only happens for *locally-defined* functions. But now we have INLINABLE I am going to make it happen for *imported, INLINABLE* functions too. So if you call (lookup Int d), we'll get a local specialised copy of lookup, and a RULE for lookup. Handily, the rule will be exposed to clients of *this* module, so they'll see the specialisation too.

I think I'll also arrange that you can give a SPECIALISE pragma in a client module:
  {-# SPECIALISE lookup :: Map Int a -> Int -> Maybe a #-}

None of this will be in GHC 7.0.
I defer to others about whether it could be in a patch release or would await 7.2.

Simon
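As a small concrete sketch of the scheme Simon describes above (all names below are made up for illustration; this is not the actual containers code): an overloaded, recursive function marked INLINABLE keeps its full right-hand side in the interface file, and a client module could then get, or request, a monomorphic copy:

```haskell
module Main where

-- A stand-in for a recursive overloaded library function like lookup.
-- INLINABLE exposes its full RHS to client modules, so the specialiser
-- can build type-specific copies of it over there.
memberSorted :: Ord k => k -> [k] -> Bool
memberSorted _ [] = False
memberSorted x (y:ys) = case compare x y of
  LT -> False        -- the list is sorted, so we can stop early
  EQ -> True
  GT -> memberSorted x ys
{-# INLINABLE memberSorted #-}

-- Under the proposal, a *client* module could write
--   {-# SPECIALISE memberSorted :: Int -> [Int] -> Bool #-}
-- and GHC would generate a specialised copy plus a rewrite RULE
-- (roughly: memberSorted @Int d ==> memberSorted_spec), with the RULE
-- visible to that module's own clients in turn.

main :: IO ()
main = print (memberSorted (3 :: Int) [1, 2, 3, 4])
```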

On Fri, Sep 24, 2010 at 10:43 AM, Simon Peyton-Jones
Specialisation
One of the things that INLINABLE gives is that client modules can now see the full code for a function, even if it's recursive. So it is now (at last) relatively straightforward to generate specialisations for INLINABE functions in client modules. Here's how it'll go.
Currently the specialiser (it focuses only on *overloaded* functions)
With inlining, we get speedups for (parametric) polymorphic functions as well, some of it due to the unboxing. Perhaps we could extend the set of functions that are candidates for specialization. If it's hard to decide in general which functions might benefit from specialization, we can introduce a SPECIALIZABLE pragma that *applies at all types*: it acts like an INLINABLE pragma but makes the function a candidate for specialization, even if it's not an *overloaded* function.

> I think I'll also arrange that you can give a SPECIALISE pragma in a client
> module: {-# SPECIALISE lookup :: Map Int a -> Int -> Maybe a #-}
I don't know if I want to litter client modules with SPECIALIZE pragmas, I think they should go in the library and not at the call site. For example, in C++ you get client module specialization without any annotations at the call site. We should be able to get by without having to have the client module care about these things. Cheers, Johan

> With inlining, we get speedups for (parametric) polymorphic functions as
> well, some of it due to the unboxing.

Can you give some examples to back up this claim? I am skeptical.

> I don't know if I want to litter client modules with SPECIALIZE pragmas, I
> think they should go in the library and not at the call site. For example,
> in C++ you get client module specialization without any annotations at the
> call site. We should be able to get by without having to have the client
> module care about these things.

That's what the auto-specialisation I described does. You can't write a SPECIALISE pragma in the library for a type that is defined by the client. This is very very common.

Simon

On Fri, Sep 24, 2010 at 12:58 PM, Simon Peyton-Jones
Can you give some examples to back up this claim? I am skeptical
I can look into it again after ICFP. I did spend some time trying to tease apart the performance improvements we got from inlining foldl'. From what I remember, the biggest gain from inlining is that the higher-order argument is known, but there's some additional gain from unboxing the accumulator (and some additional gain from inlining simple higher-order arguments like (+) into the body of foldl').
That’s what the auto-specialisation I described does.
You can't write a SPECIALISE pragma in the library for a type that is defined by the client. This is very very common.
Right, I think that's the reason we rely on INLINE pragmas so much. Inlining gives us the effect of specialization, but at types not known to the library author. I don't think we really want to force inlining, but want to say something like this:

  foldl' :: (a -> b -> a) -> a -> [b] -> a
  foldl' = ...
  {-# SPECIALIZABLE foldl' #-}

and then have foldl' be specialized in the client module at whatever type it's called at, but only once for each type (instead of once per call site).

-- Johan
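The SPECIALIZABLE pragma above is hypothetical. For *overloaded* functions, the nearest thing available today is INLINABLE plus an explicit library-side SPECIALISE at the types the author can predict; the function below is a made-up example, not library code:

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- A strict left fold constrained to Num, so it is overloaded and hence
-- a legal target for SPECIALISE. The pragma produces one Int copy per
-- program, not one copy per call site as a forced INLINE would.
sumWith :: Num a => (a -> a -> a) -> a -> [a] -> a
sumWith f = go
  where
    go !acc []     = acc
    go !acc (x:xs) = go (f acc x) xs
{-# INLINABLE sumWith #-}
{-# SPECIALISE sumWith :: (Int -> Int -> Int) -> Int -> [Int] -> Int #-}

main :: IO ()
main = print (sumWith (+) 0 [1 .. 10 :: Int])
```

This still leaves Johan's complaint standing: the library author has to guess the types, which is exactly what SPECIALIZABLE would avoid.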

[beautiful quote marks, thank you]

| > Can you give some examples to back up this claim? I am skeptical
|
| I can look into it again after ICFP.
|
| I did spend some time trying to tease apart the performance
| improvements we got from inlining foldl'. From what I remember, the
| biggest gain from inlining is that the higher-order argument is known,

Ah yes! But that is *quite different* from specialising only on the *type*!

| foldl' :: (a -> b -> a) -> a -> [b] -> a
| foldl' = ...
| {-# SPECIALIZABLE foldl' #-}
|
| and then have foldl' be specialized in the client module at whatever
| type it's called at, but only once for each type (instead of once per
| call site).

For overloaded functions, that is precisely what I'm proposing. For foldl' I think you will get no benefit from specialising on the type; only from specialising on the actual function passed. So in that case inlining probably really is the answer.

Simon

On Fri, Sep 24, 2010 at 1:35 PM, Simon Peyton-Jones
| > Can you give some examples to back up this claim? I am skeptical
|
| I can look into it again after ICFP.
|
| I did spend some time trying to tease apart the performance
| improvements we got from inlining foldl'. From what I remember, the
| biggest gain from inlining is that the higher-order argument is known,
Ah yes! But that is *quite different* from specialising only on the *type*!
I know. I was just trying to be transparent by saying that *most* of the speedup from inlining foldl' is not from unboxing. If I remember correctly there was still *some* speedup from unboxing. That's all.
| foldl' :: (a -> b -> a) -> a -> [b] -> a
| foldl' = ...
| {-# SPECIALIZABLE foldl' #-}
|
| and then have foldl' be specialized in the client module at whatever
| type it's called at, but only once for each type (instead of once per
| call site).
For overloaded functions, that is precisely what I'm proposing.
For foldl' I think you will get no benefit from specialising on the type; only from specialising on the actual function passed. So in that case inlining probably really is the answer.
I used foldl' as an example. My question is: are there any kinds of functions, other than overloaded functions, that would benefit from specialization? If the answer is yes, can we apply the same heuristic you use for overloaded functions to specialize those kinds of functions? If we cannot automatically detect all the cases where specializing would be beneficial, should we allow the user to tell us when it might be beneficial?

-- Johan

| My question is: are there any kinds of functions, other than overloaded
| functions, that would benefit from specialization? If the answer is
| yes, can we apply the same heuristic you use for overloaded functions
| to specialize those kinds of functions? If we cannot automatically
| detect all the cases where specializing would be beneficial, should we
| allow the user to tell us when it might be beneficial?

If you allow specialisation on *values* then yes. That's what SpecConstr does. If you are talking of specialisation on *types* then I think it's very limited (aside from overloading). By all means show me otherwise.

Simon
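For reference, a minimal sketch of the *value* specialisation SpecConstr performs (the function is hypothetical; compile with -O2, which enables -fspec-constr, to see the effect): the inner loop is always called with an accumulator of a known constructor shape, and SpecConstr exploits that:

```haskell
module Main where

-- The accumulator is scrutinised on every iteration. SpecConstr sees
-- recursive calls of the shape `go (Just x) xs` and generates a copy
-- of `go` that takes x directly, avoiding repeatedly allocating and
-- deconstructing Just cells in the hot loop.
lastJust :: [Maybe Int] -> Maybe Int
lastJust = go Nothing
  where
    go acc []             = acc
    go acc (Nothing : xs) = go acc xs
    go _   (Just x  : xs) = go (Just x) xs

main :: IO ()
main = print (lastJust [Just 1, Nothing, Just 3, Nothing])
```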

Hi,
ivan.miljenovic:
On 24 September 2010 14:36, Michael Snoyman
wrote: On Fri, Sep 24, 2010 at 6:26 AM, Don Stewart
wrote: Perhaps a containers-inline fork is needed, for those who still need the speed.
Would it be possible to use CPP to turn the INLINE flags into a compile-time argument, ie:
#ifdef INLINE
{-# INLINE #-}
#endif
Since containers ships with GHC, wouldn't this then require an extra flag being used when building GHC to enable this?
And then to use it, you'd have to build your own GHC rather than using a pre-built binary like just about everyone does...
I'd hate to start seeing incompatible Data.Map.Maps floating around.
Agreed. At the very least, if there was a fork it would presumably have to be in a different module namespace to avoid namespace collisions, which would make the incompatibility obvious.
We're talking about a 3% increase in the size of the Map, a 2% increase in the size of the Map.hs benchmark binary, right?
For a 50% increase in Map function performance.
We are also talking about 3.7% increase of ghc-7.0 binary and 2.5% increase of libHSghc-7.0. Milan

On 24/09/2010 05:26, Don Stewart wrote:
bos:
On Thu, Sep 23, 2010 at 6:17 PM, Don Stewart
wrote: Well, I still don't understand what exactly the penalty of the code bloat is.
It's going to push I-cache misses through the roof when there are umpteen identical versions of the same function over Map ByteString Foo. Also, some of us would like to ship applications written in Haskell, and network bandwidth is a big concern there, having gone nowhere in years.
Reducing binary sizes via dynamic linking is going to make a lot more of an impact than additional unrolling. However, we have *no* other techniques for making containers faster.
Perhaps a containers-inline fork is needed, for those who still need the speed.
I absolutely agree that speed is important, and speed by default is a valuable property to have. The recent work on containers will be a tremendous benefit to the community.

Having said that, I'm not keen on forcing functions to be inlined at every call site: it takes away the choice from the user, and I like users to be able to tweak the tradeoff between code size and speed. This is the great thing about INLINABLE: it defers the choice to the client. Unfortunately it seems that these functions are just too big for GHC to consider worth inlining - GHC is quite conservative about code bloat with its default settings. You can already tweak the tradeoff yourself with -funfolding-use-threshold=100, but perhaps we need a more accessible, user-friendly way to do this, such as -O3. We don't need a fork of the containers package - just tell users that they need to turn up the knob to get better performance.

I'm not sure if we should be basing all our decisions on measurements done with Map Int. Clients using Map Int can already get a huge speed boost by switching to IntMap. Let's use something more realistic like Map ByteString or Map Integer with some big Integers.

Cheers, Simon
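For reference, the Map Int to IntMap switch Simon mentions is a mostly mechanical substitution, since the two types share most of their API (the qualified names below are just the usual convention):

```haskell
module Main where

import qualified Data.IntMap as IM
import qualified Data.Map    as M

-- Int-keyed code can usually switch by changing only the import and
-- the type signature; fromList, lookup, insert, etc. all carry over.
asMap :: M.Map Int String
asMap = M.fromList [(1, "one"), (2, "two")]

asIntMap :: IM.IntMap String
asIntMap = IM.fromList [(1, "one"), (2, "two")]

main :: IO ()
main = do
  print (M.lookup 2 asMap)     -- Just "two"
  print (IM.lookup 2 asIntMap) -- Just "two"
```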

On Fri, Sep 24, 2010 at 11:30 AM, Simon Marlow
I'm not sure if we should be basing all our decisions on measurements done with Map Int. Clients using Map Int can already get a huge speed boost by switching to IntMap. Let's use something more realistic like Map ByteString or Map Integer with some big Integers.
For the record I also ran the benchmarks using (small) Integer keys and the speedups are nearly identical.

On Thu, Sep 23, 2010 at 9:22 PM, Milan Straka
I in fact expected this outcome -- it is not inlining itself that is beneficial, it is the _specialization_ (which happens during inlining if possible).
Does that mean that instead of INLINABLE, we need SPECIALIZABLE? SPECIALIZABLE would be like the SPECIALIZE pragma, but for all types that are used throughout a program. The bloat would be proportional to the number of data types, not the number of function uses.

Cheers! =)

-- Felipe.

On Thu, Sep 23, 2010 at 7:01 PM, Felipe Lessa
On Thu, Sep 23, 2010 at 9:22 PM, Milan Straka
wrote: I in fact expected this outcome -- it is not inlining itself that is beneficial, it is the _specialization_ (which happens during inlining if possible).
Does that mean that instead of INLINABLE, we need SPECIALIZABLE?
Yes, this was a point Johan was trying to make when this came up a few weeks ago.
participants (12)
- Bryan O'Sullivan
- Don Stewart
- Edward Kmett
- Felipe Lessa
- Ian Lynagh
- Ivan Lazar Miljenovic
- Johan Tibell
- Michael Snoyman
- Milan Straka
- Simon Marlow
- Simon Peyton-Jones
- Thomas Schilling