
If I am reading this right, the numbers for alter/delete/update are rather unsettling in your proposal.
Is there a compromise point that does not cause those to suffer so much?
Because otherwise, doubling the object code size doesn't seem like such a terrible burden for such significant speed benefits as we currently see in those areas.
-Edward
On Sep 23, 2010, at 8:22 PM, Milan Straka wrote:
Hi Simon, others,
here come the results.
I consider 4 variants of containers:
- inline: the current fox.auryn.cz/darcs/containers with many INLINE pragmas
- nothing: all INLINE pragmas deleted
- inlinable: all INLINE pragmas replaced with INLINABLE
- mix: all INLINE pragmas replaced with INLINABLE, but the lookup*, member* and insert* functions are kept INLINE
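As an illustration of what these variants mean at the source level, here is a minimal sketch (not the actual containers source; the wrapper names are invented) of the same function under each pragma:

```haskell
module Main where

import qualified Data.Map as M

-- "inline" variant: GHC is told to inline (and hence specialize)
-- the body at every call site, at the cost of duplicated code.
{-# INLINE lookupInline #-}
lookupInline :: Ord k => k -> M.Map k a -> Maybe a
lookupInline = M.lookup

-- "inlinable" variant: the unfolding is kept in the interface file,
-- so GHC may specialize it per call site but is not forced to inline.
{-# INLINABLE lookupInlinable #-}
lookupInlinable :: Ord k => k -> M.Map k a -> Maybe a
lookupInlinable = M.lookup

-- "nothing" variant: no pragma; all callers share one
-- dictionary-passing copy of the code.
lookupPlain :: Ord k => k -> M.Map k a -> Maybe a
lookupPlain = M.lookup

main :: IO ()
main = do
  let m = M.fromList [(1 :: Int, "one"), (2, "two")]
  -- All three behave identically; they differ only in generated code.
  print (lookupInline 1 m, lookupInlinable 2 m, lookupPlain 3 m)
```

The three wrappers are observationally identical; the pragmas only change how much code GHC emits at each call site.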
First, code bloat. Binary sizes are in bytes; all binaries are stripped, i386.
* map-properties - binary running the tests of Data.Map, uses many functions from Data.Map
  - inline: 5266740
  - nothing: 2325940
  - inlinable: 2348116
  - mix: 2571156
  That is a 2.9MB increase in size from nothing to inline.
* Map.hs - benchmark of Data.Map, using only several functions
  - inline: 24465452
  - nothing: 24302348
  - inlinable: 24326924
  - mix: 24417100
  Only 160kB from nothing to inline.
Another code bloat measurement. This time I compiled GHC itself:
- ghc-7.0: GHC with containers containing INLINE pragmas: 20723280B
- ghc-7.0-noinlinemap: GHC with containers without INLINE pragmas: 19965464B
That is a 740kB difference between the ghc-7.0 binaries. (There is a 1082kB difference between the libHSghc-7.0.0.a libraries.)
Another code bloat measurement, the Map.hs benchmark again. All packages needed for the benchmark (most notably criterion) are compiled:
- against a containers package containing INLINE pragmas: 24465452B
- against a containers package without INLINE pragmas: 23753808B
That is a 700kB difference _on one binary_.
So the code bloat is considerable.
Now the timing results. The numbers are the change in runtime in %:
[1] speedup from inline to nothing (negative numbers mean a slowdown)
[2] speedup from inline to inlinable
[3] speedup from inline to mix
(The results are accurate to within several %; I did not have time to do better.)
                                [1]      [2]      [3]
alter                        -50.33   -49.17   -47.51
delete                       -46.67   -50.93   -48.79
difference                     1.93     0.18     6.33
insert                       -44.11   -44.25    -3.09
insertLookupWithKey empty    -14.62   -15.70     1.22
insertLookupWithKey update   -21.54   -19.23     6.20
insertLookupWithKey' empty   -16.79   -23.67    -2.82
insertLookupWithKey' update  -26.62   -26.99     0.47
insertWith empty             -44.42   -48.29    -5.88
insertWith update            -43.28   -45.03     2.50
insertWith' empty            -45.30   -46.66    -3.45
insertWith' update           -38.15   -38.27     3.94
insertWithKey empty          -42.60   -45.58    -6.99
insertWithKey update         -40.13   -36.62    -1.71
insertWithKey' empty         -46.54   -46.53    -4.22
insertWithKey' update        -43.10   -42.16     0.98
intersection                   1.42    -1.75     0.37
lookup                       -94.03   -94.62     6.14
lookupIndex                  -86.25   -86.92     4.91
union                         -0.04    -0.87     2.12
update                       -50.64   -55.27   -54.17
updateLookupWithKey          -43.90   -43.31   -49.22
So INLINABLE does not help at all.
I in fact expected this outcome -- it is not inlining itself that is beneficial, it is the _specialization_ (which happens during inlining if possible).
Please bear in mind that the speedup caused by specialization is significant only when the Ord compare function is trivial (when the comparison compiles to an instruction like <=, the difference between that instruction and a dynamic dictionary call is very big). That means key types like Int, Word, Int8 and so on. But with those types one usually uses IntMap instead. What I am saying is that the usages of Map which would benefit from specialization may not be very common.
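To make that point concrete, here is a small illustrative sketch (the function names are invented) of the two options for Int keys:

```haskell
module Main where

import qualified Data.Map as M
import qualified Data.IntMap as IM

-- With Data.Map, comparisons on the Int keys go through the Ord
-- dictionary unless lookup is inlined/specialized at the call site.
viaMap :: Int -> Maybe String
viaMap k = M.lookup k (M.fromList [(1, "one"), (2, "two")])

-- Data.IntMap is keyed on Int directly (a Patricia trie), so there
-- is no dictionary to specialize away in the first place.
viaIntMap :: Int -> Maybe String
viaIntMap k = IM.lookup k (IM.fromList [(1, "one"), (2, "two")])

main :: IO ()
main = print (viaMap 2, viaIntMap 2)
```

A user who cares about trivial-comparison keys would typically already be on the IntMap side of this sketch, which is why the Map specialization win may rarely be cashed in.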
At this point I suggest the following:
- as an immediate solution, remove all INLINE pragmas from the containers package, leaving INLINE only on the lookup*, member* and insert* functions. This causes only a small code bloat and quite big wins on these functions, which I consider to be very frequently used.
  I could also imagine removing INLINE from the insert* functions, or adding INLINE to the delete*, update* and alter* functions.
- in the long term, find a way to get the speedup back without the code bloat. The problem is that _every_ call to lookup gets its own copy of the code in the binary. We need to share the code of specialized functions (as C++ templates do). I mean the following: before linking, find out which dictionaries are passed to the Data.Map.lookup method, specialize Data.Map.lookup for those types (only once for each type), and redirect the calls that pass those specific dictionaries to the specialized copies. The effect should be as if the user wrote SPECIALIZE for each method, for exactly the types it is non-generically used at.
- I would not bother with INLINABLE currently.
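For comparison, GHC's existing user-level mechanism for per-type sharing is the SPECIALIZE pragma; a hedged sketch (the wrapper name is invented) of doing by hand what the proposed link-time scheme would do automatically:

```haskell
module Main where

import qualified Data.Map as M

-- One shared, Int-specialized copy of this wrapper is generated, and
-- Int call sites are rewritten to use it -- roughly the effect the
-- proposed scheme would achieve for every type actually used. How
-- much it wins in practice depends on M.lookup's own unfolding being
-- available (e.g. via INLINABLE) so the specialization can reach it.
{-# SPECIALIZE lookupM :: Int -> M.Map Int a -> Maybe a #-}
lookupM :: Ord k => k -> M.Map k a -> Maybe a
lookupM = M.lookup

main :: IO ()
main = print (lookupM (2 :: Int) (M.fromList [(1, "one"), (2, "two")]))
```

The difference from INLINE is that there is one specialized copy per type, not one per call site, which is exactly the sharing the proposal asks for.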
Simon, others, do you agree with my short-term solution? If you do, I will push the patches (I need to do so before I fly to ICFP, which is in 30 hours).
Cheers, Milan
as you probably know, there are several containers patches dealing with performance, a separate testsuite and benchmarking.
Specifically, I mean the tickets:
http://hackage.haskell.org/trac/ghc/ticket/4277
http://hackage.haskell.org/trac/ghc/ticket/4279
http://hackage.haskell.org/trac/ghc/ticket/4280
http://hackage.haskell.org/trac/ghc/ticket/4311
http://hackage.haskell.org/trac/ghc/ticket/4312
http://hackage.haskell.org/trac/ghc/ticket/4333
I have all of them in the following repository http://fox.auryn.cz/darcs/containers/
There are no user-visible changes, just performance and the split testsuite.
There have been no comments for some time, so I would like to push them to containers.
There is just the code-bloat issue to solve. The excessive use of the INLINE pragma causes large code bloat. I am aware of it; it happened even when I was benchmarking the containers.
Personally I vote for changing INLINE to INLINABLE, leaving INLINE just on:
- the member/lookup functions
- the insert* functions
as these benefit from specialization a lot and do not cause such bloat (measured during my internship; I do not recall the numbers now).
I expect performance to drop a bit, but I think it is better to have slightly slower containers without such code bloat.
Simon, I will be flying to ICFP soon, but if you want, I can prepare the INLINE->INLINABLE patches and put them in the repo.
That would help a lot, thanks. I started on this today but I'm a bit overloaded, and getting criterion installed wasn't very smooth (a couple of package breakages, but nothing serious, and patches are upstream).
Cheers, Simon
_______________________________________________
Libraries mailing list
Libraries@haskell.org
http://www.haskell.org/mailman/listinfo/libraries