Data.Map improvement patch

Hello,

Recently I was going through the Data.Set implementation and I was taken aback by the parameter \delta being set to 4. The implementation is based on Adams' bounded balance trees and uses the ratio \alpha=2, in which case the paper clearly states that \delta must be at least 4.646 (which is written in the source file Set.hs several lines above the assignment \delta=4; Map.hs uses \delta=5). So I read Adams' paper again and found out that the analysis described there contains two major mistakes [it is good to read the analysis from the paper now if you want to follow, pages 26 to 31]:

1) The analysis assumes that just before the balancing rotation, the heavier subtree of the unbalanced vertex was inserted to, i.e. the balance is ``off by 1''. But when the lighter subtree is deleted from, the balance is ``off by \delta'', which is worse. When taken into account, this results in the equation $\delta\le{n\over\alpha}+1$, which must hold for each positive integer n.

2) The cases of empty subtrees are not analysed. When they are considered, it follows that for \alpha=2 a single rotation can restore balance only when \delta<5. [Take a tree with 1 vertex in the left subtree and 5 in the right, delete the left one and voila...]

At this point I was puzzled why both the Data.Set and Data.Map implementations seem to work.

*) Data.Set: The constraint from the paper, \delta >= 4.646, comes from the inequality $(1/\alpha) >= {\delta(1+{1\over n})+1\over\delta^2-1}$, which must hold for each positive integer n. The worst case is n=1, which is the one chosen in the paper. But when using n=2, the resulting constraint is \delta>3.792. The remaining case n=1 can be solved by examining all possible situations for \alpha<=2 and \delta=4, all of which turn out to be correctly balanced using a single rotation. Also the new constraint $\delta\le{n\over\alpha}+1$ can be satisfied for \alpha=2 and \delta=4 by examining the small cases separately. The point is that the inequalities from the paper do not contain the floor function. Sure, for general parameters that is difficult to handle, but for specific parameters the results are much better when truncation is taken into account.

*) Data.Map: The constraint \delta<5 for \alpha=2 cannot be relaxed without changing the algorithm. But it turns out that the Data.Map implementation performs the balancing rotations not when one subtree is more than \delta times heavier than the other, as described in the paper, but when one subtree is at least \delta times heavier. So the Data.Map implementation behaves like the analysed implementation with \delta just a bit smaller than 5.

Ok, so both 4 and 5 work for the implementations of Data.Map and Data.Set. Which one is better? This parameter affects the height of the tree. With \delta=4 the worst-case subtree ratio is 1:4, so the height is at most log_{5/4} n ~ 3.1 log_2 n. With \delta=5 the height is at most log_{6/5} n ~ 3.8 log_2 n. In Haskell, a lot of operations must copy a whole path in the tree -- so a lower tree means less time and memory needed. There are more rotations with \delta=4, but we have to construct new nodes anyway, so one would expect a minor advantage of \delta=4 over \delta=5.
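To make the criterion concrete, here is a minimal sketch of a weight-balanced tree with these two parameters. It is only an illustration, not the library source: the names (Tree, node, balanceSketch, rotateL, rotateR) are invented, and the rebalancing test is written with >=, i.e. the "at least \delta times heavier" variant that Data.Map actually uses.

-- A sketch only, not the library code: a weight-balanced tree with the
-- parameters discussed above.  All names are invented for illustration.
data Tree a = Tip | Bin !Int a (Tree a) (Tree a)

delta, ratio :: Int
delta = 4   -- \delta: how much heavier one subtree may get before we rebalance
ratio = 2   -- \alpha: decides between a single and a double rotation

size' :: Tree a -> Int
size' Tip           = 0
size' (Bin s _ _ _) = s

node :: a -> Tree a -> Tree a -> Tree a
node x l r = Bin (size' l + size' r + 1) x l r

-- Restore balance after a single insertion into or deletion from l or r.
balanceSketch :: a -> Tree a -> Tree a -> Tree a
balanceSketch x l r
  | size' l + size' r <= 1     = node x l r
  | size' r >= delta * size' l = rotateL x l r   -- right subtree at least delta times heavier
  | size' l >= delta * size' r = rotateR x l r   -- left subtree at least delta times heavier
  | otherwise                  = node x l r

-- Left rotation; \alpha selects the single or the double variant.
rotateL :: a -> Tree a -> Tree a -> Tree a
rotateL x l (Bin _ rx rl rr)
  | size' rl < ratio * size' rr = node rx (node x l rl) rr                    -- single
  | Bin _ rlx rll rlr <- rl     = node rlx (node x l rll) (node rx rlr rr)    -- double
rotateL x l r = node x l r   -- empty right subtree: nothing to rotate

-- Right rotation, the mirror image of rotateL.
rotateR :: a -> Tree a -> Tree a -> Tree a
rotateR x (Bin _ lx ll lr) r
  | size' lr < ratio * size' ll = node lx ll (node x lr r)                    -- single
  | Bin _ lrx lrl lrr <- lr     = node lrx (node lx ll lrl) (node x lrr r)    -- double
rotateR x l r = node x l r   -- empty left subtree: nothing to rotate

With \delta=4 and \alpha=2 these are exactly the parameters argued for above; changing the >= tests to > would give the "more than \delta times heavier" reading of the paper.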
It is difficult for me to measure these things in Haskell (just adding a SPECIALIZE or increasing/decreasing the heap size can change the ordering of the execution times), but anyway, here are the times of 50000 sequential inserts, then 800000 lookups and 50000 deletes (the source is attached for the freaks):

Data.Set with \delta=4   180  300   72
Data.Set with \delta=5   208  304   76
Data.Map with \delta=4   212  328   84
Data.Map with \delta=5   256  324   88
IntSet                    56  356   48
IntMap                    56  360   48

So I would suggest changing "delta = 5" to "delta = 4" in Data.Map.hs and leaving "delta = 4" in the Data.Set.hs file. The darcs patch is included :)

Any comments are welcome,
Milan
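PS: The attached Test.hs is not reproduced here; the following is only a rough, self-contained sketch of that kind of driver (the helper names, e.g. timed, are made up for the illustration and it is not the attached source), timing n sequential inserts, 16*n lookups and n deletes of a Data.Map with getCPUTime:

import qualified Data.Map as Map
import Data.List (foldl')
import System.CPUTime (getCPUTime)

-- Print the CPU time an action takes, in ms (getCPUTime returns picoseconds).
timed :: String -> IO a -> IO a
timed label act = do
  t0 <- getCPUTime
  x  <- act
  t1 <- getCPUTime
  putStrLn (label ++ ": " ++ show ((t1 - t0) `div` (10 ^ 9)) ++ " ms")
  return x

main :: IO ()
main = do
  let n  = 50000 :: Int
      ks = [1 .. n]
  -- n sequential inserts
  m <- timed "insert" $
         let m' = foldl' (\acc k -> Map.insert k k acc) Map.empty ks
         in m' `seq` return m'
  -- 16*n lookups (lookups are fast, so more of them are measured)
  _ <- timed "lookup" $
         let s = sum [ Map.findWithDefault 0 k m | _ <- [1 .. 16 :: Int], k <- ks ]
         in s `seq` return s
  -- n sequential deletes
  _ <- timed "delete" $
         let m' = foldl' (flip Map.delete) m ks
         in m' `seq` return m'
  return ()

One run of such a driver prints the three numbers that make up one row of the table above.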

On Mon, 2008-04-21 at 09:38 +0200, Milan Straka wrote:
It is difficult for me to measure these things in Haskell (just adding a SPECIALIZE or increasing/decreasing the heap size can change the ordering of the execution times), but anyway, here are the times of 50000 sequential inserts, then 800000 lookups and 50000 deletes (the source is attached for the freaks):

Data.Set with \delta=4   180  300   72
Data.Set with \delta=5   208  304   76
Data.Map with \delta=4   212  328   84
Data.Map with \delta=5   256  324   88
IntSet                    56  356   48
IntMap                    56  360   48
Sorry, I'm confused, what are the columns here exactly?

Duncan

Hello,
On Mon, 2008-04-21 at 09:38 +0200, Milan Straka wrote:
It is difficult for me to measure these things in Haskell (just adding a SPECIALIZE or increasing/decreasing the heap size can change the ordering of the execution times), but anyway, here are the times of 50000 sequential inserts, then 800000 lookups and 50000 deletes (the source is attached for the freaks):

Data.Set with \delta=4   180  300   72
Data.Set with \delta=5   208  304   76
Data.Map with \delta=4   212  328   84
Data.Map with \delta=5   256  324   88
IntSet                    56  356   48
IntMap                    56  360   48
Sorry, I'm confused, what are the columns here exactly?
Sorry, I should have made myself clearer. The first column is the time of 50000 sequential inserts ([1..50000]), the second column is the time of 800000 lookups [a bigger number because they are fast, 800k = 16*50k], and the third column is the time of 50000 sequential deletes ([1..50000]). Everything is measured in ms.

One can think of a test with a random sequence -- the source can be modified easily to support it. The results are (in the same format, i.e. insert / lookup / delete, in ms):

Random seq; Data.Set with \delta=4   300  636  256
Random seq; Data.Set with \delta=5   300  644  256
Random seq; Data.Map with \delta=4   372  692  268
Random seq; Data.Map with \delta=5   376  700  272
Random seq; IntSet                   216  572  168
Random seq; IntMap                   228  584  176

As one should expect, they are inconclusive. That's because when the inserted data are truly random, the rebalancing is not really needed. (What is more interesting is that all the tests got slower than their sequential counterparts. I think it might be because [1..50000] can be inlined better than the random list... Maybe list fusion? I do not really know.)

Milan

PS: Here is a diff of the random test:

--- Test.hs     2008-04-21 09:33:55.000000000 +0200
+++ TestNew.hs  2008-04-21 13:17:21.000000000 +0200
@@ -8,10 +8,17 @@
 import System.CPUTime
+import System.Random

 force_list list = sum list
 {-# SPECIALIZE force_list::[Int]->Int #-}

+gen_rnd n = rnd (mkStdGen 42) (Map4.fromList [(b,0)|b<-[1..n]]) where
+  rnd g m | Map4.null m = []
+  rnd g m = let n = Map4.size m
+                (i,g') = randomR (0,n-1) g
+            in fst (Map4.elemAt i m) : rnd g' (Map4.deleteAt i m)
+
 test::Int->c->(Int->c->c)->(Int->c->Bool)->(Int->c->c)->(c->[Int])->IO ()
@@ -20,23 +27,23 @@
 test n cs insert find delete tolist =
   do
-    let ls=[1..n]
+    let ls=gen_rnd n
     ts<-force_list ls `seq` getCPUTime
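For completeness, here is a standalone version of the gen_rnd helper from the diff (only a sketch: Map4 in the diff is presumably the modified Data.Map module under test, so ordinary Data.Map is used here, and the value type is kept as Int to mirror the diff). It builds a pseudo-random permutation of [1..n] by repeatedly picking a random position in the remaining map with elemAt and removing it with deleteAt:

import qualified Data.Map as Map
import System.Random (StdGen, mkStdGen, randomR)

-- Produce a pseudo-random permutation of [1..n]: repeatedly pick a uniformly
-- random position in the remaining map and delete it.
genRnd :: Int -> [Int]
genRnd n = go (mkStdGen 42) (Map.fromList [ (b, 0 :: Int) | b <- [1 .. n] ])
  where
    go :: StdGen -> Map.Map Int Int -> [Int]
    go g m
      | Map.null m = []
      | otherwise  = let (i, g') = randomR (0, Map.size m - 1) g
                     in  fst (Map.elemAt i m) : go g' (Map.deleteAt i m)

main :: IO ()
main = print (take 10 (genRnd 50000))  -- first few elements of the permutation

Because the seed is fixed (mkStdGen 42), the generated sequence is the same on every run, which keeps the measurements repeatable.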