Data.Map improvement patch

Hello,

Recently I was going through the Data.Set implementation and I was taken aback by the parameter \delta being set to 4. The implementation is based on Adams' bounded balance trees and uses the ratio \alpha=2, in which case the paper clearly states that \delta must be at least 4.646 (which is written in the source file Set.hs several lines above the assignment \delta=4; Map.hs uses \delta=5). So I read Adams' paper again and found out that the analysis described there contains two major mistakes [it is good to read the analysis from the paper now if you want to follow, pages 26 to 31]:

1) The analysis assumes that just before the balancing rotation, the heavier subtree of the unbalanced vertex was inserted to, i.e. the balance is ``off by 1''. But when the lighter subtree is deleted from, the balance is ``off by \delta'', which is worse. When taken into account, this results in the equation $\delta\le{n\over\alpha}+1$, which must hold for each positive integer n.

2) The cases of empty subtrees are not analysed. When they are considered, it follows that for \alpha=2 a single rotation can restore balance only when \delta<5. [Take a tree with 1 vertex in the left subtree and 5 in the right, delete the left one and voila...]

At this point I was puzzled why both the Data.Set and Data.Map implementations seem to work.

*) Data.Set: The constraint from the paper, \delta >= 4.646, comes from the inequality $(1/\alpha) >= {\delta(1+{1\over n})+1\over\delta^2-1}$, which must hold for each positive integer n. The worst case is n=1, which is the one chosen in the paper. But when using n=2, the resulting constraint is \delta>3.792. The remaining case n=1 can be solved by examining all possible situations for \alpha<=2 and \delta=4, all of which turn out to be correctly balanced using a single rotation. Also the new constraint $\delta\le{n\over\alpha}+1$ can be satisfied for \alpha=2 and \delta=4 by examining the small cases separately. The point is that the inequalities from the paper do not contain the floor function. Sure, for general parameters that is difficult to handle, but for specific parameters the results are much better when truncation is taken into account.

*) Data.Map: The constraint \delta<5 for \alpha=2 cannot be relaxed without changing the algorithm. But it turns out that the Data.Map implementation performs the balancing rotations not when one subtree is more than \delta times heavier than the other, as described in the paper, but when one subtree is at least \delta times heavier. So the Data.Map implementation behaves like the analysed implementation with \delta just a bit smaller than 5.

Ok, so both 4 and 5 work for the implementations of Data.Map and Data.Set. Which one is better? This parameter affects the height of the tree. With \delta=4 the worst-case subtree ratio is 1:4, so the height is at most log_{5/4} n ~ 3.1 log_2 n. With \delta=5 the height is at most log_{6/5} n ~ 3.8 log_2 n. In Haskell, a lot of operations must copy a whole path in the tree -- so a lower tree means less time and memory needed. There are more rotations with \delta=4, but we have to construct new nodes anyway, so one would expect a minor advantage of \delta=4 over \delta=5.
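To make the criterion concrete, here is a minimal sketch of a weight-balanced tree with these two parameters. It is only an illustration, not the library source: the names (Tree, node, balanceSketch, rotateL, rotateR) are invented, and the rebalancing test is written with >=, i.e. the "at least \delta times heavier" variant that Data.Map actually uses.

-- A sketch only, not the library code: a weight-balanced tree with the
-- parameters discussed above.  All names are invented for illustration.
data Tree a = Tip | Bin !Int a (Tree a) (Tree a)

delta, ratio :: Int
delta = 4   -- \delta: how much heavier one subtree may get before we rebalance
ratio = 2   -- \alpha: decides between a single and a double rotation

size' :: Tree a -> Int
size' Tip           = 0
size' (Bin s _ _ _) = s

node :: a -> Tree a -> Tree a -> Tree a
node x l r = Bin (size' l + size' r + 1) x l r

-- Restore balance after a single insertion into or deletion from l or r.
balanceSketch :: a -> Tree a -> Tree a -> Tree a
balanceSketch x l r
  | size' l + size' r <= 1     = node x l r
  | size' r >= delta * size' l = rotateL x l r   -- right subtree at least delta times heavier
  | size' l >= delta * size' r = rotateR x l r   -- left subtree at least delta times heavier
  | otherwise                  = node x l r

-- Left rotation; \alpha selects the single or the double variant.
rotateL :: a -> Tree a -> Tree a -> Tree a
rotateL x l (Bin _ rx rl rr)
  | size' rl < ratio * size' rr = node rx (node x l rl) rr                    -- single
  | Bin _ rlx rll rlr <- rl     = node rlx (node x l rll) (node rx rlr rr)    -- double
rotateL x l r = node x l r   -- empty right subtree: nothing to rotate

-- Right rotation, the mirror image of rotateL.
rotateR :: a -> Tree a -> Tree a -> Tree a
rotateR x (Bin _ lx ll lr) r
  | size' lr < ratio * size' ll = node lx ll (node x lr r)                    -- single
  | Bin _ lrx lrl lrr <- lr     = node lrx (node lx ll lrl) (node x lrr r)    -- double
rotateR x l r = node x l r   -- empty left subtree: nothing to rotate

With \delta=4 and \alpha=2 these are exactly the parameters argued for above; changing the >= tests to > would give the "more than \delta times heavier" reading of the paper.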
It is difficult for me to measure these things in Haskell (just adding a SPECIALIZE or increasing/decreasing the heap size can change the ordering of the execution times), but anyway, here are the times of 50000 sequential inserts, then 800000 lookups and 50000 deletes (the source is attached for the freaks):

Data.Set with \delta=4   180  300   72
Data.Set with \delta=5   208  304   76
Data.Map with \delta=4   212  328   84
Data.Map with \delta=5   256  324   88
IntSet                    56  356   48
IntMap                    56  360   48

So I would suggest changing "delta = 5" to "delta = 4" in Data.Map.hs and leaving "delta = 4" in the Data.Set.hs file. The darcs patch is included :)

Any comments are welcome,
Milan
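PS: The attached Test.hs is not reproduced here; the following is only a rough, self-contained sketch of that kind of driver (the helper names, e.g. timed, are made up for the illustration and it is not the attached source), timing n sequential inserts, 16*n lookups and n deletes of a Data.Map with getCPUTime:

import qualified Data.Map as Map
import Data.List (foldl')
import System.CPUTime (getCPUTime)

-- Print the CPU time an action takes, in ms (getCPUTime returns picoseconds).
timed :: String -> IO a -> IO a
timed label act = do
  t0 <- getCPUTime
  x  <- act
  t1 <- getCPUTime
  putStrLn (label ++ ": " ++ show ((t1 - t0) `div` (10 ^ 9)) ++ " ms")
  return x

main :: IO ()
main = do
  let n  = 50000 :: Int
      ks = [1 .. n]
  -- n sequential inserts
  m <- timed "insert" $
         let m' = foldl' (\acc k -> Map.insert k k acc) Map.empty ks
         in m' `seq` return m'
  -- 16*n lookups (lookups are fast, so more of them are measured)
  _ <- timed "lookup" $
         let s = sum [ Map.findWithDefault 0 k m | _ <- [1 .. 16 :: Int], k <- ks ]
         in s `seq` return s
  -- n sequential deletes
  _ <- timed "delete" $
         let m' = foldl' (flip Map.delete) m ks
         in m' `seq` return m'
  return ()

One run of such a driver prints the three numbers that make up one row of the table above.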

On Mon, 2008-04-21 at 09:38 +0200, Milan Straka wrote:
It is difficult for me to measure these things in Haskell (just adding a SPECIALIZE or increasing/decreasing the heap size can change the ordering of the execution times), but anyway, here are the times of 50000 sequential inserts, then 800000 lookups and 50000 deletes (the source is attached for the freaks):

Data.Set with \delta=4   180  300   72
Data.Set with \delta=5   208  304   76
Data.Map with \delta=4   212  328   84
Data.Map with \delta=5   256  324   88
IntSet                    56  356   48
IntMap                    56  360   48
Sorry, I'm confused, what are the columns here exactly?

Duncan

Hello,
On Mon, 2008-04-21 at 09:38 +0200, Milan Straka wrote:
It is difficult for me to measure these things in Haskell (just adding a SPECIALIZE or increasing/decreasing the heap size can change the ordering of the execution times), but anyway, here are the times of 50000 sequential inserts, then 800000 lookups and 50000 deletes (the source is attached for the freaks):

Data.Set with \delta=4   180  300   72
Data.Set with \delta=5   208  304   76
Data.Map with \delta=4   212  328   84
Data.Map with \delta=5   256  324   88
IntSet                    56  356   48
IntMap                    56  360   48
Sorry, I'm confused, what are the columns here exactly?
Sorry, I should have made myself clearer. The first column is the time of 50000 sequential inserts ([1..50000]), the second column is the time of 800000 lookups [a bigger number because they are fast, 800k = 16*50k], and the third column is the time of 50000 sequential deletes ([1..50000]). Everything is measured in ms.

One can think of a test with a random sequence -- the source can be modified easily to support it. The results are (in the same format, i.e. insert / lookup / delete, in ms):

Random seq; Data.Set with \delta=4   300  636  256
Random seq; Data.Set with \delta=5   300  644  256
Random seq; Data.Map with \delta=4   372  692  268
Random seq; Data.Map with \delta=5   376  700  272
Random seq; IntSet                   216  572  168
Random seq; IntMap                   228  584  176

As one should expect, they are inconclusive. That's because when the inserted data are truly random, the rebalancing is not really needed. (What is more interesting is that all the tests got slower than their sequential counterparts. I think it might be because [1..50000] can be inlined better than the random list... Maybe list fusion? I do not really know.)

Milan

PS: Here is a diff of the random test:

--- Test.hs     2008-04-21 09:33:55.000000000 +0200
+++ TestNew.hs  2008-04-21 13:17:21.000000000 +0200
@@ -8,10 +8,17 @@
 import System.CPUTime
+import System.Random

 force_list list = sum list
 {-# SPECIALIZE force_list::[Int]->Int #-}

+gen_rnd n = rnd (mkStdGen 42) (Map4.fromList [(b,0)|b<-[1..n]]) where
+  rnd g m | Map4.null m = []
+  rnd g m = let n = Map4.size m
+                (i,g') = randomR (0,n-1) g
+            in fst (Map4.elemAt i m) : rnd g' (Map4.deleteAt i m)
+
 test::Int->c->(Int->c->c)->(Int->c->Bool)->(Int->c->c)->(c->[Int])->IO ()
@@ -20,23 +27,23 @@
 test n cs insert find delete tolist =
   do
-    let ls=[1..n]
+    let ls=gen_rnd n
     ts<-force_list ls `seq` getCPUTime
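For completeness, here is a standalone version of the gen_rnd helper from the diff (only a sketch: Map4 in the diff is presumably the modified Data.Map module under test, so ordinary Data.Map is used here, and the value type is kept as Int to mirror the diff). It builds a pseudo-random permutation of [1..n] by repeatedly picking a random position in the remaining map with elemAt and removing it with deleteAt:

import qualified Data.Map as Map
import System.Random (StdGen, mkStdGen, randomR)

-- Produce a pseudo-random permutation of [1..n]: repeatedly pick a uniformly
-- random position in the remaining map and delete it.
genRnd :: Int -> [Int]
genRnd n = go (mkStdGen 42) (Map.fromList [ (b, 0 :: Int) | b <- [1 .. n] ])
  where
    go :: StdGen -> Map.Map Int Int -> [Int]
    go g m
      | Map.null m = []
      | otherwise  = let (i, g') = randomR (0, Map.size m - 1) g
                     in  fst (Map.elemAt i m) : go g' (Map.deleteAt i m)

main :: IO ()
main = print (take 10 (genRnd 50000))  -- first few elements of the permutation

Because the seed is fixed (mkStdGen 42), the generated sequence is the same on every run, which keeps the measurements repeatable.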