Red-black tree performance

Adrien Haxaire

20 Mar 2012 20 Mar '12

2:37 p.m.

Dear list, I had to implement a red-black tree in C++. So I took Okasaki's functional pearl and ported the algorithm to C++, where the rebalancing is done by following parent pointers. I shew the Haskell algorithm to colleagues, and after admitting it is impressive, they told me whatever the beauty as long as we have performance. So I benchmarked it and the C++ implementation, insertion of one million elements, and the C++ version is 15 times faster. Mainly because in C++ it is just moving pointers around I guess. I have to mention that the space usage is outrageous with Haskell. The Tree is strict in its subtrees, but I think the problem comes from the laziness of the balance function, which needs to hold the whole tree. This reduces to my main question: how can I decompose balance in a stricter version? Or maybe I'm completely wrong and there is something obvious I am missing? If you want to take a look, the code is here: http://hpaste.org/65611 Adrien PS: compiler options and profiling stderr output

...

ghc -Wall -O2 -prof -auto-all -rtsopts rbt.hs

...

rbt.exe +RTS -K20M -p -sstderr -RTS 1000000 964,079,612 bytes allocated in the heap 230,842,072 bytes copied during GC 58,442,812 bytes maximum residency (7 sample(s)) 327,812 bytes maximum slop 105 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed) Avg pause Max pause Gen 0 1810 colls, 0 par 0.30s 0.27s 0.0001s 0.0034s Gen 1 7 colls, 0 par 0.19s 0.19s 0.0278s 0.0938s INIT time 0.00s ( 0.00s elapsed) MUT time 1.78s ( 1.82s elapsed) GC time 0.48s ( 0.46s elapsed) RP time 0.00s ( 0.00s elapsed) PROF time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 2.28s ( 2.28s elapsed) %GC time 21.2% (20.3% elapsed) Alloc rate 537,387,643 bytes per MUT second Productivity 78.8% of total user, 78.5% of total elapsed -- Adrien Haxaire www.adrienhaxaire.org | @adrienhaxaire

Show replies by date

Lorenzo Bolla

20 Mar 20 Mar

4:27 p.m.

Your scripts stack-overflows on my box. I've made 2 little changes, to ensure strictness. I believe your "insertList" is building up a huge list of thunks. import Data.Foldable (foldr') <snip> main :: IO () main = putStrLn $ show $ maxTree $ foldr' insert Empty toInsert Better? L. On Tue, Mar 20, 2012 at 2:37 PM, Adrien Haxaire wrote:

...

Dear list,

I had to implement a red-black tree in C++. So I took Okasaki's functional pearl and ported the algorithm to C++, where the rebalancing is done by following parent pointers. I shew the Haskell algorithm to colleagues, and after admitting it is impressive, they told me whatever the beauty as long as we have performance. So I benchmarked it and the C++ implementation, insertion of one million elements, and the C++ version is 15 times faster. Mainly because in C++ it is just moving pointers around I guess. I have to mention that the space usage is outrageous with Haskell.

The Tree is strict in its subtrees, but I think the problem comes from the laziness of the balance function, which needs to hold the whole tree.

This reduces to my main question: how can I decompose balance in a stricter version? Or maybe I'm completely wrong and there is something obvious I am missing?

If you want to take a look, the code is here: http://hpaste.org/65611

Adrien

PS: compiler options and profiling stderr output

ghc -Wall -O2 -prof -auto-all -rtsopts rbt.hs

...
rbt.exe +RTS -K20M -p -sstderr -RTS

...
1000000 964,079,612 bytes allocated in the heap 230,842,072 bytes copied during GC 58,442,812 bytes maximum residency (7 sample(s)) 327,812 bytes maximum slop 105 MB total memory in use (0 MB lost due to fragmentation)

Tot time (elapsed) Avg pause Max pause Gen 0 1810 colls, 0 par 0.30s 0.27s 0.0001s 0.0034s Gen 1 7 colls, 0 par 0.19s 0.19s 0.0278s 0.0938s

INIT time 0.00s ( 0.00s elapsed) MUT time 1.78s ( 1.82s elapsed) GC time 0.48s ( 0.46s elapsed) RP time 0.00s ( 0.00s elapsed) PROF time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 2.28s ( 2.28s elapsed)

%GC time 21.2% (20.3% elapsed)

Alloc rate 537,387,643 bytes per MUT second

Productivity 78.8% of total user, 78.5% of total elapsed

-- Adrien Haxaire www.adrienhaxaire.org | @adrienhaxaire

______________________________**_________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/**mailman/listinfo/beginners http://www.haskell.org/mailman/listinfo/beginners

Adrien Haxaire

5:05 p.m.

On Tue, Mar 20, 2012 at 04:27:08PM +0000, Lorenzo Bolla wrote:

...

Your scripts stack-overflows on my box.

I've made 2 little changes, to ensure strictness. I believe your "insertList" is building up a huge list of thunks.

import Data.Foldable (foldr') <snip> main :: IO () main = putStrLn $ show $ maxTree $ foldr' insert Empty toInsert

Better? L.

Yes definitely, thank you. That was someting obvious that I didn't see. Now I have 0.64 seconds for C++ and 3.95 for Haskell. That's better! The ratio 3.95/0.64 is 6.171875, which is more in the range I expected. That's twice better from what I had before. I am still looking at the balance function, maybe there is some things I can optimize there as well. -- Adrien Haxaire www.adrienhaxaire.org | @adrienhaxaire

Heinrich Apfelmus

5:30 p.m.

Lorenzo Bolla wrote:

...

Your scripts stack-overflows on my box.

I've made 2 little changes, to ensure strictness. I believe your "insertList" is building up a huge list of thunks.

import Data.Foldable (foldr') <snip> main :: IO () main = putStrLn $ show $ maxTree $ foldr' insert Empty toInsert

Better?

Huh, why foldr' and not Data.List.foldl' ? As far as I understand, the former has to evaluate the whole spine of the list first before it can begin to assemble the tree. The latter can start building trees right away. I'm not sure whether the integers in toInsert are evaluated to WHNF. You can make the insert function strict in the first argument to make sure that they are. Best regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

Lorenzo Bolla

9:03 p.m.

On Tue, Mar 20, 2012 at 5:30 PM, Heinrich Apfelmus < apfelmus@quantentunnel.de> wrote:

...

Lorenzo Bolla wrote:

...
Your scripts stack-overflows on my box.

I've made 2 little changes, to ensure strictness. I believe your "insertList" is building up a huge list of thunks.

import Data.Foldable (foldr') <snip> main :: IO () main = putStrLn $ show $ maxTree $ foldr' insert Empty toInsert

Better?

Huh, why foldr' and not Data.List.foldl' ? As far as I understand, the former has to evaluate the whole spine of the list first before it can begin to assemble the tree. The latter can start building trees right away.

Yes, definitely foldl' is worth a try, too: main = putStrLn $ show $ maxTree $ foldl' (flip insert) Empty toInsert L.

Adrien Haxaire

9:55 p.m.

On Tue, Mar 20, 2012 at 09:03:30PM +0000, Lorenzo Bolla wrote:

...

On Tue, Mar 20, 2012 at 5:30 PM, Heinrich Apfelmus < apfelmus@quantentunnel.de> wrote:

...
Huh, why foldr' and not Data.List.foldl' ? As far as I understand, the former has to evaluate the whole spine of the list first before it can begin to assemble the tree. The latter can start building trees right away.

Yes, definitely foldl' is worth a try, too: main = putStrLn $ show $ maxTree $ foldl' (flip insert) Empty toInsert

L.

I tried with foldl'. I modified the code at several places to match the argument pattern, and now I see why flip is so useful :) The conclusion is also interesting: the productivity climbs up to 92%, while the calculation time raises to 6.3s. I guess that the choice is space or time, as often. Another interesting point. By using foldr' and removing the exclamation marks in the Tree data type, I even get better results: 3.76s. Which means a ratio of 5.87. I tried adding some bang patterns and inline pragmas but it didn't help. I also ran the benchmark with 10 000 000 insertions and the ratio remains around 6. Which is quite acceptable, right? I'll be glad to show those results tomorrow. I wouldn't imagine that GHC would optimize so well plain Haskell code. The algorithm is beautiful, its implementation too, and it produces fast code. Dear it's so great to learn Haskell! Thank you Lorenzo and Heinrich, I have learnt a lot today. Best regards, Adrien -- Adrien Haxaire www.adrienhaxaire.org | @adrienhaxaire

Heinrich Apfelmus

21 Mar 21 Mar

9:27 a.m.

Adrien Haxaire wrote:

...

I tried with foldl'. I modified the code at several places to match the argument pattern, and now I see why flip is so useful :) The conclusion is also interesting: the productivity climbs up to 92%, while the calculation time raises to 6.3s. I guess that the choice is space or time, as often.

92% productivity seems right for me. In contrast, 20% garbage collection may be a sign that something went wrong.

...

Another interesting point. By using foldr' and removing the exclamation marks in the Tree data type, I even get better results: 3.76s. Which means a ratio of 5.87. I tried adding some bang patterns and inline pragmas but it didn't help.

I think that this is likely due to laziness: in the very end, you only query the rightmost element. After a while, the program simply won't evaluate the balancing on the left side of the tree, as you're not asking it to evaluate anything there. So, you're not necessarily comparing apples and apples here. But on the other hand, maybe that's a performance disadvantage of the C++ version. In Haskell, performance depends a lot on usage patterns. Best regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

Lorenzo Bolla

10:02 a.m.

On Wed, Mar 21, 2012 at 9:27 AM, Heinrich Apfelmus < apfelmus@quantentunnel.de> wrote:

...

Adrien Haxaire wrote:

...
I tried with foldl'. I modified the code at several places to match the argument pattern, and now I see why flip is so useful :) The conclusion is also interesting: the productivity climbs up to 92%, while the calculation time raises to 6.3s. I guess that the choice is space or time, as often.

92% productivity seems right for me. In contrast, 20% garbage collection may be a sign that something went wrong.

I think that this is likely due to laziness: in the very end, you only

...

query the rightmost element. After a while, the program simply won't evaluate the balancing on the left side of the tree, as you're not asking it to evaluate anything there.

So, you're not necessarily comparing apples and apples here. But on the other hand, maybe that's a performance disadvantage of the C++ version. In Haskell, performance depends a lot on usage patterns.

This is very true. In fact, after some tweaking, I found that the best solution is using foldl', lazy type and force some strictness in "insert" using "seq". See below: import Data.Foldable (foldl', foldr') data Color = Red | Black deriving (Show) data Tree a = Empty | Node Color (Tree a) a (Tree a) deriving (Show) insert :: Ord a => a -> Tree a -> Tree a insert x t = makeBlack (ins t) where ins Empty = Node Red Empty x Empty -- ins (Node color a y b) | x < y = ins a `seq` balance color (ins a) y b -- | x == y = Node color a y b -- | x > y = ins b `seq` balance color a y (ins b) ins (Node color a y b) | x < y = balance color (ins a) y b | x == y = Node color a y b | x > y = balance color a y (ins b) makeBlack :: Tree a -> Tree a makeBlack (Node _ a y b) = Node Black a y b makeBlack Empty = Empty balance :: Color -> Tree a -> a -> Tree a -> Tree a balance Black (Node Red (Node Red a x b) y c) z d = Node Red (Node Black a x b) y (Node Black c z d) balance Black (Node Red a x (Node Red b y c)) z d = Node Red (Node Black a x b) y (Node Black c z d) balance Black a x (Node Red (Node Red b y c) z d) = Node Red (Node Black a x b) y (Node Black c z d) balance Black a x (Node Red b y (Node Red c z d)) = Node Red (Node Black a x b) y (Node Black c z d) balance color a x b = Node color a x b maxTree :: Ord a => Tree a -> a maxTree (Node _ Empty n Empty) = n maxTree (Node _ _ _ t) = maxTree t toInsert :: [Int] -- toInsert = [1..1000000] toInsert = map (`mod` 100) [1..10000000] main :: IO () main = putStrLn $ show $ maxTree $ foldl' (flip insert) Empty toInsert Note that if the improvement is around 10% for "toInsert" being a monotonic sequence of integers, the improvement is much bigger (>2x for me) for a more "random" "toInsert" sequence. L.

Lorenzo Bolla

10:08 a.m.

Ops, wrong copy and paste! See correct script below: import Data.Foldable (foldl') data Color = Red | Black deriving (Show) data Tree a = Empty | Node Color (Tree a) a (Tree a) deriving (Show) insert :: Ord a => a -> Tree a -> Tree a insert x t = makeBlack (ins t) where ins Empty = Node Red Empty x Empty ins (Node color a y b) | x < y = ins a `seq` balance color (ins a) y b | x == y = Node color a y b | x > y = ins b `seq` balance color a y (ins b) makeBlack :: Tree a -> Tree a makeBlack (Node _ a y b) = Node Black a y b makeBlack Empty = Empty balance :: Color -> Tree a -> a -> Tree a -> Tree a balance Black (Node Red (Node Red a x b) y c) z d = Node Red (Node Black a x b) y (Node Black c z d) balance Black (Node Red a x (Node Red b y c)) z d = Node Red (Node Black a x b) y (Node Black c z d) balance Black a x (Node Red (Node Red b y c) z d) = Node Red (Node Black a x b) y (Node Black c z d) balance Black a x (Node Red b y (Node Red c z d)) = Node Red (Node Black a x b) y (Node Black c z d) balance color a x b = Node color a x b maxTree :: Ord a => Tree a -> a maxTree (Node _ Empty n Empty) = n maxTree (Node _ _ _ t) = maxTree t toInsert :: [Int] -- toInsert = [1..1000000] toInsert = map (`mod` 100) [1..10000000] main :: IO () main = putStrLn $ show $ maxTree $ foldl' (flip insert) Empty toInsert Sorry for the noise, L. On Wed, Mar 21, 2012 at 10:02 AM, Lorenzo Bolla wrote:

...

On Wed, Mar 21, 2012 at 9:27 AM, Heinrich Apfelmus < apfelmus@quantentunnel.de> wrote:

...
Adrien Haxaire wrote:

...
I tried with foldl'. I modified the code at several places to match the argument pattern, and now I see why flip is so useful :) The conclusion is also interesting: the productivity climbs up to 92%, while the calculation time raises to 6.3s. I guess that the choice is space or time, as often.

92% productivity seems right for me. In contrast, 20% garbage collection may be a sign that something went wrong.

I think that this is likely due to laziness: in the very end, you only

...
query the rightmost element. After a while, the program simply won't evaluate the balancing on the left side of the tree, as you're not asking it to evaluate anything there.

So, you're not necessarily comparing apples and apples here. But on the other hand, maybe that's a performance disadvantage of the C++ version. In Haskell, performance depends a lot on usage patterns.

This is very true. In fact, after some tweaking, I found that the best solution is using foldl', lazy type and force some strictness in "insert" using "seq". See below:

import Data.Foldable (foldl', foldr')

data Color = Red | Black deriving (Show)

data Tree a = Empty | Node Color (Tree a) a (Tree a) deriving (Show)

insert :: Ord a => a -> Tree a -> Tree a insert x t = makeBlack (ins t) where ins Empty = Node Red Empty x Empty -- ins (Node color a y b) | x < y = ins a `seq` balance color (ins a) y b -- | x == y = Node color a y b -- | x > y = ins b `seq` balance color a y (ins b) ins (Node color a y b) | x < y = balance color (ins a) y b | x == y = Node color a y b | x > y = balance color a y (ins b)

makeBlack :: Tree a -> Tree a makeBlack (Node _ a y b) = Node Black a y b makeBlack Empty = Empty

balance :: Color -> Tree a -> a -> Tree a -> Tree a balance Black (Node Red (Node Red a x b) y c) z d = Node Red (Node Black a x b) y (Node Black c z d) balance Black (Node Red a x (Node Red b y c)) z d = Node Red (Node Black a x b) y (Node Black c z d) balance Black a x (Node Red (Node Red b y c) z d) = Node Red (Node Black a x b) y (Node Black c z d) balance Black a x (Node Red b y (Node Red c z d)) = Node Red (Node Black a x b) y (Node Black c z d) balance color a x b = Node color a x b

maxTree :: Ord a => Tree a -> a maxTree (Node _ Empty n Empty) = n maxTree (Node _ _ _ t) = maxTree t

toInsert :: [Int] -- toInsert = [1..1000000] toInsert = map (`mod` 100) [1..10000000]

main :: IO () main = putStrLn $ show $ maxTree $ foldl' (flip insert) Empty toInsert

Note that if the improvement is around 10% for "toInsert" being a monotonic sequence of integers, the improvement is much bigger (>2x for me) for a more "random" "toInsert" sequence.

L.

Adrien Haxaire

25 Mar 25 Mar

1:16 p.m.

On Wed, Mar 21, 2012 at 10:08:06AM +0000, Lorenzo Bolla wrote:

...

toInsert :: [Int] -- toInsert = [1..1000000] toInsert = map (`mod` 100) [1..10000000]

Hello, Sorry for replying so late. Using mod 100 will just give me entries from 0 to 99, right? so this can explain the performance improvement :) Concerning the GC. Foldl' gives me a better productivity but takes more time, when foldr' uses more GC but results in faster code. In general, what are the best practices here: tolerate a use of the GC up to 20% but with good performance, or strive for the minimum GC use? Apart from the RBTree case, in the Haskell code I am writing, performance is not the primary goal, so I would go for latter. But maybe there are some common rules to measure the right tradeoff? I chose to use this worst cas scenario of inserting a sorted ascending list to require many rebalancing; it is the same list I use to compare with the C++ implementation. I ask for the maximum to see if it yields the right result. Thanks for the reminder about the pattern used in the program, and the pattern order in functions. Reorganizing them led to some improvement. I'll keep that in mind. Best regards, Adrien -- Adrien Haxaire www.adrienhaxaire.org | @adrienhaxaire

4857

Age (days ago)

4862

Last active (days ago)

List overview

Download

9 comments

3 participants

participants (3)

Adrien Haxaire
Heinrich Apfelmus
Lorenzo Bolla