Re: Feedback request: priority queues in containers

Why not use Okasaki & Brodal's "Optimal Purely Functional Priority Queues"? They offer worst-case:

* O(1) union, findMin, and insert
* O(lg n) deleteMin

The primary reason that we don't use this implementation is, quoting from the paper you yourself linked to:
Our data structure is reasonably efficient in practice; however, there are several competing data structures that, although not asymptotically optimal, are somewhat faster than ours in practice.
Hence, our work is primarily of theoretical interest.
The current implementation is considerably faster than Okasaki's implementation in practice, based on our benchmarks. Furthermore, the asymptotics are really pretty good, and the constant factors appear to be relatively low.

I wrote a pretty efficient skew binomial heap implementation -- the first step of Okasaki's approach -- and found it was already slower than the optimized binomial heap. I'm not sure whether or not I uploaded that benchmark, though. I'll do that at some point today, just to keep everyone happy.

Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis
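(For readers following along: the "first step of Okasaki's approach" is the skew binomial heap, whose selling point is a worst-case O(1) insert. Below is a minimal, hedged sketch of just that insert, written here purely for illustration; it is not the benchmarked code and the names are my own.)

    -- Skew binomial trees: rank, root, "skew" elements absorbed by skew links,
    -- and children.  A heap is a list of trees in increasing rank order.
    data Tree a = Node Int a [a] [Tree a]
    type Heap a = [Tree a]

    rank :: Tree a -> Int
    rank (Node r _ _ _) = r

    -- Ordinary link of two equal-rank trees: the larger root becomes a child.
    link :: Ord a => Tree a -> Tree a -> Tree a
    link t1@(Node r x1 xs1 c1) t2@(Node _ x2 xs2 c2)
      | x1 <= x2  = Node (r + 1) x1 xs1 (t2 : c1)
      | otherwise = Node (r + 1) x2 xs2 (t1 : c2)

    -- Skew link: link two equal-rank trees and absorb one extra element.
    skewLink :: Ord a => a -> Tree a -> Tree a -> Tree a
    skewLink x t1 t2 =
      case link t1 t2 of
        Node r y ys c
          | x <= y    -> Node r x (y : ys) c
          | otherwise -> Node r y (x : ys) c

    -- Worst-case O(1) insert: at most one skew link, never a carrying cascade.
    insert :: Ord a => a -> Heap a -> Heap a
    insert x (t1 : t2 : rest)
      | rank t1 == rank t2 = skewLink x t1 t2 : rest
    insert x ts = Node 0 x [] [] : ts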

I'm not sure about some of that. Imperative queues sometimes offer
O(1) decreaseKey and O(lg n) delete by storing pointers to elements in
a priority queue. The usual purely functional PQs can only offer O(n)
decreaseKey and O(n) delete. Purely functional PSQs can offer O(lg n)
decreaseKey and O(lg n) delete.
Minimum spanning tree is a common application for PQs that makes good
use of decreaseKey.
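(Editorial aside: to make the key-addressability point above concrete, here is a toy, hedged sketch built from a Data.Map plus a Data.Set of (priority, key) pairs -- it is not the PSQueue package and not a serious PSQ implementation. Every operation below is O(lg n), which is already enough to support decreaseKey and delete by key; a real priority search queue can do better than this on findMin.)

    import qualified Data.Map as M
    import qualified Data.Set as S

    -- A key-to-priority map plus a set ordered by (priority, key).
    data ToyPSQ k p = ToyPSQ (M.Map k p) (S.Set (p, k))

    empty :: ToyPSQ k p
    empty = ToyPSQ M.empty S.empty

    -- insert assumes the key is not already present (use decreaseKey to change it)
    insert :: (Ord k, Ord p) => k -> p -> ToyPSQ k p -> ToyPSQ k p
    insert k p (ToyPSQ m s) = ToyPSQ (M.insert k p m) (S.insert (p, k) s)

    -- delete by key: look up the old priority, then remove from both structures
    delete :: (Ord k, Ord p) => k -> ToyPSQ k p -> ToyPSQ k p
    delete k q@(ToyPSQ m s) = case M.lookup k m of
      Nothing -> q
      Just p  -> ToyPSQ (M.delete k m) (S.delete (p, k) s)

    -- decreaseKey (really: change the priority of an existing key)
    decreaseKey :: (Ord k, Ord p) => k -> p -> ToyPSQ k p -> ToyPSQ k p
    decreaseKey k p = insert k p . delete k

    findMin :: ToyPSQ k p -> Maybe (k, p)
    findMin (ToyPSQ _ s)
      | S.null s  = Nothing
      | otherwise = let (p, k) = S.findMin s in Just (k, p)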
That example did occur to me, but I feel okay about following the examples of
Java and the C++ STL, which offer priority queue implementations that don't
support decreaseKey.
You might be able to convince me that we should offer PSQueues in addition
to PQueues, but I don't think they should be merged, for the following
reason. Using the PSQueue package, which is essentially a copy of Ralf
Hinze's original implementation, yields the following benchmark for
heapsorting 25000 Ints:
             min     mean    +/-sd   median   max
Binomial:    0.000   3.240   2.180    4.000   8.001
PSQ:         8.001  13.241   2.882   12.001  24.002
I'm really not okay with that kind of performance loss for added
functionality that not everyone needs.
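(For context, "heapsorting N Ints" here just means: build a heap from the input and repeatedly extract the minimum. The sketch below shows the shape of such a test using a throwaway pairing heap, purely for illustration -- the numbers above were measured against the proposed binomial-heap code and the PSQueue package, not against this toy.)

    import Data.List (unfoldr, foldl')

    -- A minimal pairing heap, for illustration only.
    data Heap a = Empty | Node a [Heap a]

    insert :: Ord a => a -> Heap a -> Heap a
    insert x = merge (Node x [])

    merge :: Ord a => Heap a -> Heap a -> Heap a
    merge Empty h = h
    merge h Empty = h
    merge h1@(Node x hs1) h2@(Node y hs2)
      | x <= y    = Node x (h2 : hs1)
      | otherwise = Node y (h1 : hs2)

    minView :: Ord a => Heap a -> Maybe (a, Heap a)
    minView Empty       = Nothing
    minView (Node x hs) = Just (x, mergePairs hs)
      where
        mergePairs (a:b:rest) = merge (merge a b) (mergePairs rest)
        mergePairs [a]        = a
        mergePairs []         = Empty

    -- heapsort: build the heap, then pull out the minimum until it is empty
    heapSort :: Ord a => [a] -> [a]
    heapSort = unfoldr minView . foldl' (flip insert) Empty

    main :: IO ()
    main = print (heapSort (reverse [1 .. 25000 :: Int]) == [1 .. 25000])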
Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis

I wrote a pretty efficient skew binomial heap implementation -- the first step of Okasaki's approach -- and found it was already slower than the optimized binomial heap. I'm not sure whether or not I uploaded that benchmark, though. I'll do that at some point today, just to keep everyone happy.
The skew binomial heaps you implemented should only be asymptotically faster than the usual binomial heaps in one special case: comparing a binomial heap in a state that would cause an \Omega(lg n) cascade on insert against the worst-case O(1) insert of skew binomial heaps.

I think it would also be worth comparing binomial heap merge against Brodal-Okasaki heap merge.

Finally, if speed is the ultimate goal, I suspect the clever nested-type approach to guaranteeing binomial tree shape slows things down a bit. For reference, the type in the latest patch is:

    data BinomForest rk a
      = Nil
      | Skip !(BinomForest (Succ rk) a)
      | Cons {-# UNPACK #-} !(BinomTree rk a) !(BinomForest (Succ rk) a)

    data BinomTree rk a = BinomTree a (rk a)

    data Succ rk a = Succ {-# UNPACK #-} !(BinomTree rk a) (rk a)

    data Zero a = Zero

I suspect the following might be faster:

    data BinomForest2 a = Empty | NonEmpty a [BinomTree2 a]

    data BinomTree2 a = BinomTree2 a [BinomTree2 a]

This eliminates the "Skip" constructor, which contributes only to the nested type guarantee.
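(Editorial aside: to make the cascade-on-insert point concrete, here is a minimal sketch of insertion with carrying on the nested type, with the UNPACK pragmas and strictness annotations dropped. The names and details are illustrative only, not the code in the containers patch.)

    data BinomForest rk a
      = Nil
      | Skip (BinomForest (Succ rk) a)
      | Cons (BinomTree rk a) (BinomForest (Succ rk) a)

    data BinomTree rk a = BinomTree a (rk a)
    data Succ rk a = Succ (BinomTree rk a) (rk a)
    data Zero a = Zero

    -- Link two trees of equal rank; the larger root becomes the newest child.
    link :: Ord a => BinomTree rk a -> BinomTree rk a -> BinomTree (Succ rk) a
    link t1@(BinomTree x1 c1) t2@(BinomTree x2 c2)
      | x1 <= x2  = BinomTree x1 (Succ t2 c1)
      | otherwise = BinomTree x2 (Succ t1 c2)

    -- Carry a tree of rank rk into a forest whose first slot has rank rk.
    -- The Cons case is the binary-increment cascade: in the worst case it
    -- links all the way up the spine -- the Omega(lg n) insert mentioned above.
    carry :: Ord a => BinomTree rk a -> BinomForest rk a -> BinomForest rk a
    carry t Nil         = Cons t Nil
    carry t (Skip f)    = Cons t f
    carry t (Cons t' f) = Skip (carry (link t t') f)

    insert :: Ord a => a -> BinomForest Zero a -> BinomForest Zero a
    insert x = carry (BinomTree x Zero)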

I suspect the following might be faster:
data BinomForest2 a = Empty | NonEmpty a [BinomTree2 a]
data BinomTree2 a = BinomTree2 a [BinomTree2 a]
This eliminates the "Skip" constructor, which contributes only to the nested type guarantee.
Ehehehe. This is something I'm pretty proud of inventing, because your implementation is actually significantly *slower*. The reason is essentially that I can do a lot more constructor unpacking when I have access to that much type information about the structure of the tree at each level.

You didn't quite implement it correctly, because among other things, we *need* to track tree rank, at least for the roots, if it's not encoded in the type. Here are your data types, with everything unpacked as much as possible:

    data BinomQueue a
      = Nil
      | Cons {-# UNPACK #-} !Int {-# UNPACK #-} !(BinHeap a) (BinomQueue a)
      -- equivalent to [(Int, BinHeap a)], but unrolled;
      -- we only need to track ranks in the roots

    data BinomSubtree a
      = Nil'
      | Cons' {-# UNPACK #-} !(BinHeap a) (BinomSubtree a)
      -- equivalent to [BinHeap a], but unrolled

    data BinHeap a = Bin a (BinomSubtree a)

I tested, and this implementation actually performs better if the spine is maintained lazily, so we'll test that version. The full implementation (that is, the basic methods: insert, extract, fromList, toAscList) of your approach was attached to the ticket here: http://hackage.haskell.org/trac/ghc/attachment/ticket/3909/QuickBinom.hs. Feel free to ghc-core it, or tinker with the implementation, but I've done a fair bit of tinkering and my results haven't changed.

Running a benchpress test on heapsorting 25000 Ints, calling performGC after every run of each method. SBinomial stands for "sparse binomial heap," which is your approach.

With -O2, +RTS -H128m:
                  min      mean     +/-sd    median    max
    Binomial:     0.000    3.440    2.204     4.000    8.001
    SBinomial:   24.001   28.562    5.600    28.001   48.003
    (ratio: 8.3x slower on average)

With -O2:
                  min      mean     +/-sd    median    max
    Binomial:     4.000    8.801    2.606     8.001   24.002
    SBinomial:   32.002   41.763    8.007    42.003   64.004
    (ratio: 4.7x slower on average)

Without optimization, with +RTS -H128m:
                  min      mean     +/-sd    median    max
    Binomial:     4.001   10.041    3.140     8.001   20.002
    SBinomial:   64.004   76.805    8.790    76.005  100.006
    (ratio: 7.6x slower on average)

Without optimization:
                  min      mean     +/-sd    median    max
    Binomial:    12.000   19.761    5.328    16.001   40.002
    SBinomial:   72.004   90.126   11.906    88.006  120.008
    (ratio: 4.6x slower on average)

These times are measured in milliseconds.

Conclusion: my implementation is *massively faster*, by a factor of at least 4.5x. (I spent the last half hour trying to optimize SBinomial -- it really is that big a difference, and it's not going to change.)

Here's why I think that's the case: even though we might have the Skip constructor, how many Skip constructors are there, total? On average, half the forest is made up of Skips, so there are about 1/2 log n Skip values there. But the sparse binomial tree representation can't do anywhere near as much unpacking; in particular, the set of children of each node is not a single-constructor type. That's an O(n) increase in allocation, all for an O(log n) shortening of the spine. That's why it's a bad plan: most of the work is being done in allocations of nodes in the *trees*, rather than along the spine among the roots. In this area, the crazy, type-system-exploiting approach actually does much less allocation, because it can do a lot more constructor unpacking.
Let's test this hypothesis. My type-magical implementation, -O2, +RTS -sstderr (no -H128m this time):

     3,050,272,052 bytes allocated in the heap
       240,340,552 bytes copied during GC
         1,087,992 bytes maximum residency (201 sample(s))
            53,136 bytes maximum slop
                 3 MB total memory in use (0 MB lost due to fragmentation)

    Generation 0:  5703 collections,     0 parallel,  0.48s,  0.53s elapsed
    Generation 1:   201 collections,     0 parallel,  0.22s,  0.24s elapsed

    INIT   time    0.00s  (  0.00s elapsed)
    MUT    time    4.52s  (  4.74s elapsed)
    GC     time    0.70s  (  0.77s elapsed)
    EXIT   time    0.00s  (  0.00s elapsed)
    Total  time    5.23s  (  5.51s elapsed)

    %GC time       13.5%  (13.9% elapsed)
    Alloc rate     674,200,547 bytes per MUT second
    Productivity   86.5% of total user, 82.2% of total elapsed

The sparse binomial forest representation, same options:

     5,612,965,672 bytes allocated in the heap
       563,671,500 bytes copied during GC
         1,967,576 bytes maximum residency (202 sample(s))
           107,212 bytes maximum slop
                 5 MB total memory in use (1 MB lost due to fragmentation)

    Generation 0: 10602 collections,     0 parallel,  1.60s,  1.67s elapsed
    Generation 1:   202 collections,     0 parallel,  0.28s,  0.30s elapsed

    INIT   time    0.00s  (  0.00s elapsed)
    MUT    time    6.52s  (  6.51s elapsed)
    GC     time    1.87s  (  1.97s elapsed)
    EXIT   time    0.00s  (  0.00s elapsed)
    Total  time    8.40s  (  8.48s elapsed)

    %GC time       22.3%  (23.2% elapsed)
    Alloc rate     860,302,018 bytes per MUT second
    Productivity   77.7% of total user, 76.9% of total elapsed

Furthermore, if we actually profile the sparse forest representation, we confirm that a *huge* amount of allocation is being done during tree melding.

Conclusion: the type-magical binomial heap implementation does a lot less allocating, and is faster even ignoring GC time. I encourage you to test it out for yourself, but I think you'll find similar results. =D

Louis

On 17/03/2010 00:17, Louis Wasserman wrote:
I tested, and this implementation actually performs better if the spine is maintained lazily, so we'll test that version.
May I request that, unless there's a significant speedup from using a strict spine, you use a lazy spine where possible. The reason is that lazy data structures work much better in a parallel setting: a strict spine is a coarse-grained lock on the whole operation, whereas a lazy spine corresponds to a fine-grained locking strategy.

Apart from this, as you were ;)

Cheers,
Simon
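(Editorial aside: a minimal illustration of the distinction, with made-up names. With a strict spine, the whole spine is evaluated as soon as the structure is built, so it is effectively taken as one unit; with a lazy spine, each tail is a separate thunk that can be demanded -- and, in a parallel setting, contended for -- independently.)

    -- strict spine: the strict tail means building a node materializes
    -- the entire rest of the spine
    data StrictSpine a = SNil | SCons a !(StrictSpine a)

    -- lazy spine: each tail is forced only when it is demanded
    data LazySpine a = LNil | LCons a (LazySpine a)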
Participants (3):
- Jim Apple
- Louis Wasserman
- Simon Marlow