Re: Feedback request: priority queues in containers

I suspect the following might be faster:
data BinomForest2 a = Empty | NonEmpty a [BinomTree2 a]
data BinomTree2 a = BinomTree2 a [BinomTree2 a]
This eliminates the "Skip" constructor, which contributes only to the nested type guarantee.
Ehehehe. This is something I'm pretty proud of inventing, because your implementation is actually significantly *slower*. The reason is essentially that I can do a lot more constructor unpacking when I have access to that much type information about the structure of the tree at each level. You also didn't quite implement it correctly: among other things, we *need* to track tree rank, at least for the roots, if it isn't encoded in the type.

Here are your data types, with everything unpacked as much as possible:

data BinomQueue a
  = Nil
  | Cons {-# UNPACK #-} !Int {-# UNPACK #-} !(BinHeap a) (BinomQueue a)
  -- equivalent to [(Int, BinHeap a)], but unrolled;
  -- we only need to track ranks in the roots

data BinomSubtree a
  = Nil'
  | Cons' {-# UNPACK #-} !(BinHeap a) (BinomSubtree a)
  -- equivalent to [BinHeap a], but unrolled

data BinHeap a = Bin a (BinomSubtree a)

I tested, and this implementation actually performs better if the spine is maintained lazily, so we'll test that version. The full implementation of your approach (that is, the basic methods: insert, extract, fromList, toAscList) is attached to the ticket here: http://hackage.haskell.org/trac/ghc/attachment/ticket/3909/QuickBinom.hs. Feel free to ghc-core it or tinker with the implementation, but I've done a fair bit of tinkering and my results haven't changed.

The benchmark is a BenchPress run heapsorting 25000 Ints, calling performGC after every run of each method. SBinomial stands for "sparse binomial heap," which is your approach. All times are in milliseconds.

With -O2, +RTS -H128m:
             min      mean    +/-sd   median     max
  Binomial:   0.000    3.440   2.204    4.000    8.001
  SBinomial: 24.001   28.562   5.600   28.001   48.003   (8.3x slower on average)

With -O2:
             min      mean    +/-sd   median     max
  Binomial:   4.000    8.801   2.606    8.001   24.002
  SBinomial: 32.002   41.763   8.007   42.003   64.004   (4.7x slower on average)

Without optimization, with +RTS -H128m:
             min      mean    +/-sd   median     max
  Binomial:   4.001   10.041   3.140    8.001   20.002
  SBinomial: 64.004   76.805   8.790   76.005  100.006   (7.6x slower on average)

Without optimization:
             min      mean    +/-sd   median     max
  Binomial:  12.000   19.761   5.328   16.001   40.002
  SBinomial: 72.004   90.126  11.906   88.006  120.008   (4.6x slower on average)

Conclusion: my implementation is *massively faster* -- by a factor of at least 4.5x. (I spent the last half hour trying to optimize SBinomial; it really is that big a difference, and it's not going to change.)

Here's why I think that's the case. Even though my representation has the Skip constructor, how many Skip constructors are there in total? On average, half the forest is Skips, so there are only about (1/2) log n of them. The sparse representation, on the other hand, can't do anywhere near as much unpacking; in particular, the set of children of each node is not a single-constructor type. That's an O(n) increase in allocation, all for an O(log n) shortening of the spine -- which is why it's a bad trade. Most of the work is done allocating nodes in the *trees*, not along the spine among the roots, and there the type-system-exploiting approach does much less allocation, because it can do a lot more constructor unpacking.
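For reference, the nested-type encoding I keep referring to is shaped roughly like this (a sketch of the idea only; the code attached to the ticket differs in its details):

data BinomForest rk a
  = Nil
  | Skip (BinomForest (Succ rk) a)
  | Cons {-# UNPACK #-} !(BinomTree rk a) (BinomForest (Succ rk) a)

data BinomTree rk a = BinomTree a (rk a)

-- A tree's rank is encoded in its type: a rank-(k+1) tree carries one
-- rank-k tree plus the children of a rank-k tree.
data Succ rk a = Succ {-# UNPACK #-} !(BinomTree rk a) (rk a)

data Zero a = Zero

Because Succ rk (and Zero) are single-constructor types, the first child of every node can be unpacked directly into its parent -- exactly the opportunity that the [BinomTree2 a] children list gives up.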
Let's test this hypothesis. My type-magical implementation, -O2, +RTS -sstderr (no -H128m this time):

  3,050,272,052 bytes allocated in the heap
    240,340,552 bytes copied during GC
      1,087,992 bytes maximum residency (201 sample(s))
         53,136 bytes maximum slop
              3 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  5703 collections,  0 parallel,  0.48s,  0.53s elapsed
  Generation 1:   201 collections,  0 parallel,  0.22s,  0.24s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.52s  (  4.74s elapsed)
  GC    time    0.70s  (  0.77s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    5.23s  (  5.51s elapsed)

  %GC time      13.5%  (13.9% elapsed)
  Alloc rate    674,200,547 bytes per MUT second
  Productivity  86.5% of total user, 82.2% of total elapsed

The sparse binomial forest representation, same options:

  5,612,965,672 bytes allocated in the heap
    563,671,500 bytes copied during GC
      1,967,576 bytes maximum residency (202 sample(s))
        107,212 bytes maximum slop
              5 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0: 10602 collections,  0 parallel,  1.60s,  1.67s elapsed
  Generation 1:   202 collections,  0 parallel,  0.28s,  0.30s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.52s  (  6.51s elapsed)
  GC    time    1.87s  (  1.97s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    8.40s  (  8.48s elapsed)

  %GC time      22.3%  (23.2% elapsed)
  Alloc rate    860,302,018 bytes per MUT second
  Productivity  77.7% of total user, 76.9% of total elapsed

Furthermore, if we actually profile the sparse forest representation, we confirm that a *huge* amount of allocation is being done during tree melding.

Conclusion: the type-magical binomial heap implementation does a lot less allocating, and is faster even ignoring GC time.

I encourage you to test it out for yourself, but I think you'll find similar results. =D

Louis

> I encourage you to test it out for yourself, but I think you'll find similar results. =D
I don't spend a lot of time thinking about non-asymptotic optimizations, so I may have made an error or ten, but I found vastly different results for heap sort.

I tested heap sorting, mostly following the description of your example:

1. Generate 25,000 random Ints
2. System.Mem.performGC
3. Start timer
4. Insert the Ints into the PQ
5. Remove them one at a time until none are left
6. Print the last one removed
7. Stop timer

I compiled with GHC 6.10.1 and -O2. I tested four implementations of binomial PQs:

A: Brodal-Okasaki binomial PQs, modified for O(1) findMin (the first implementation in the paper I linked, transcribed by me by hand). This file contains no strictness or unpacking annotations or optimizations.

B: Wasserman PQs, using your unpacking but with the functions written by me (just so I could make a small file that I could be sure I mostly understood). This is clearly different from D, your proposal, but I think it uses the same algorithms (not the same functions) and the same data structures (down to unpacking and strictness).

C: Simple binomial queues without the O(1) findMin modification, as written by you in the last attachment on ticket 3909. This implementation uses some unpacking and strictness annotations, and uses a different deleteMin algorithm than Brodal & Okasaki's. You called this implementation "sparse" to distinguish it from your nested-types proposal.

D: Wasserman PQs, using your latest proposed patch from ticket 3909.

Binomial priority queue timings (in milliseconds, averaged over 20 runs):

  A: Simple Brodal-Okasaki list implementation:              199.3697
  B: Wasserman nested type implementation, simple functions: 235.0143
  C: Wasserman's 'Quick' list implementation:                179.7227
  D: Wasserman's proposed nested type implementation:        163.47515

D is still the fastest. However, A is a good bit faster than B. Is it possible that C would show a similar advantage over D if it were optimized as carefully as D?

Also, why do my benchmarks show only slight differences between the implementations when yours show massive ones? It is possible I am making a noob benchmarking error.

Attached is a zip file that should contain everything necessary to replicate my tests.
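In outline, the timing harness is shaped like this (just a sketch -- the placeholder queue below is a sorted list, and the real tests swap in implementations A-D):

import Control.Exception (evaluate)
import qualified Data.List as List
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.Mem (performGC)
import System.Random (mkStdGen, randoms)

-- Placeholder queue: a sorted list standing in for a real priority queue.
-- Replace these three definitions with the implementation under test.
type PQ = [Int]

pqEmpty :: PQ
pqEmpty = []

pqInsert :: Int -> PQ -> PQ
pqInsert = List.insert

pqMinView :: PQ -> Maybe (Int, PQ)
pqMinView []     = Nothing
pqMinView (x:xs) = Just (x, xs)

main :: IO ()
main = do
  let xs = take 25000 (randoms (mkStdGen 0)) :: [Int]   -- step 1: 25,000 random Ints
  _ <- evaluate (sum xs)        -- force the input outside the timed region
  performGC                     -- step 2
  t0 <- getCurrentTime          -- step 3: start timer
  let q0 = List.foldl' (flip pqInsert) pqEmpty xs       -- step 4: insert into the PQ
      drain x q = case pqMinView q of                   -- step 5: remove until empty
        Nothing      -> x
        Just (y, q') -> drain y q'
  case pqMinView q0 of
    Nothing      -> putStrLn "queue was empty"
    Just (x, q') -> print (drain x q')                  -- step 6: print the last one removed
  t1 <- getCurrentTime          -- step 7: stop timer
  print (diffUTCTime t1 t0)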

I should also note that these timings are quite temperamental on my computer; using profiling or removing the performGC calls can make C faster than D and B faster than A.

I really recommend using Criterion to get rid of some of the many
problems associated with benchmarking.
http://hackage.haskell.org/package/criterion
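Something along these lines is enough to get started (just a sketch; heapSort here is a stand-in for whichever priority-queue sort you're measuring):

import Criterion.Main
import qualified Data.List as List
import System.Random (mkStdGen, randoms)

-- Stand-in for the priority-queue heapsort being measured; replace with
-- the implementation under test.
heapSort :: [Int] -> [Int]
heapSort = List.sort

main :: IO ()
main = do
  let xs = take 25000 (randoms (mkStdGen 42)) :: [Int]
  defaultMain [ bench "heapsort 25000 Ints" (nf heapSort xs) ]

Criterion estimates clock overhead and flags outlier runs, which should help with run-to-run variation.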
Cheers,
Johan

I'd concur, but I haven't been able to install Criterion yet. =(
Let me rewrite things with BenchPress, and see what happens.
I should add, though, that I think C *is* as optimized as D. In particular,
some of the tricks that worked with D only worked because of the way the
spines were arranged. (For instance, unrolling the children incrementally
was a trick that really worked because the children and the trees in the
spine were lined up, which meant that it needed the Skip constructor.)
Simon: I wasn't sure what to do there, because e.g. Data.Map, I think, is
strict in its spine. I prefer the lazy way, though, so I certainly don't mind
keeping it lazy =)
Louis Wasserman
wasserman.louis@gmail.com
http://profiles.google.com/wasserman.louis

On Wed, Mar 17, 2010 at 3:44 PM, Louis Wasserman wrote:
> I'd concur, but I haven't been able to install Criterion yet. =(
> Let me rewrite things with BenchPress, and see what happens.
As the author of BenchPress, I'd encourage you to use Criterion: BenchPress was written to benchmark medium-sized I/O actions (such as HTTP requests), not (small) pure computations.
> I should add, though, that I think C *is* as optimized as D. In particular, some of the tricks that worked with D only worked because of the way the spines were arranged. (For instance, unrolling the children incrementally was a trick that really worked because the children and the trees in the spine were lined up, which meant that it needed the Skip constructor.)
> Simon: I wasn't sure what to do there, because e.g. Data.Map, I think, is strict in its spine. I prefer the lazy way, though, so I certainly don't mind keeping it lazy =)
Simon, we really need a good concurrent benchmark for this stuff as most (all) of the data types in containers are strict in the spine.

Cheers,
Johan

On 17/03/2010 16:48, Johan Tibell wrote:
> Simon, we really need a good concurrent benchmark for this stuff as most (all) of the data types in containers are strict in the spine.
I know. I'll try to get to this sometime. In the meantime just optimise for sequential performance, and don't add strictness just for the sake of it.

Cheers,
Simon

On 17/03/2010 00:17, Louis Wasserman wrote:
> I tested, and this implementation actually performs better if the spine is maintained lazily, so we'll test that version.
May I request that, unless there's a significant speedup from using a strict spine, you use a lazy spine where possible? The reason is that lazy data structures work much better in a parallel setting: a strict spine is a coarse-grained lock on the whole operation, whereas a lazy spine corresponds to a fine-grained locking strategy.

Apart from this, as you were ;)

Cheers,
Simon
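Concretely, the difference being discussed is just whether the spine's fields are strict; an illustrative sketch (not code from any of the implementations above):

data Tree a = Node a [Tree a]

-- Strict spine: the outermost cell cannot be built until the entire rest of
-- the spine has been evaluated, so an operation such as meld produces (and
-- demands) the whole new spine at once.
data StrictSpine a = SNil | SCons !(Tree a) !(StrictSpine a)

-- Lazy spine: each tail is a thunk, so a consumer forces only the prefix it
-- actually inspects, and independent computations can evaluate different
-- parts of the structure.
data LazySpine a = LNil | LCons (Tree a) (LazySpine a)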
participants (4):
- Jim Apple
- Johan Tibell
- Louis Wasserman
- Simon Marlow