These times are measured in milliseconds. Conclusion: my implementation is massively faster -- by at least 4.5x. (I spent the last half hour trying to optimize SBinomial -- it really is that big a difference, and it's not going to change.)
Here's why I think that's the case. Even though my representation carries the extra Skip constructor, how many Skip constructors are there in total? On average, half the spine is made up of Skips, so that's only about (1/2) log n Skip values in the entire forest.
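To make that concrete, here is a minimal sketch of the two shapes being compared. The declarations are simplified and the names (BinomForest, SBinomTree, and so on) are mine, for illustration -- this isn't the actual code from either implementation:

    module BinomSketch where

    -- Type-indexed forest: a tree's rank is encoded in its type, so each
    -- node's children form a single-constructor type that GHC can unpack.
    data BinomForest rk a
        = Nil
        | Skip (BinomForest (Succ rk) a)
        | Cons {-# UNPACK #-} !(BinomTree rk a) (BinomForest (Succ rk) a)

    data BinomTree rk a = BinomTree a (rk a)   -- root element, rank-many children
    data Succ rk a = Succ {-# UNPACK #-} !(BinomTree rk a) (rk a)
    data Zero a = Zero

    type BinomHeap a = BinomForest Zero a

    -- Sparse forest: the rank is ordinary data and the children are a plain
    -- list, which is a multi-constructor type and therefore can never be
    -- unpacked into its parent node.
    data SBinomTree a = SBinomTree !Int a [SBinomTree a]   -- rank, root, children
    type SBinomHeap a = [SBinomTree a]

With the UNPACK pragmas, every Cons and Succ absorbs its tree's root field directly, while in the sparse version every child costs a list cell on top of its own node.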
But the sparse binomial tree representation can't do anywhere near as much unpacking: in particular, each node's collection of children is not a single-constructor type, so GHC can't unpack it into the node. That's an O(n) increase in allocation, all to buy an O(log n) shortening of the spine -- a bad trade. Most of the work is done allocating nodes in the *trees,* not along the spine among the roots, and that is exactly where the crazy, type-system-exploiting approach allocates much less, because it can do far more constructor unpacking. Let's test this hypothesis. First, the runtime statistics (+RTS -s) for my implementation:
  3,050,272,052 bytes allocated in the heap
    240,340,552 bytes copied during GC
      1,087,992 bytes maximum residency (201 sample(s))
         53,136 bytes maximum slop
              3 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  5703 collections,     0 parallel,  0.48s,  0.53s elapsed
  Generation 1:   201 collections,     0 parallel,  0.22s,  0.24s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.52s  (  4.74s elapsed)
  GC    time    0.70s  (  0.77s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    5.23s  (  5.51s elapsed)

  %GC time      13.5%  (13.9% elapsed)

  Alloc rate    674,200,547 bytes per MUT second

  Productivity  86.5% of total user, 82.2% of total elapsed
The sparse binomial forest representation, same options:
  5,612,965,672 bytes allocated in the heap
    563,671,500 bytes copied during GC
      1,967,576 bytes maximum residency (202 sample(s))
        107,212 bytes maximum slop
              5 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0: 10602 collections,     0 parallel,  1.60s,  1.67s elapsed
  Generation 1:   202 collections,     0 parallel,  0.28s,  0.30s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.52s  (  6.51s elapsed)
  GC    time    1.87s  (  1.97s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    8.40s  (  8.48s elapsed)

  %GC time      22.3%  (23.2% elapsed)

  Alloc rate    860,302,018 bytes per MUT second

  Productivity  77.7% of total user, 76.9% of total elapsed
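Side by side: 5,612,965,672 bytes allocated versus 3,050,272,052 (about 1.8x), 1.87s of GC versus 0.70s (about 2.7x), and 6.52s of mutator time versus 4.52s. The sparse version loses even before the garbage collector gets involved.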
Furthermore, if we actually profile the sparse forest representation, we confirm that a huge amount of allocation is being done during tree melding.
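(If you want to reproduce the profile, something along these lines should do it -- I'm calling the driver Bench.hs here, but that name is just a placeholder:

    $ ghc --make -O2 -prof -auto-all Bench.hs
    $ ./Bench +RTS -p -sstderr

-p writes the cost-centre profile to Bench.prof, and -sstderr prints the same kind of GC summary quoted above.)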
Conclusion: the type-magical binomial heap implementation does a lot less allocating, and is faster even ignoring GC time. I encourage you to test it out for yourself, but I think you'll find similar results. =D
Louis