
I suspect the following might be faster:
data BinomForest2 a = Empty | NonEmpty a [BinomTree2 a]
data BinomTree2 a = BinomTree2 a [BinomTree2 a]
This eliminates the "Skip" constructor, which contributes only to the nested type guarantee.
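(For context: the nested-type representation that the Skip constructor belongs to looks roughly like the following. This is a sketch of the usual rank-indexed encoding -- the type and constructor names here are assumed, and may differ from the actual patch.)

    data Zero a = Zero
    data Succ rk a = Succ {-# UNPACK #-} !(BinomTree rk a) (rk a)

    -- a tree of rank k stores its root and children of ranks
    -- k-1 down to 0, encoded by the rk parameter
    data BinomTree rk a = BinomTree a (rk a)

    -- the forest at rank k either ends, skips rank k, or carries a
    -- tree of rank k; the nested parameter makes ranks increase by
    -- construction, so no rank field is needed anywhere
    data BinomForest rk a
        = Nil
        | Skip (BinomForest (Succ rk) a)
        | Cons {-# UNPACK #-} !(BinomTree rk a) (BinomForest (Succ rk) a)

    type BinomHeap a = BinomForest Zero a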
Ehehehe. This is something I'm pretty proud of inventing, because your implementation is actually significantly *slower*. The reason is essentially that I can do a lot more constructor unpacking when I have access to that much type information about the structure of the tree at each level. You also didn't quite implement it correctly: among other things, we *need* to track tree rank, at least for the roots, if it's not encoded in the type. Here are your data types, with everything unpacked as much as possible:

    data BinomQueue a
        = Nil
        | Cons {-# UNPACK #-} !Int {-# UNPACK #-} !(BinHeap a) (BinomQueue a)
        -- equivalent to [(Int, BinHeap a)], but unrolled;
        -- we only need to track ranks in the roots

    data BinomSubtree a
        = Nil'
        | Cons' {-# UNPACK #-} !(BinHeap a) (BinomSubtree a)
        -- equivalent to [BinHeap a], but unrolled

    data BinHeap a = Bin a (BinomSubtree a)

I tested, and this implementation actually performs better if the spine is maintained lazily, so we'll test that version. The full implementation of your approach (that is, the basic methods: insert, extract, fromList, toAscList) is attached to the ticket here: http://hackage.haskell.org/trac/ghc/attachment/ticket/3909/QuickBinom.hs. Feel free to ghc-core it or tinker with the implementation, but I've done a fair bit of tinkering already and my results haven't changed.

I ran a benchpress test on heapsorting 25000 Ints, calling performGC after every run of each method. "SBinomial" stands for "sparse binomial heap," which is your approach. All times are in milliseconds.

With -O2, +RTS -H128m:
                   min     mean    +/-sd   median      max
    Binomial:    0.000    3.440    2.204    4.000    8.001
    SBinomial:  24.001   28.562    5.600   28.001   48.003    (8.3x slower on average)

With -O2:
                   min     mean    +/-sd   median      max
    Binomial:    4.000    8.801    2.606    8.001   24.002
    SBinomial:  32.002   41.763    8.007   42.003   64.004    (4.7x slower on average)

Without optimization, with +RTS -H128m:
                   min     mean    +/-sd   median      max
    Binomial:    4.001   10.041    3.140    8.001   20.002
    SBinomial:  64.004   76.805    8.790   76.005  100.006    (7.6x slower on average)

Without optimization:
                   min     mean    +/-sd   median      max
    Binomial:   12.000   19.761    5.328   16.001   40.002
    SBinomial:  72.004   90.126   11.906   88.006  120.008    (4.6x slower on average)

Conclusion: my implementation is *massively faster*, by a factor of at least 4.5x. (I spent the last half hour trying to optimize SBinomial -- it really is that big a difference, and it's not going to change.)

Here's why I think that's the case. Even though my version has the Skip constructor, how many Skip constructors are there, total? On average, half the forest is made up of Skips, so there are about (1/2) log n Skip values in the spine. The sparse binomial tree representation, on the other hand, can't do anywhere near as much unpacking; in particular, the set of children of each node is not a single-constructor type. That's an O(n) increase in allocation, all for an O(log n) shortening of the spine, which is why it's a bad plan. Most of the work happens in allocating nodes in the *trees*, rather than along the spine among the roots, and in that area the crazy, type-system-exploiting approach does much less allocation, because it can do a lot more constructor unpacking.
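To make the allocation point concrete, here is roughly what linking two equal-rank trees (min-ordered) looks like in each representation. This is a sketch against the types shown above -- linkSparse uses the quoted BinomTree2, and link uses the rank-indexed encoding sketched earlier -- so neither is the verbatim code from the ticket:

    -- Sparse version: the losing root stays a boxed BinomTree2 behind
    -- a fresh (:) cell, since [BinomTree2 a] has two constructors and
    -- boxed elements that GHC cannot unpack.
    linkSparse :: Ord a => BinomTree2 a -> BinomTree2 a -> BinomTree2 a
    linkSparse t1@(BinomTree2 x1 cs1) t2@(BinomTree2 x2 cs2)
        | x1 <= x2  = BinomTree2 x1 (t2 : cs1)
        | otherwise = BinomTree2 x2 (t1 : cs2)

    -- Nested version: Succ's tree field is unpacked, so the losing
    -- root is stored flat in the new child cell -- one less indirection
    -- per link, and links happen O(n) times over a heapsort.
    link :: Ord a => BinomTree rk a -> BinomTree rk a -> BinomTree (Succ rk) a
    link t1@(BinomTree x1 ts1) t2@(BinomTree x2 ts2)
        | x1 <= x2  = BinomTree x1 (Succ t2 ts1)
        | otherwise = BinomTree x2 (Succ t1 ts2)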
Let's test this hypothesis. My type-magical implementation, -O2, +RTS -sstderr (no -H128m this time):

    3,050,272,052 bytes allocated in the heap
      240,340,552 bytes copied during GC
        1,087,992 bytes maximum residency (201 sample(s))
           53,136 bytes maximum slop
                3 MB total memory in use (0 MB lost due to fragmentation)

    Generation 0:  5703 collections,     0 parallel,  0.48s,  0.53s elapsed
    Generation 1:   201 collections,     0 parallel,  0.22s,  0.24s elapsed

    INIT  time    0.00s  (  0.00s elapsed)
    MUT   time    4.52s  (  4.74s elapsed)
    GC    time    0.70s  (  0.77s elapsed)
    EXIT  time    0.00s  (  0.00s elapsed)
    Total time    5.23s  (  5.51s elapsed)

    %GC time      13.5%  (13.9% elapsed)

    Alloc rate    674,200,547 bytes per MUT second

    Productivity  86.5% of total user, 82.2% of total elapsed

The sparse binomial forest representation, same options:

    5,612,965,672 bytes allocated in the heap
      563,671,500 bytes copied during GC
        1,967,576 bytes maximum residency (202 sample(s))
          107,212 bytes maximum slop
                5 MB total memory in use (1 MB lost due to fragmentation)

    Generation 0: 10602 collections,     0 parallel,  1.60s,  1.67s elapsed
    Generation 1:   202 collections,     0 parallel,  0.28s,  0.30s elapsed

    INIT  time    0.00s  (  0.00s elapsed)
    MUT   time    6.52s  (  6.51s elapsed)
    GC    time    1.87s  (  1.97s elapsed)
    EXIT  time    0.00s  (  0.00s elapsed)
    Total time    8.40s  (  8.48s elapsed)

    %GC time      22.3%  (23.2% elapsed)

    Alloc rate    860,302,018 bytes per MUT second

    Productivity  77.7% of total user, 76.9% of total elapsed

Furthermore, if we actually profile the sparse forest representation, we confirm that a *huge* amount of allocation is being done during tree melding.

Conclusion: the type-magical binomial heap implementation does a lot less allocating, and is faster even ignoring GC time. I encourage you to test it out for yourself, but I think you'll find similar results. =D

Louis
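P.S. If you want to reproduce this without setting up benchpress, a minimal base-only harness along these lines should do. It's a sketch rather than the actual test driver: it forces each sorted result and calls performGC before every run, reporting CPU times in milliseconds.

    import Control.Exception (evaluate)
    import Control.Monad (forM)
    import System.CPUTime (getCPUTime)
    import System.Mem (performGC)

    -- time several runs of a heapsort function, in milliseconds
    timeRuns :: Int -> ([Int] -> [Int]) -> [Int] -> IO [Double]
    timeRuns runs sortFn xs = forM [1 .. runs] $ \_ -> do
        performGC                          -- GC between runs, as above
        t0 <- getCPUTime
        _  <- evaluate (sum (sortFn xs))   -- sum forces every element
        t1 <- getCPUTime
        return (fromIntegral (t1 - t0) / 1e9)  -- picoseconds -> ms

Invoke it as, say, timeRuns 100 (toAscList . fromList) inputInts, where fromList and toAscList come from whichever implementation you're measuring.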