On Tue, Feb 21, 2012 at 3:58 PM, Ben Gamari <bgamari.foss@gmail.com> wrote:
> On Tue, 21 Feb 2012 11:25:05 -0800, Johan Tibell <johan.tibell@gmail.com> wrote:
> > On Tue, Feb 21, 2012 at 11:20 AM, Ben Gamari <bgamari.foss@gmail.com> wrote:
> > > This is certainly an option. It's actually been suggested that I use
> > > HashMaps instead of IntMaps to get more uniform use of the
> > > key-space. I have been a little worried that the cost of hashing would
> > > outweigh the benefits of shallower trees, but I suppose memory
> > > accesses are expensive. That being said, I could certainly just use
> > > the Enum instance.
> >
> > In the case of Ints and newtypes thereof, hashing is very cheap: a
> > no-op or a multiplication by a large prime.
> >
> Sure. In my application (machine learning) we were mapping from tokens
> (short strings, generally less than 15 bytes) to sequential newtype'd
> Ints (newtype Token = Token Int). However, the fact that these
> identifiers were sequential meant that IntMap's trees would be close to
> the worst-case log(N) depth.
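
For what it's worth, hashing such a newtype really is essentially free.
A minimal sketch, assuming the hashable package (the Token instance
shown is illustrative, not the actual code from Ben's application):

    {-# LANGUAGE GeneralizedNewtypeDeriving #-}
    import Data.Hashable (Hashable)

    -- Deriving reuses the Int instance, so hashing a Token is a
    -- no-op or a single multiply, as Johan says.
    newtype Token = Token Int
      deriving (Eq, Hashable)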

Either way, the tries are path-compressed, containing only the levels on which keys actually differ, so tree depths shouldn't vary by more than 1 or so.
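
You can see this concretely with showTree, which Data.IntMap exports (at
least in the containers of this vintage).  Eight contiguous keys differ
only in their low three bits, so the Patricia trie comes out perfectly
balanced at depth log2(8) = 3.  A minimal sketch:

    import qualified Data.IntMap as IM

    -- Prints the internal structure: three levels of Bin nodes over
    -- eight Tip leaves, i.e. uniform depth 3.
    main :: IO ()
    main = putStr (IM.showTree (IM.fromList [(k, k) | k <- [0 .. 7 :: Int]]))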

HashMap uses lsb-first splitting in its current incarnation, so that you end up balancing contiguous keys across the trie.  This is the main argument in favor of low-order-first splitting.  The main argument against is that high-order-first splitting yields a standard binary search tree for lookup operations if you encode the pivots appropriately.
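
To make "lsb-first" concrete, here is a hedged sketch of subkey
extraction (bitsPerLevel and subkey are illustrative names, not
unordered-containers' actual internals):

    import Data.Bits ((.&.), shiftR)

    bitsPerLevel :: Int
    bitsPerLevel = 4  -- 16-way fanout, chosen here just for illustration

    -- The child index at a given trie level: the next bitsPerLevel
    -- bits of the hash, taken from the least-significant end first.
    subkey :: Int -> Int -> Int
    subkey h level = (h `shiftR` (level * bitsPerLevel)) .&. (2 ^ bitsPerLevel - 1)

Under this scheme the contiguous hashes 0..15 differ in their low four
bits and so land in sixteen different children of the root, rather than
piling into a single subtree.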

I don't recall which method is used by Johan's wide-fanout tries.  I seem to recall that at wider fanouts you're happier (for GC reasons) churning the right spine in response to contiguous insertions, rather than churning the entire tree.

The takeaway: if you're seeing performance problems from contiguous insertion, it might be worth comparing with HashMap.
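
That comparison is easy to set up; a hedged sketch using the criterion
package (the key count and the strict HashMap variant are arbitrary
choices here):

    import Criterion.Main
    import qualified Data.HashMap.Strict as HM
    import qualified Data.IntMap as IM

    keys :: [(Int, ())]
    keys = [(k, ()) | k <- [0 .. 100000]]

    -- Build each map from contiguous keys; size forces the spine.
    main :: IO ()
    main = defaultMain
      [ bench "IntMap/fromList"  $ whnf (IM.size . IM.fromList) keys
      , bench "HashMap/fromList" $ whnf (HM.size . HM.fromList) keys
      ]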

-Jan-Willem Maessen