
Not so long ago, it was suggested[1] that enummapset[2] might be a good candidate for merging into containers. Having found this package to be very useful in the past, I would find this quite nice. Would others find this a reasonable proposal? If so, what needs to be done to make this happen?

Cheers,

- Ben

[1] http://blog.ezyang.com/2011/08/changes-to-intmap/comment-page-1/#comment-287...
[2] http://hackage.haskell.org/package/enummapset

Hi Ben,
On Tue, Feb 21, 2012 at 8:07 AM, Ben Gamari wrote:
> Not so long ago, it was suggested[1] that enummapset[2] might be a good candidate for merging into containers. Having found this package to be very useful in the past, I would find this quite nice. Would others find this a reasonable proposal? If so, what needs to be done to make this happen?

We need to discuss the pros and cons of making the API bigger. An alternative would be for us to merge unordered-containers into containers. HashMaps are as fast as IntMaps for any type that can be hashed to an Int, like Enums, but are more widely applicable. However, HashMaps are not ordered.

-- Johan

On Tue, 21 Feb 2012 09:54:08 -0800, Johan Tibell wrote:
> Hi Ben,
> We need to discuss the pros and cons of making the API bigger.
> An alternative would be for us to merge unordered-containers into containers. HashMaps are as fast as IntMaps for any type that can be hashed to an Int, like Enums, but are more widely applicable. However, HashMaps are not ordered.

This is certainly an option. It's actually been suggested that I use HashMaps instead of IntMaps to get more uniform use of the key-space. I have been a little worried that the cost of hashing would outweigh the benefit of shallower trees, but I suppose memory accesses are expensive. That being said, I could certainly just use the Enum instance.

Anyways, I would be fine with seeing HashMap merged in lieu of enummapset. It would be nice to see at least one merged, however.

Cheers,

- Ben

On Tue, Feb 21, 2012 at 11:20 AM, Ben Gamari wrote:
> This is certainly an option. It's actually been suggested that I use HashMaps instead of IntMaps to get more uniform use of the key-space. I have been a little worried that the cost of hashing would outweigh the benefit of shallower trees, but I suppose memory accesses are expensive. That being said, I could certainly just use the Enum instance.

In the case of Ints and newtypes thereof, hashing is very cheap: a no-op or a multiplication by a large prime.
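To illustrate the point (this is only a sketch, not unordered-containers' actual code, and the particular prime, the 64-bit FNV prime, is an arbitrary choice), a multiplicative Int hash of the kind described looks like:

```haskell
-- Illustrative sketch of a cheap Int hash: a single multiplication by a
-- large prime spreads sequential keys across the key space. The constant
-- is the 64-bit FNV prime, chosen here for illustration only; machine-word
-- multiplication wraps on overflow, which is fine for hashing.

hashInt :: Int -> Int
hashInt x = x * 1099511628211

main :: IO ()
main = print (map hashInt [0, 1, 2])
</imports>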
> Anyways, I would be fine with seeing HashMap merged in lieu of enummapset. It would be nice to see at least one merged, however.

I'm hurrying slowly on purpose here. I'm almost finished with a new implementation of HashMap, and it's easier to experiment while it's in a separate package (e.g., people expect less stability). My goal is to eventually merge, though.

-- Johan

On Tue, 21 Feb 2012 11:25:05 -0800, Johan Tibell wrote:
> On Tue, Feb 21, 2012 at 11:20 AM, Ben Gamari wrote:
>> This is certainly an option. It's actually been suggested that I use HashMaps instead of IntMaps to get more uniform use of the key-space. I have been a little worried that the cost of hashing would outweigh the benefit of shallower trees, but I suppose memory accesses are expensive. That being said, I could certainly just use the Enum instance.
> In the case of Ints and newtypes thereof, hashing is very cheap: a no-op or a multiplication by a large prime.

Sure. In my application (machine learning) we were mapping from tokens (short strings, generally less than 15 bytes) to sequential newtype'd Ints (newtype Token = Token Int). However, the fact that these identifiers were sequential meant that IntMap's trees would be close to the worst-case log(N) depth.

We would use the hash of the string if it weren't for the fact that at times we'd want to index a map by pairs of tokens. For this, I used an awful hack where I'd make a pair of Enums an instance of Enum by bitwise operations,

    instance (Enum a, Enum b) => Enum (a,b) where
      fromEnum (a,b) = (fromEnum a `shiftL` 32) .|. fromEnum b
      ...

This never sat well with me, but it worked and I didn't see any equally efficient alternatives. A lesser issue of this approach is the need to keep around a bijection between Tokens and their associated strings for the purpose of adding new data or presenting results.

In light of all of this, it was suggested by Edward Kmett that perhaps a HashMap would make sense. In my understanding, he proposed that the cost of hashing a short string could be dwarfed by the potential cache misses of a deep tree lookup. I haven't tested this hypothesis, but it seems plausible. Do you think this trade-off could make sense? Using Hashable has the attendant advantage of easily accommodating pairs, as well as not requiring the Token<->String mapping. It has the disadvantage, however, of potentially leading to collisions. Given the size of our data sets, though, it seems unlikely that this would happen (although the outcome could be rather unfortunate if it did).
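Filling in the elided toEnum (and masking the low word, under the same assumption the original makes: that each component's fromEnum value fits in 32 bits), the hack above can be completed as a self-contained sketch:

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))

newtype Token = Token Int
  deriving (Show, Eq)

instance Enum Token where
  fromEnum (Token i) = i
  toEnum = Token

-- Pack two Enums into a single Int by bitwise operations, assuming each
-- fromEnum value fits in the low 32 bits. Keys collide silently if that
-- assumption is violated, which is exactly why the hack is uncomfortable.
instance (Enum a, Enum b) => Enum (a, b) where
  fromEnum (a, b) = (fromEnum a `shiftL` 32) .|. (fromEnum b .&. 0xFFFFFFFF)
  toEnum n = (toEnum (n `shiftR` 32), toEnum (n .&. 0xFFFFFFFF))

main :: IO ()
main = do
  let k = fromEnum (Token 3, Token 5)
  print k                            -- the packed Int key
  print (toEnum k :: (Token, Token)) -- round-trips back to the pair
```

This is what lets an IntMap-style structure be indexed by pairs of Tokens; the Hashable alternative discussed below avoids the 32-bit assumption entirely.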
>> Anyways, I would be fine with seeing HashMap merged in lieu of enummapset. It would be nice to see at least one merged, however.
> I'm hurrying slowly on purpose here. I'm almost finished with a new implementation of HashMap, and it's easier to experiment while it's in a separate package (e.g., people expect less stability). My goal is to eventually merge, though.

I'm very glad to hear this.

Cheers,

- Ben

On Tue, Feb 21, 2012 at 3:58 PM, Ben Gamari wrote:
> On Tue, 21 Feb 2012 11:25:05 -0800, Johan Tibell wrote:
>> On Tue, Feb 21, 2012 at 11:20 AM, Ben Gamari wrote:
>>> This is certainly an option. It's actually been suggested that I use HashMaps instead of IntMaps to get more uniform use of the key-space. I have been a little worried that the cost of hashing would outweigh the benefit of shallower trees, but I suppose memory accesses are expensive. That being said, I could certainly just use the Enum instance.
>> In the case of Ints and newtypes thereof, hashing is very cheap: a no-op or a multiplication by a large prime.
> Sure. In my application (machine learning) we were mapping from tokens (short strings, generally less than 15 bytes) to sequential newtype'd Ints (newtype Token = Token Int). However, the fact that these identifiers were sequential meant that IntMap's trees would be close to the worst-case log(N) depth.
Either way, the tries only contain levels on which keys differ, so tree depths shouldn't vary by more than 1 or so.

HashMap uses lsb-first splitting in its current incarnation, so that you end up balancing contiguous keys across the trie. This is the main argument in favor of low-order-first splitting. The main argument against is that high-order-first splitting yields a standard binary search tree for lookup operations if you encode the pivots appropriately. I don't recall which method is used by Johan's wide-fanout tries. I seem to recall that at wider fanouts you're happier (for GC reasons) churning the right spine in response to contiguous insertions, rather than churning the entire tree.

The takeaway: if you're seeing performance problems from contiguous insertion, it might be worth comparing with HashMap.

-Jan-Willem Maessen
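A toy illustration of the splitting trade-off (the helper names here are hypothetical, and this is not HashMap's actual code): with a 4-way trie over 4-bit keys, splitting on the low-order bits spreads contiguous keys evenly across the children, while splitting on the high-order bits piles them into a few hot children.

```haskell
import Data.Bits ((.&.), shiftR)

-- Which child of a 4-way trie node a 4-bit key lands in, under each scheme.
lsbChild, msbChild :: Int -> Int
lsbChild k = k .&. 3               -- low-order-first: index by the two low bits
msbChild k = (k `shiftR` 2) .&. 3  -- high-order-first: index by the two high bits

main :: IO ()
main = do
  print (map lsbChild [0 .. 7])  -- [0,1,2,3,0,1,2,3]: contiguous keys balanced
  print (map msbChild [0 .. 7])  -- [0,0,0,0,1,1,1,1]: contiguous keys pile up
```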

On 2/21/12 3:58 PM, Ben Gamari wrote:
> In light of all of this, it was suggested by Edward Kmett that perhaps a HashMap would make sense. In my understanding, he proposed that the cost of hashing a short string could be dwarfed by the potential cache misses of a deep tree lookup. I haven't tested this hypothesis, but it seems plausible. Do you think this trade-off could make sense?
I'm in a similar situation for a natural language processing project. In my case I keep around a (HashMap,IntMap) bijection between strings and their integerization, and then have something like EnumMap which is used elsewhere since the Ints are newtyped. This approach has been more than fast enough for my needs; i.e., the performance bottlenecks are elsewhere so far as I can tell. I too often have to deal with pairs, triples, etc., of IDs--- but I need to do so with a trie's API, so, alas, I can't use a HashMap for that part. I don't think that giving out keys sequentially is the worst case for IntMap[1].

In any case, I've been meaning to post my project to Hackage for about a year now. I was derailed a bit after I gave a talk about my EDSL for smoothing probability distributions and Ed Kmett asked me to factor it out into a separate package so he could use it. I've since broken the project up into a number of smaller, more reusable packages--- which is good in the long run, but I haven't had the time to put the pieces back together in order to post it to Hackage. Other than the smoothing EDSL, one of the packages is just for interning and integerizing strings. Since you're doing the same thing, we should talk off-list about API issues so that we can combine our work into a single package.

[1] At any given point all the allocated keys will form a compact subspace of Int which all share the high-order bits. Since IntMap is a big-endian trie with node-fusion, that means you'll have a complete tree for the low-order bits and only one comparison for the long spine of all the high-order bits. The number of comparisons you need to make is always O(B), where B is the number of bits that differ among the subspace; but since the subspace is compact, you're making the most efficient use of the differing bits, and so B = O(log N), where N is the number of allocated keys. In fact, this is one of the best cases for IntMap.

The worst case is if you hand out keys in an order which maximizes B (e.g., handing out keys in an order like 00000, 10000, 01000, 00100, 00010, ...), since that means B will be min(N, W), where W is the word size. Any time you ensure that B = O(log N) it will be a best case. For the case of sequential keys this amounts to having a complete tree, but for other cases it amounts to having an essentially complete tree--- where by "essentially" we mean that the incompleteness always comes in long shared spines, such as the spine above the complete tree in the sequentially ordered case. If you gave out keys in the order (map bitwiseReverse [0..]) then you'd get an essentially complete tree at the top, with long spines coming off for each of the leaves. You could also come up with other crazy orderings where you use, say, the top three and bottom three bits but hold the middle bits fixed. The greater the number of discontiguous regions of variance, the worse the performance will be, because even though we can still make B = O(log N), we're increasing the constant factor, since we have to do a comparison for each of the non-varying spines. Thus, [0..] and (map bitwiseReverse [0..]) are the best, since they have only one non-varying spine.

As described in the paper, Okasaki chose to use a big-endian trie instead of a little-endian trie because the constant factors are better for the [0..] case often used in projects that need to hand out unique IDs, like machine learning, natural language processing, and compilers.

--
Live well,
~wren
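The (map bitwiseReverse [0..]) ordering mentioned above can be made concrete. The implementation below is a guess at what that name means (reversing the low w bits of a key); the width parameter and function name are assumptions for illustration:

```haskell
import Data.Bits (setBit, testBit)

-- Reverse the low w bits of a key: bit 0 swaps with bit (w-1), bit 1 with
-- bit (w-2), and so on. Handing out keys in this order varies the
-- high-order bits first rather than last.
bitwiseReverse :: Int -> Int -> Int
bitwiseReverse w x = foldl step 0 [0 .. w - 1]
  where
    step acc i
      | testBit x i = setBit acc (w - 1 - i)
      | otherwise   = acc

main :: IO ()
main = print (map (bitwiseReverse 3) [0 .. 7])  -- [0,4,2,6,1,5,3,7]
```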
participants (4):
- Ben Gamari
- Jan-Willem Maessen
- Johan Tibell
- wren ng thornton