
On Tue, 21 Feb 2012 11:25:05 -0800, Johan Tibell
On Tue, Feb 21, 2012 at 11:20 AM, Ben Gamari
wrote: This is certainly an option. It's actually been suggested that I use HashMaps instead of IntMaps to get more uniform use of the key-space. I have been a little worried that the cost of hashing would outweigh the benefit of using benefits of more shallow trees, but I suppose memory accesses are expensive. That being said, I could certainly just use the Enum instance.
In the case of Ints and newtypes thereof hashing is very cheap. A no-op or a multiplication with a large prime.
Sure. In my application (machine learning) we were mapping from tokens (short strings, generally less than 15 bytes) to sequential newtype'd Ints (newtype Token = Token Int). However, the fact that these identifiers were sequential meant that IntMap's trees would be close to the worst-case log(N) depth. We would use the hash of the string if it weren't for the fact that at times we'd want to index a map by pairs of tokens. For this, I used an awful hack where I'd make a pair of Enums an instance of Enum by bitwise operations, instance (Enum a, Enum b) => Enum (a,b) where fromEnum (a,b) = (fromEnum a `shiftL` 32) .|. fromEnum b ... This never sat well with me but it worked and I didn't see any equally efficient alternatives. A lesser issue of this approach is the need to keep around a bijection between Tokens and their associated strings for the purpose of adding new data or presenting results. In light of all of this, it was suggested by Edward Kmett that perhaps a HashMap would make sense. In my understanding, he proposed that the cost of hashing a short string could be dwarfed by the potential cache misses of a deep tree lookup. I haven't tested this hypothesis, but it seems plausible. Do you think this trade-off could make sense? Using Hashable has the attendent advantage of easily accomodating pairs, as well as not requiring the Token<->String mapping. It has the disadvantage, however, potentially leading to collisions. Given the size of our data sets, though, it seems unlikely that this would happen (although the outcome could be rather unfortunate if it did).
Anyways, I would be fine with seeing HashMap merged in leiu of enummapset. It would be nice to see at least one merged, however.
I'm hurrying slowly on purpose here. I'm almost finished with a new implementation of HashMap and it's easier to experiment while it's in a separate package (e.g. people expect less stability.) My goal is to eventually merge though.
I'm very glad to hear this. Cheers, - Ben