
On 2/21/12 3:58 PM, Ben Gamari wrote:
> In light of all of this, it was suggested by Edward Kmett that perhaps a HashMap would make sense. In my understanding, he proposed that the cost of hashing a short string could be dwarfed by the potential cache misses of a deep tree lookup. I haven't tested this hypothesis, but it seems plausible. Do you think this trade-off could make sense?

I'm in a similar situation for a natural language processing project. In my case I keep around a (HashMap, IntMap) bijection between strings and their integerization (roughly along the lines of the sketch in the P.S. below), and then have something like an EnumMap which is used elsewhere, since the Ints are newtyped. This approach has been more than fast enough for my needs; that is, so far as I can tell, the performance bottlenecks are elsewhere. I too often have to deal with pairs, triples, etc., of IDs, but I need a trie's API for those, so, alas, I can't use a HashMap for that part. And I don't think that giving out keys sequentially is the worst case for IntMap [1].

In any case, I've been meaning to post my project to Hackage for about a year now. I was derailed a bit after I gave a talk about my EDSL for smoothing probability distributions and Ed Kmett asked me to factor it out into a separate package so he could use it. I've since broken the project up into a number of smaller, more reusable packages, which is good in the long run, but I haven't had the time to put the pieces back together in order to post it to Hackage. Other than the smoothing EDSL, one of the packages is just for interning and integerizing strings. Since you're doing the same thing, we should talk off-list about API issues so that we can combine our work into a single package.

[1] At any given point, all the allocated keys will form a compact subspace of Int, all sharing the same high-order bits. Since IntMap is a big-endian trie with node fusion, that means you'll have a complete tree for the low-order bits and only one comparison for the long spine of shared high-order bits. The number of comparisons you need to make is always O(B), where B is the number of bits that differ among the keys in the subspace; but since the subspace is compact, you're making the most efficient use of the differing bits, and so B = O(log N), where N is the number of allocated keys.

In fact, this is one of the best cases for IntMap. The worst case is handing out keys in an order which maximizes B (e.g., 00000, 10000, 01000, 00100, 00010, ...), since then B will be min(N, W), where W is the word size. Any time you ensure that B = O(log N), you get a best case. For sequential keys this amounts to having a complete tree; for other orderings it amounts to having an essentially complete tree, where by "essentially" I mean that the incompleteness always comes in long shared spines, such as the spine above the complete tree in the sequential case. If you gave out keys in the order (map bitwiseReverse [0..]), then you'd get an essentially complete tree at the top, with long spines coming off each of the leaves. You could also come up with other crazy orderings where you vary, say, the top three and bottom three bits while holding the middle bits fixed. The greater the number of discontiguous regions of variance, the worse the performance: even though we can still make B = O(log N), we increase the constant factor, since we have to do a comparison for each of the non-varying spines. Thus [0..] and (map bitwiseReverse [0..]) are the best orderings, since each has only one non-varying spine (the P.P.S. below gives a quick illustration of these orderings). As described in the paper, Okasaki chose a big-endian trie over a little-endian one because the constant factors are better for the [0..] case often used in projects that need to hand out unique IDs, like machine learning, natural language processing, and compilers.

--
Live well,
~wren
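
P.S. Here's a rough sketch of the sort of interning table I described above, assuming Data.HashMap.Strict from unordered-containers. All the names here (SymId, SymTable, intern, unintern) are made up for illustration; they're not the API of my actual package:

    import qualified Data.HashMap.Strict as HM
    import qualified Data.IntMap         as IM

    -- Newtype the IDs so they can't be confused with other Ints.
    newtype SymId = SymId Int deriving (Eq, Ord, Show)

    -- The (HashMap, IntMap) bijection, plus the next fresh key.
    data SymTable = SymTable
        { toId   :: !(HM.HashMap String SymId)
        , fromId :: !(IM.IntMap String)
        , nextId :: !Int
        }

    emptyTable :: SymTable
    emptyTable = SymTable HM.empty IM.empty 0

    -- Intern a string: reuse its ID if we've seen it before;
    -- otherwise hand out the next sequential key (the
    -- IntMap-friendly case from footnote [1]).
    intern :: String -> SymTable -> (SymId, SymTable)
    intern s tbl@(SymTable to from n) =
        case HM.lookup s to of
            Just i  -> (i, tbl)
            Nothing -> ( SymId n
                       , SymTable (HM.insert s (SymId n) to)
                                  (IM.insert n s from)
                                  (n + 1)
                       )

    -- Recover the string for an ID.
    unintern :: SymId -> SymTable -> Maybe String
    unintern (SymId i) tbl = IM.lookup i (fromId tbl)

In practice you'd probably want Text rather than String for the keys, but the structure is the same.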
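P.P.S. And a quick illustration of the key orderings in footnote [1]. This bitwiseReverse is my own throwaway definition, and I'm using Word64 for clarity; IntMap keys are Int, so you'd convert with fromIntegral in practice:

    import Data.Bits (testBit, setBit)
    import Data.Word (Word64)

    -- Reverse the bit pattern of a 64-bit word.
    bitwiseReverse :: Word64 -> Word64
    bitwiseReverse w = foldl step 0 [0 .. 63]
        where
        step acc i
            | testBit w i = setBit acc (63 - i)
            | otherwise   = acc

    -- Best cases: only one non-varying spine apiece.
    sequentialKeys, reversedKeys :: [Word64]
    sequentialKeys = [0 ..]
    reversedKeys   = map bitwiseReverse [0 ..]

    -- A worst case: after 0, each key flips a fresh high-order bit
    -- (00000, 10000, 01000, ...), so B grows as min(N, W) rather
    -- than O(log N).
    worstCaseKeys :: [Word64]
    worstCaseKeys = 0 : [setBit 0 (63 - i) | i <- [0 .. 63]]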