
I wrote:
Were you thinking that you could serialize those things and then just use murmurhash on them? Try it, but my guess is it won't work well. Murmurhash is designed for strings, not for those things. It's not a cryptographic hash; it's designed to have good statistical properties for a certain very specific types of input. If you abuse it, you'll likely lose those good properties. The parameters for murmerhash were tweaked using keys that were English words, and very short strings of random bytes. There is no evidence it will work well with other kinds of keys.
Thomas Schilling wrote:
Where did you get this from?
From experience developing hashes for container types.
The criteria listed at http://code.google.com/p/smhasher/wiki/SMHasher all talk about arbitrary bytes. MurmurHash is designed for arbitrary byte strings, not English words. It is not a cryptographic hash, sure, but I can't see any evidence that it only works with text.
Never mind their claims, look at their test cases. What they tested ought to work well. What they didn't test may or may not work well, and almost certainly will behave badly at least some of the time. Any fast non-cryptographic hash will by necessarily have many kinds of input that will behave badly. The idea is to test across a large sampling of the kinds of keys that typify the input you expect and tweak the constants in the hash to avoid too many collisions in those cases. As far as I could see, they tested against English words, and against fairly short strings of random bytes. And for "random bytes", the technique they used to generate them is important; changing to a different pseudorandom generator sometimes changes the results significantly. Furthermore, a fast serialization of containers is non-random, so you will be exercising the hash in a specific way that has not been tested. It could be it will be good enough, or it could be you can tweak your serialization or add another thin layer of hashing to make it all work. Thanks, Yitz