Re: RFC: Adding a Hashable type class and HashMap/HashSet data types to HP

29 Nov 2010

      I wrote:
...
...
Were you thinking that you could serialize those things and
then just use murmurhash on them?
Try it, but my guess is
it won't work well. Murmurhash is designed for strings, not for
those things. It's not a cryptographic hash; it's designed to
have good statistical properties for a certain very specific types
of input. If you abuse it, you'll likely lose those good properties.
The parameters for murmerhash were tweaked using keys that
were English words, and very short strings of random bytes.
There is no evidence it will work well with other kinds of keys.
Thomas Schilling wrote:
...
Where did you get this from?
...
From experience developing hashes for container
types.
...
 The criteria listed at
http://code.google.com/p/smhasher/wiki/SMHasher all talk about
arbitrary bytes.  MurmurHash is designed for arbitrary byte strings,
not English words.  It is not a cryptographic hash, sure, but I can't
see any evidence that it only works with text.
Never mind their claims, look at their test cases.
What they tested ought to work well. What they didn't
test may or may not work well, and almost
certainly will behave badly at least some of the time.

Any fast non-cryptographic hash will by necessarily have
many kinds of input that will behave badly. The idea
is to test across a large sampling of the kinds of keys
that typify the input you expect and tweak the constants
in the hash to avoid too many collisions in those cases.

As far as I could see, they tested against English words,
and against fairly short strings of random bytes. And
for "random bytes", the technique they used to generate them
is important; changing to a different pseudorandom generator
sometimes changes the results significantly.

Furthermore, a fast serialization of containers is non-random,
so you will be exercising the hash in a specific way
that has not been tested. It could be it will be good enough, or
it could be you can tweak your serialization or add another
thin layer of hashing to make it all work.

Thanks,
Yitz

Re: RFC: Adding a Hashable type class and HashMap/HashSet data types to HP

Yitzchak Gale