
On Aug 29, 2007, at 6:38 PM, Ian Lynagh wrote:
Hi,
On Tue, Aug 28, 2007 at 11:41:22AM -0400, Jan-Willem Maessen wrote:
golden :: Int32 golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32 -- was -1640531527 = round ((sqrt 5 - 1) * 2^31) :: Int32 -- but that has bad mulHi properties (even adding 2^32 to get its inverse) -- Whereas the above works well and contains no hash duplications for -- [-32767..65536]
hashInt32 :: Int32 -> Int32 hashInt32 x = mulHi x golden + x
This gives
map hashInt [0..16] [0,1,2,3,4,6,7,8,9,11,12,13,14,16,17,18,19]
-- (length $ group $ sort $ map hashInt [-32767..65536]) == 65536 + 32768
This test also passes for the
golden :: Int32 golden = -1640531527
hashInt :: Int -> Int32 hashInt x = fromIntegral x * golden
implementation, which has a very pretty distribution; graph at the bottom of http://www.brpreiss.com/books/opus4/html/page214.html
Recall that we're using the low-order bits of the hash code to index into the table. If the keys are always, say, multiples of 8 then the hash codes will always be multiples of 8 as well. We usually compensate in this case by using the *high-order* bits of the hash code for hash table indexing. I seem to recall considering this, but discovering that there were naive hash functions floating about somewhere that meant it would be a bad idea in practice (only the low bits would contain any data, and we'd hash everything to 0). I admit that I didn't think about just post-multiplying the result of the passed-in hash function by golden before looking at high bits. That of course results in gratuitous work if we've actually taken pains to design a decent hash function.
hashString :: String -> Int32 hashString = foldl' f golden where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m magic = 0xdeadbeef
Why use magic rather than golden?
It didn't work as well, that's all: *Data.HashTable.Multiplicative> testp golden 969 *Data.HashTable.Multiplicative> testp 0xdeadbeef 0 Having had the "hash table superstitions" conversation with a colleague several times, my hypothesis is that we want to choose unrelated multipliers in this case. Trying testp restricted to a modulus of 2^18 shows that it's probably a bit of a wash in practice.
This makes sense to me:
hashString :: String -> Int32 hashString = foldl' f golden where f m c = (fromIntegral (ord c) `xor` m) * golden
Is anything obviously wrong with it?
Make that xor a + and it seems to work fine (carry chains are your friends), yielding fewer collisions in the 2^18 case. I'm much less worried that all your characters are multiples of 2^k. I'm assuming you mean (-1640531527) for golden. I haven't actually tried performance bakeoffs for any of these. Perhaps the original poster has an actual application in mind where the actual difference could be observed? That was what prompted my original revisions to the library in the first place. My instinct is that mulHi should actually be cheaper than it appears on most architectures, and so the computational cost should be pretty comparable, but your hashString is obviously cheaper to evaluate. -Jan