Re: Data.HashTable.hashInt seems somewhat sub-optimal

On Thu, Aug 16, 2007 at 05:06:53PM +0100, Simon.Frankau@barclayscapital.com wrote:
I tried submitting this bug through the tracker, but it seemed to give me permissions errors - probably a firewall issue here.
Due to spammers you need to log in to the bug tracker (user guest, password guest), so that's another possibility.
Prelude> Data.HashTable.hashInt 0 0 Prelude> Data.HashTable.hashInt 1 -1 Prelude> Data.HashTable.hashInt 2 -1 Prelude> Data.HashTable.hashInt 3 -2 Prelude> Data.HashTable.hashInt 4 -2 Prelude> Data.HashTable.hashInt 5 -2 Prelude> Data.HashTable.hashInt 6 -3 Prelude> Data.HashTable.hashInt 7 -3 Prelude> Data.HashTable.hashInt 8 -4 Prelude> Data.HashTable.hashInt 9 -4 Prelude> Data.HashTable.hashInt 10 -4 Prelude> Data.HashTable.hashInt 200 -77 Prelude> Data.HashTable.hashInt 201 -77 Prelude> Data.HashTable.hashInt 202 -78
I prefer to use hashing to decrease the likelihood of collisions, not increase them. ;)
I think the original algorithm was quite possibly supposed to use the bottom 32 bits of the result,
Based on http://www.brpreiss.com/books/opus4/html/page213.html http://www.brpreiss.com/books/opus4/html/page214.html it looks like you're right. I'll change it from hashInt :: Int -> Int32 hashInt x = mulHi (fromIntegral x) golden to hashInt :: Int -> Int32 hashInt x = fromIntegral x * golden which gives better results: Prelude> Data.HashTable.hashInt 0 0 Prelude> Data.HashTable.hashInt 1 -1640531527 Prelude> Data.HashTable.hashInt 2 1013904242 Prelude> Data.HashTable.hashInt 3 -626627285 Prelude> Data.HashTable.hashInt 4 2027808484 Prelude> Data.HashTable.hashInt 5 387276957 Prelude> Data.HashTable.hashInt 6 -1253254570 I'm also suspicious of this, though: -- | A sample hash function for Strings. We keep multiplying by the -- golden ratio and adding. -- -- Note that this has not been extensively tested for reasonability, -- but Knuth argues that repeated multiplication by the golden ratio -- will minimize gaps in the hash space. hashString :: String -> Int32 hashString = foldl' f 0 where f m c = fromIntegral (ord c + 1) * golden + mulHi m golden should this be where f m c = (fromIntegral (ord c + 1) + m) * golden ? Does Knuth (TAOCP?) say? Thanks Ian

Hello, On Monday 20 August 2007 13:15, Ian Lynagh wrote:
... I'm also suspicious of this, though:
-- | A sample hash function for Strings. We keep multiplying by the -- golden ratio and adding. -- -- Note that this has not been extensively tested for reasonability, -- but Knuth argues that repeated multiplication by the golden ratio -- will minimize gaps in the hash space. hashString :: String -> Int32 hashString = foldl' f 0 where f m c = fromIntegral (ord c + 1) * golden + mulHi m golden
should this be
where f m c = (fromIntegral (ord c + 1) + m) * golden
? Does Knuth (TAOCP?) say?
In the 2nd edition of Knuth's The Art of Computer Programming, Vol 3, Sorting and Searching there is a discussion of hash functions on pp. 514-520. One of the techniques suggested for hashing a one-word (i.e. essentially fixed-size) key is the following multiplicative scheme: h(K) = floor ( M*(((A/w)*K)) mod 1) ) where w is the word-size (say, 2^32), M is the desired limit of the hash function (for efficiency, probably a suitable power of 2) and, finally, A is some integer constant. What happens here is that we consider the (word) K as a fraction with the binary point at the left end of the word rather than at the right, thus getting a fraction with a value between 0 and 1. This value we then multiply by A and cut off the integer part, once again getting a fractional value between 0 and 1. And finally, we multiply by M and cut away the fractional part to get an integer value between 0 and M-1. And, sure, Knuth suggests various variants of selecting the multiplier A related to the golden ratio (sqrt(5)-1)/2 = 0.6180... to gain suitable spreading of hashes for keys in arithmetic progressions. (K, K+d, K+2d, ...). But what we are dealing with in the hashString function is what Knuth would call a multiword or variable-length key. Such cases, Knuth suggests, "can be handled by multiple-precision extensions of [e.g. the multiplicative scheme] above, but it is generally adequate to speed things up by combining the individual words together into a single word, then doing a single multiplication ... as above." Neither of the above definitions of f implement a multiple-precision extension of the multiplicative hashing scheme that involves the golden ration. And none of the methods suggested by Knuth for combining multiple words into single words or otherwise compute hashes for multiword keys involve the golden ration. So I cannot find obvious traces of Knuth having anything at all to say about either of the f's.
...
Best regards Thorkil

Sigh. I've included a fix below, but I don't have a standard GHC checkout available to me (only the test sandbox I used to get the code in the first place) so I haven't generated a patch. Note the mentioned "reasonability tests", which hopefully should forestall such obvious infelicities in future. On Aug 26, 2007, at 1:42 PM, Thorkil Naur wrote:
Hello,
On Monday 20 August 2007 13:15, Ian Lynagh wrote:
... I'm also suspicious of this, though:
-- | A sample hash function for Strings. We keep multiplying by the -- golden ratio and adding. -- -- Note that this has not been extensively tested for reasonability, -- but Knuth argues that repeated multiplication by the golden ratio -- will minimize gaps in the hash space. hashString :: String -> Int32 hashString = foldl' f 0 where f m c = fromIntegral (ord c + 1) * golden + mulHi m golden
should this be
where f m c = (fromIntegral (ord c + 1) + m) * golden
? Does Knuth (TAOCP?) say?
In the 2nd edition of Knuth's The Art of Computer Programming, Vol 3, Sorting and Searching there is a discussion of hash functions on pp. 514-520. One of the techniques suggested for hashing a one-word (i.e. essentially fixed-size) key is the following multiplicative scheme:
h(K) = floor ( M*(((A/w)*K)) mod 1) )
where w is the word-size (say, 2^32), M is the desired limit of the hash function (for efficiency, probably a suitable power of 2) and, finally, A is some integer constant. What happens here is that we consider the (word) K as a fraction with the binary point at the left end of the word rather than at the right, thus getting a fraction with a value between 0 and 1. This value we then multiply by A and cut off the integer part, once again getting a fractional value between 0 and 1. And finally, we multiply by M and cut away the fractional part to get an integer value between 0 and M-1. And, sure, Knuth suggests various variants of selecting the multiplier A related to the golden ratio (sqrt(5)-1)/2 = 0.6180... to gain suitable spreading of hashes for keys in arithmetic progressions. (K, K+d, K+2d, ...).
In the fix below I ended up using twice the golden ratio (a value of A greater than one). The inverse of the golden ratio (which is 1 + golden) didn't work well at all.
But what we are dealing with in the hashString function is what Knuth would call a multiword or variable-length key. Such cases, Knuth suggests, "can be handled by multiple-precision extensions of [e.g. the multiplicative scheme] above, but it is generally adequate to speed things up by combining the individual words together into a single word, then doing a single multiplication ... as above."
But combining things into a single word requires having a good combining mechanism, which is hard in general---particularly with a type like Char that appears to have a large range but in practice only has a very small one.
Neither of the above definitions of f implement a multiple- precision extension of the multiplicative hashing scheme that involves the golden ration.
I was disinclined to compute the golden ratio out to a number of digits suitable to combine together all the list elements, and I'm skeptical it'd actually work well if I did (I think we'd lose most of the information from the low-order bits, actually). But I'd invite someone to give it a try.
And none of the methods suggested by Knuth for combining multiple words into single words or otherwise compute hashes for multiword keys involve the golden ration.
He suggests adding or xoring the elements of a string together before hashing. In practice this has a bunch of known terrible failure modes. We rather particularly don't want to use a commutative combining operator, or (as he notes) "XY" and "YX" will have the same hash. I tried various variations on string hashing before arriving at the one below (which is not too different from the old one, but contains an unrelated pre-multiplier). I actually did do a bakeoff between multiple hashing schemes, and have a version of Data.HashTable that requires separate import of a hashing technique. The multiplicative hash worked better in the bakeoff than either the modulus hash used in older versions of the library, or a couple of versions of the Bob Jenkins hash. Could I convince someone with a checked out repository to cut and paste this into Data.HashTable and generate a patch / check it in? Thanks, -Jan Here's the fix: -- ------------------------------------------------------------------------ ----- -- Sample hash functions -- $hash_functions -- -- This implementation of hash tables uses the low-order /n/ bits of the hash -- value for a key, where /n/ varies as the hash table grows. A good hash -- function therefore will give an even distribution regardless of /n/. -- -- If your keyspace is integrals such that the low-order bits between -- keys are highly variable, then you could get away with using 'fromIntegral' -- as the hash function. -- -- We provide some sample hash functions for 'Int' and 'String' below. golden :: Int32 golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32 -- was -1640531527 = round ((sqrt 5 - 1) * 2^31) :: Int32 -- but that has bad mulHi properties (even adding 2^32 to get its inverse) -- Whereas the above works well and contains no hash duplications for -- [-32767..65536] hashInt32 :: Int32 -> Int32 hashInt32 x = mulHi x golden + x -- | A sample (and useful) hash function for Int and Int32, -- implemented by extracting the uppermost 32 bits of the 64-bit -- result of multiplying by a 33-bit constant. The constant is from -- Knuth, derived from the golden ratio: -- > golden = round ((sqrt 5 - 1) * 2^32) -- We get good key uniqueness on small inputs -- (a problem with previous versions): -- (length $ group $ sort $ map hashInt [-32767..65536]) == 65536 + 32768 -- hashInt :: Int -> Int32 hashInt x = hashInt32 (fromIntegral x) -- hi 32 bits of a x-bit * 32 bit -> 64-bit multiply mulHi :: Int32 -> Int32 -> Int32 mulHi a b = fromIntegral (r `shiftR` 32) where r :: Int64 r = fromIntegral a * fromIntegral b -- | A sample hash function for Strings. We keep multiplying by the -- golden ratio and adding. The implementation is: -- -- > hashString = foldl' f golden -- > where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m -- > magic = 0xdeadbeef -- -- Where hashInt32 works just as hashInt shown above. -- -- Knuth argues that repeated multiplication by the golden ratio -- will minimize gaps in the hash space, and thus it's a good choice -- for combining together multiple keys to form one. -- -- Here we know that individual characters c are often small, and this -- produces frequent collisions if we use fromEnum c alone. A -- particular problem are the shorter low ASCII and ISO-8859-1 -- character strings. We pre-multiply by a magic twiddle factor to -- obtain a good distribution. In fact, given the following test: -- -- > testp :: Int32 -> Int -- > testp k = (n - ) . length . group . sort . map hs . take n $ ls -- > where ls = [] : [c : l | l <- ls, c <- ['\0'..'\xff']] -- > hs = foldl' f golden -- > f m c = fromIntegral (fromEnum c) * k + hashInt32 m -- > n = 100000 -- -- We discover that testp magic = 0. hashString :: String -> Int32 hashString = foldl' f golden where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m magic = 0xdeadbeef

Hi, On Tue, Aug 28, 2007 at 11:41:22AM -0400, Jan-Willem Maessen wrote:
golden :: Int32 golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32 -- was -1640531527 = round ((sqrt 5 - 1) * 2^31) :: Int32 -- but that has bad mulHi properties (even adding 2^32 to get its inverse) -- Whereas the above works well and contains no hash duplications for -- [-32767..65536]
hashInt32 :: Int32 -> Int32 hashInt32 x = mulHi x golden + x
This gives
map hashInt [0..16] [0,1,2,3,4,6,7,8,9,11,12,13,14,16,17,18,19]
-- (length $ group $ sort $ map hashInt [-32767..65536]) == 65536 + 32768
This test also passes for the golden :: Int32 golden = -1640531527 hashInt :: Int -> Int32 hashInt x = fromIntegral x * golden implementation, which has a very pretty distribution; graph at the bottom of http://www.brpreiss.com/books/opus4/html/page214.html
hashString :: String -> Int32 hashString = foldl' f golden where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m magic = 0xdeadbeef
Why use magic rather than golden? This makes sense to me: hashString :: String -> Int32 hashString = foldl' f golden where f m c = (fromIntegral (ord c) `xor` m) * golden Is anything obviously wrong with it? Thanks Ian

On Aug 29, 2007, at 6:38 PM, Ian Lynagh wrote:
Hi,
On Tue, Aug 28, 2007 at 11:41:22AM -0400, Jan-Willem Maessen wrote:
golden :: Int32 golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32 -- was -1640531527 = round ((sqrt 5 - 1) * 2^31) :: Int32 -- but that has bad mulHi properties (even adding 2^32 to get its inverse) -- Whereas the above works well and contains no hash duplications for -- [-32767..65536]
hashInt32 :: Int32 -> Int32 hashInt32 x = mulHi x golden + x
This gives
map hashInt [0..16] [0,1,2,3,4,6,7,8,9,11,12,13,14,16,17,18,19]
-- (length $ group $ sort $ map hashInt [-32767..65536]) == 65536 + 32768
This test also passes for the
golden :: Int32 golden = -1640531527
hashInt :: Int -> Int32 hashInt x = fromIntegral x * golden
implementation, which has a very pretty distribution; graph at the bottom of http://www.brpreiss.com/books/opus4/html/page214.html
Recall that we're using the low-order bits of the hash code to index into the table. If the keys are always, say, multiples of 8 then the hash codes will always be multiples of 8 as well. We usually compensate in this case by using the *high-order* bits of the hash code for hash table indexing. I seem to recall considering this, but discovering that there were naive hash functions floating about somewhere that meant it would be a bad idea in practice (only the low bits would contain any data, and we'd hash everything to 0). I admit that I didn't think about just post-multiplying the result of the passed-in hash function by golden before looking at high bits. That of course results in gratuitous work if we've actually taken pains to design a decent hash function.
hashString :: String -> Int32 hashString = foldl' f golden where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m magic = 0xdeadbeef
Why use magic rather than golden?
It didn't work as well, that's all: *Data.HashTable.Multiplicative> testp golden 969 *Data.HashTable.Multiplicative> testp 0xdeadbeef 0 Having had the "hash table superstitions" conversation with a colleague several times, my hypothesis is that we want to choose unrelated multipliers in this case. Trying testp restricted to a modulus of 2^18 shows that it's probably a bit of a wash in practice.
This makes sense to me:
hashString :: String -> Int32 hashString = foldl' f golden where f m c = (fromIntegral (ord c) `xor` m) * golden
Is anything obviously wrong with it?
Make that xor a + and it seems to work fine (carry chains are your friends), yielding fewer collisions in the 2^18 case. I'm much less worried that all your characters are multiples of 2^k. I'm assuming you mean (-1640531527) for golden. I haven't actually tried performance bakeoffs for any of these. Perhaps the original poster has an actual application in mind where the actual difference could be observed? That was what prompted my original revisions to the library in the first place. My instinct is that mulHi should actually be cheaper than it appears on most architectures, and so the computational cost should be pretty comparable, but your hashString is obviously cheaper to evaluate. -Jan

Hello Jan-Willem, you may be interested to read hashing papers mentioned at http://www.encode.ru/forums/index.php?action=vthread&forum=1&topic=413 -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Aug 30, 2007, at 5:08 AM, Bulat Ziganshin wrote:
Hello Jan-Willem,
you may be interested to read hashing papers mentioned at http://www.encode.ru/forums/index.php?action=vthread&forum=1&topic=413
Not only did I read them, I tried out the Bob Jenkins hash function! My conclusion is that for Data.HashTable we do not generally need a strong hash; we need a reasonable hash that's cheap to evaluate. The mulHi operation is cheap on most architectures (eg a single multiply on 32-bit x86). Remember, for a hash table we're trading off hashing cost with the cost of chain traversal. There are plenty of other applications of hashing where we need to do a better job than this. -Jan
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello Jan-Willem, Thursday, August 30, 2007, 4:28:28 PM, you wrote:
http://www.encode.ru/forums/index.php?action=vthread&forum=1&topic=413
Not only did I read them, I tried out the Bob Jenkins hash function!
i don't propose to use this function, this page just contain a lot of various hash functions together with discussions. in particular, there is info about multiplicative hash which is close to your hashing scheme -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Oh, yes, in case others are interested in playing with the Jenkins hash, here's Haskell code for it. If you think I should have stuck more "seq's" and "!"s into it, by all means go to town (I compile with optimization and everything is strict). -Jan -- | The burtleburtle.net hash function devised by Bob Jenkins and -- used in perl et al. This is written gracefully in a very -- imperative way, and looks quite ugly when functionalized. mix :: Int32 -> Int32 -> Int32 -> (Int32 -> Int32 -> Int32 -> a) -> a mix a0 b0 c0 k0 = let mixR k a b c = (a-b-c) `xor` (c `shiftR` k) mixL k b c a = (b-c-a) `xor` (a `shiftL` k) mix3 k1 k2 k3 k a b c = let a' = mixR k1 a b c b' = mixL k2 b c a' c' = mixR k3 c a' b' in k a' b' c' in (mix3 13 8 13 $ mix3 12 16 5 $ mix3 3 10 15 $ k0) a0 b0 c0 golden :: Int32 golden = -1640531527 hashInt :: Int -> Int32 hashInt x = mix golden 0 (fromIntegral x) $ \_ _ r -> r -- | A hash function for Strings based on a slightly modified version -- of the burtleburtle string hash. We use the same mix, but we mix -- every 3 Chars (not 12) since Haskell Chars are unicode. That does -- make this hash 4x more expensive in the common case. -- -- hashString :: String -> Int32 hashString str = hs str golden 0 0 where hs (a':b':c':str) a b c = mix (a + orrd a') (b + orrd b') (c + orrd c') $ hs str hs [b',c'] a b c = mix a (b + orrd b') (c + orrd c') $ \_ _ r -> r hs [c'] a b c = mix a b (c + orrd c') $ \_ _ r -> r hs [] _ _ c = c orrd :: Char -> Int32 orrd = fromIntegral . fromEnum

On Wed, Aug 29, 2007 at 08:08:51PM -0400, Jan-Willem Maessen wrote:
Recall that we're using the low-order bits of the hash code to index into the table.
Aha, I see, thanks. OK, I'll push your patch after validating. Thanks Ian
participants (4)
-
Bulat Ziganshin
-
Ian Lynagh
-
Jan-Willem Maessen
-
Thorkil Naur