Re: Data.HashTable.hashInt seems somewhat sub-optimal

28 Aug 2007

      Sigh.

I've included a fix below, but I don't have a standard GHC checkout  
available to me (only the test sandbox I used to get the code in the  
first place) so I haven't generated a patch.  Note the mentioned  
"reasonability tests", which hopefully should forestall such obvious  
infelicities in future.

On Aug 26, 2007, at 1:42 PM, Thorkil Naur wrote:
...
Hello,
On Monday 20 August 2007 13:15, Ian Lynagh wrote:
...
...
I'm also suspicious of this, though:
-- | A sample hash function for Strings.  We keep multiplying  
by the
    -- golden ratio and adding.
    --
    -- Note that this has not been extensively tested for  
reasonability,
    -- but Knuth argues that repeated multiplication by the golden  
ratio
    -- will minimize gaps in the hash space.
    hashString :: String -> Int32
    hashString = foldl' f 0
      where f m c = fromIntegral (ord c + 1) * golden + mulHi m  
golden
should this be
where f m c = (fromIntegral (ord c + 1) + m) * golden
? Does Knuth (TAOCP?) say?
In the 2nd edition of Knuth's The Art of Computer Programming, Vol  
3, Sorting
and Searching there is a discussion of hash functions on pp.  
514-520. One of
the techniques suggested for hashing a one-word (i.e. essentially  
fixed-size)
key is the following multiplicative scheme:
h(K) = floor ( M*(((A/w)*K)) mod 1) )
where w is the word-size (say, 2^32), M is the desired limit of the  
hash
function (for efficiency, probably a suitable power of 2) and,  
finally, A is
some integer constant. What happens here is that we consider the  
(word) K as
a fraction with the binary point at the left end of the word rather  
than at
the right, thus getting a fraction with a value between 0 and 1.  
This value
we then multiply by A and cut off the integer part, once again  
getting a
fractional value between 0 and 1. And finally, we multiply by M and  
cut away
the fractional part to get an integer value between 0 and M-1. And,  
sure,
Knuth suggests various variants of selecting the multiplier A  
related to the
golden ratio (sqrt(5)-1)/2 = 0.6180... to gain suitable spreading  
of hashes
for keys in arithmetic progressions. (K, K+d, K+2d, ...).
In the fix below I ended up using twice the golden ratio (a value of  
A greater than one).  The inverse of the golden ratio (which is 1 +  
golden) didn't work well at all.
...
But what we are dealing with in the hashString function is what  
Knuth would
call a multiword or variable-length key. Such cases, Knuth  
suggests, "can be
handled by multiple-precision extensions of [e.g. the  
multiplicative scheme]
above, but it is generally adequate to speed things up by combining  
the
individual words together into a single word, then doing a single
multiplication ... as above."
But combining things into a single word requires having a good  
combining mechanism, which is hard in general---particularly with a  
type like Char that appears to have a large range but in practice  
only has a very small one.
...
Neither of the above definitions of f implement a multiple- 
precision extension
of the multiplicative hashing scheme that involves the golden ration.
I was disinclined to compute the golden ratio out to a number of  
digits suitable to combine together all the list elements, and I'm  
skeptical it'd actually work well if I did (I think we'd lose most of  
the information from the low-order bits, actually).  But I'd invite  
someone to give it a try.
...
And
none of the methods suggested by Knuth for combining multiple words  
into
single words or otherwise compute hashes for multiword keys involve  
the
golden ration.
He suggests adding or xoring the elements of a string together before  
hashing.  In practice this has a bunch of known terrible failure  
modes.  We rather particularly don't want to use a commutative  
combining operator, or (as he notes) "XY" and "YX" will have the same  
hash.  I tried various variations on string hashing before arriving  
at the one below (which is not too different from the old one, but  
contains an unrelated pre-multiplier).

I actually did do a bakeoff between multiple hashing schemes, and  
have a version of Data.HashTable that requires separate import of a  
hashing technique.  The multiplicative hash worked better in the  
bakeoff than either the modulus hash used in older versions of the  
library, or a couple of versions of the Bob Jenkins hash.

Could I convince someone with a checked out repository to cut and  
paste this into Data.HashTable and generate a patch / check it in?

Thanks,

-Jan

Here's the fix:

--  
------------------------------------------------------------------------ 
-----
-- Sample hash functions

-- $hash_functions
--
-- This implementation of hash tables uses the low-order /n/ bits of  
the hash
-- value for a key, where /n/ varies as the hash table grows.  A good  
hash
-- function therefore will give an even distribution regardless of /n/.
--
-- If your keyspace is integrals such that the low-order bits between
-- keys are highly variable, then you could get away with using  
'fromIntegral'
-- as the hash function.
--
-- We provide some sample hash functions for 'Int' and 'String' below.

golden :: Int32
golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32
-- was -1640531527 = round ((sqrt 5 - 1) * 2^31) :: Int32
-- but that has bad mulHi properties (even adding 2^32 to get its  
inverse)
-- Whereas the above works well and contains no hash duplications for
-- [-32767..65536]

hashInt32 :: Int32 -> Int32
hashInt32 x = mulHi x golden + x

-- | A sample (and useful) hash function for Int and Int32,
-- implemented by extracting the uppermost 32 bits of the 64-bit
-- result of multiplying by a 33-bit constant.  The constant is from
-- Knuth, derived from the golden ratio:
-- > golden = round ((sqrt 5 - 1) * 2^32)
-- We get good key uniqueness on small inputs
-- (a problem with previous versions):
--  (length $ group $ sort $ map hashInt [-32767..65536]) == 65536 +  
32768
--
hashInt :: Int -> Int32
hashInt x = hashInt32 (fromIntegral x)

-- hi 32 bits of a x-bit * 32 bit -> 64-bit multiply
mulHi :: Int32 -> Int32 -> Int32
mulHi a b = fromIntegral (r `shiftR` 32)
   where r :: Int64
         r = fromIntegral a * fromIntegral b

-- | A sample hash function for Strings.  We keep multiplying by the
-- golden ratio and adding.  The implementation is:
--
-- > hashString = foldl' f golden
-- >   where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m
-- >         magic = 0xdeadbeef
--
-- Where hashInt32 works just as hashInt shown above.
--
-- Knuth argues that repeated multiplication by the golden ratio
-- will minimize gaps in the hash space, and thus it's a good choice
-- for combining together multiple keys to form one.
--
-- Here we know that individual characters c are often small, and this
-- produces frequent collisions if we use fromEnum c alone.  A
-- particular problem are the shorter low ASCII and ISO-8859-1
-- character strings.  We pre-multiply by a magic twiddle factor to
-- obtain a good distribution.  In fact, given the following test:
--
-- > testp :: Int32 -> Int
-- > testp k = (n - ) . length . group . sort . map hs . take n $ ls
-- >   where ls = [] : [c : l | l <- ls, c <- ['\0'..'\xff']]
-- >         hs = foldl' f golden
-- >         f m c = fromIntegral (fromEnum c) * k + hashInt32 m
-- >         n = 100000
--
-- We discover that testp magic = 0.

hashString :: String -> Int32
hashString = foldl' f golden
   where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m
         magic = 0xdeadbeef

Re: Data.HashTable.hashInt seems somewhat sub-optimal

Jan-Willem Maessen