Re: [Haskell-cafe] Very imperfect hash function

From: Hans Aberg
On 28 Jan 2010, at 20:07, Steve Schafer wrote:
> The data are currently in a large lookup table. To save space, I'd like to convert that into a sort of hash function:
> hash :: key -> value
> My question is this: Is there any kind of generic approach that can make use of the knowledge about the internal redundancy of the keys to come up with an efficient function?
There are minimal perfect hash functions; there are some libraries mentioned here, though they are not in Haskell code: http://en.wikipedia.org/wiki/Perfect_hash_function
This is suitable when you do a lot of lookups with few key updates. An alternative might be Data.Map, where lookups have time complexity O(log n), n = size of map.

From: John Lato
The minimal perfect hash function looks like the real algorithmic solution, but I would just use Data.IntMap for this. Cheers, John
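John's Data.IntMap suggestion can be sketched as below. The keys and values here are made up for illustration; the real table would be built once from the poster's full data set.

```haskell
import qualified Data.IntMap as IM

-- Hypothetical lookup table built once from the full data set.
table :: IM.IntMap Int
table = IM.fromList [(17, 100), (42, 200), (99, 300)]

-- Lookup walks a patricia trie: O(min(n, W)), W = word size in bits.
lookupValue :: Int -> Maybe Int
lookupValue k = IM.lookup k table
```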

On 29 Jan 2010, at 12:52, John Lato wrote:
>> There are minimal perfect hash functions; there are some libraries mentioned here, though they are not in Haskell code: http://en.wikipedia.org/wiki/Perfect_hash_function
>> This is suitable when you do a lot of lookups with few key updates. An alternative might be Data.Map, where lookups have time complexity O(log n), n = size of map.
> The minimal perfect hash function looks like the real algorithmic solution, but I would just use Data.IntMap for this.
That looks interesting too. Yet another idea: use arrays: http://haskell.org/haskellwiki/Arrays. Then build a hash table, say just taking mod k n, and have the values in some lookup map. If n > the number of keys, the average time complexity is O(1), and arrays should be very fast. Hans

On Fri, Jan 29, 2010 at 12:46 PM, Hans Aberg wrote:
> On 29 Jan 2010, at 12:52, John Lato wrote:
>>> There are minimal perfect hash functions; there are some libraries mentioned here, though they are not in Haskell code: http://en.wikipedia.org/wiki/Perfect_hash_function
>>> This is suitable when you do a lot of lookups with few key updates. An alternative might be Data.Map, where lookups have time complexity O(log n), n = size of map.
>> The minimal perfect hash function looks like the real algorithmic solution, but I would just use Data.IntMap for this.
> That looks interesting too. Yet another idea: use arrays: http://haskell.org/haskellwiki/Arrays. Then build a hash table, say just taking mod k n, and have the values in some lookup map. If n > the number of keys, the average time complexity is O(1), and arrays should be very fast.
I just want to be sure I understand this. For this plan, you aren't intending to use a perfect hash function at all, correct? Are you basically just suggesting to stick everything in an array with the key as an index? Or are you suggesting an actual hash table? If it's the latter, I'm not certain where the array fits into the picture. I'm pretty sure I'm missing something here.

In either case, I don't think this would be as good as what I thought was your original suggestion, i.e. using a minimal perfect hash function mphf that hashes keys to a value in 0..v-1. With v = number of values and valArr = array of all values, the lookup is then

  lookup k = valArr ! mphf k

Cheers, John
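John's mphf-plus-array scheme might look like the sketch below. The hash function here is a hypothetical hand-written one for a three-key set; a real minimal perfect hash would be generated by a tool (e.g. the CMPH library mentioned via the Wikipedia page).

```haskell
import Data.Array

-- Hypothetical minimal perfect hash for the key set {17, 42, 99};
-- in practice this function would be machine-generated.
mphf :: Int -> Int
mphf 17 = 0
mphf 42 = 1
mphf 99 = 2
mphf _  = error "key not in table"

-- One slot per value, indices 0..v-1 with v = 3 here.
valArr :: Array Int Int
valArr = listArray (0, 2) [100, 200, 300]

-- John's lookup: hash to an index, then a single array read.
lookupVal :: Int -> Int
lookupVal k = valArr ! mphf k
```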

On 29 Jan 2010, at 15:57, John Lato wrote:
>> That looks interesting too. Yet another idea: use arrays: http://haskell.org/haskellwiki/Arrays. Then build a hash table, say just taking mod k n, and have the values in some lookup map. If n > the number of keys, the average time complexity is O(1), and arrays should be very fast.
> I just want to be sure I understand this. For this plan, you aren't intending to use a perfect hash function at all, correct?
Yes, this is another idea. In a hash table, it is not important to have different indices, only that the table entries are spread as flatly as possible.
> Are you basically just suggesting to stick everything in an array with the key as an index?
You still need to fold the key values onto some interval.
> Or are you suggesting an actual hash table?
The hash function folds the keys onto an interval. Since you have Int values k, you might just use a mod k n function for that.
> If it's the latter, I'm not certain where the array fits into the picture. I'm pretty sure I'm missing something here.
There is a module Data.HashTable. The array is just to make lookup fast. Like in: ST s (STArray s HashKey (Map Key Value))
> In either case, I don't think this would be as good as what I thought was your original suggestion, i.e. using a minimal perfect hash function mphf that hashes keys to a value in 0..v-1. With v = number of values and valArr = array of all values, the lookup is then
> lookup k = valArr ! mphf k
Right, but you may want to avoid implementing the perfect hash function yourself if it is only available in C. :-) There is the FFI, though. Hans
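Hans' array-of-buckets idea could be sketched with an immutable Array of Maps, indexed by mod k n. The bucket count n and the sample keys below are hypothetical; Hans suggests choosing n larger than the key set so the buckets stay small.

```haskell
import Data.Array
import qualified Data.Map as M

-- Hypothetical bucket count; ideally larger than the number of keys.
n :: Int
n = 7

-- Each array slot holds a small Map of the keys hashing to it via
-- mod k n; with few collisions, the average lookup is close to O(1).
buckets :: Array Int (M.Map Int String)
buckets = accumArray (\m (k, v) -> M.insert k v m) M.empty (0, n - 1)
  [ (k `mod` n, (k, v)) | (k, v) <- [(17, "a"), (42, "b"), (99, "c")] ]

lookupBucketed :: Int -> Maybe String
lookupBucketed k = M.lookup k (buckets ! (k `mod` n))
```

The thread also mentions Data.HashTable and a mutable STArray variant; the immutable version above is the simplest form of the same idea when the table is built once and then only read.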

On Fri, Jan 29, 2010 at 9:26 AM, Hans Aberg wrote:
> On 29 Jan 2010, at 15:57, John Lato wrote:
>> Are you basically just suggesting to stick everything in an array with the key as an index?
> You still need to fold the key values onto some interval.
Not with a suitably large array ;-)
>> Or are you suggesting an actual hash table?
> The hash function folds the keys onto an interval. Since you have Int values k, you might just use a mod k n function for that.
>> If it's the latter, I'm not certain where the array fits into the picture. I'm pretty sure I'm missing something here.
> There is a module Data.HashTable. The array is just to make lookup fast. Like in: ST s (STArray s HashKey (Map Key Value))
I was misunderstanding your intention here. This is an interesting approach, but I would still try an IntMap first. Cheers, John

On 31 Jan 2010, at 20:07, John Lato wrote:
>>> Or are you suggesting an actual hash table?
>> The hash function folds the keys onto an interval. Since you have Int values k, you might just use a mod k n function for that.
>>> If it's the latter, I'm not certain where the array fits into the picture. I'm pretty sure I'm missing something here.
>> There is a module Data.HashTable. The array is just to make lookup fast. Like in: ST s (STArray s HashKey (Map Key Value))
> I was misunderstanding your intention here. This is an interesting approach, but I would still try an IntMap first.
A simple hash function for strings is to exclusive-or the bytes and then reduce modulo a prime number, which will be the table size. The average lookup time complexity is then that of the hash table. You can compare that with the IntMap, which may have time complexity O(4), the number of bytes in an Int32. There might be a big difference in the constant factor, though: an array seems to be much faster.

I first implemented multivariate polynomial division using lists, because they are easy to work with, and an example built on top of it took tens of seconds to run in Hugs. When reworking it using STArray (mutable) and Array (immutable), it has no noticeable delay. Though I am using Map and Set too, the choice of container otherwise seems to be dictated by what is most convenient to program. Hans
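Hans' string hash can be written in a couple of lines; the table size 101 below is an arbitrary prime chosen only for illustration.

```haskell
import Data.Bits (xor)
import Data.Char (ord)

-- Hypothetical prime table size.
tableSize :: Int
tableSize = 101

-- XOR all the character codes together, then reduce mod the table size.
xorHash :: String -> Int
xorHash s = foldl (\h c -> h `xor` ord c) 0 s `mod` tableSize
```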

On Feb 1, 2010, at 9:04 AM, Hans Aberg wrote:
> A simple hash function for strings is to exclusive-or the bytes and then reduce modulo a prime number,
Simply exclusive-oring the bytes will give you at most 256 distinct results. (For an ASCII source, 128 distinct results.) After that, there hardly seems to be any point in reduction modulo a prime. This approach can't tell a CAT from an ACT or a DOG from a GOD, which is another strike against it. (It also can't tell a TITTLE from a TILE, or a BOTTLE from a BOLE, for obvious reasons.)
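Richard's anagram objection follows from XOR being commutative and self-inverse; a quick check, with the XOR fold written inline:

```haskell
import Data.Bits (xor)
import Data.Char (ord)

-- XOR of the character codes; order-insensitive, so anagrams collide.
xorBytes :: String -> Int
xorBytes = foldl (\h c -> h `xor` ord c) 0

anagramsCollide :: Bool
anagramsCollide = xorBytes "CAT" == xorBytes "ACT"

-- Paired letters cancel (x `xor` x == 0), so TITTLE and TILE collide too:
-- the three T's in TITTLE reduce to a single T.
doublesCancel :: Bool
doublesCancel = xorBytes "TITTLE" == xorBytes "TILE"
```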

On 2 Feb 2010, at 03:05, Richard O'Keefe wrote:
>> A simple hash function for strings is to exclusive-or the bytes and then reduce modulo a prime number,
> Simply exclusive-oring the bytes will give you at most 256 distinct results. (For an ASCII source, 128 distinct results.) After that, there hardly seems to be any point in reduction modulo a prime.
Right - I just gave an example of how simple hash functions may be created. The original question deals with Int32s, not strings. As mentioned before, there are more advanced libraries here: http://en.wikipedia.org/wiki/Perfect_hash_function but they are not written in Haskell. So you have to rewrite them or use the FFI.
> This approach can't tell a CAT from an ACT or a DOG from a GOD, which is another strike against it. (It also can't tell a TITTLE from a TILE, or a BOTTLE from a BOLE, for obvious reasons.)
The hash function just tries to produce random lookup values, which are used to flatten the average depth of the lookup table at the entries of the array. So there is a tradeoff between a fast hash function and one that does a good job. Hans
participants (3)
- Hans Aberg
- John Lato
- Richard O'Keefe