RFC: Adding a Hashable type class and HashMap/HashSet data types to HP

Hi all, This is a request for early feedback on something that I'd like to propose for the next major HP release. Milan Straka showed [1] that it's possible to design a persistent set (and thus also map) for string-like keys that's about twice as fast as the current Data.Set using hashing and a Patricia tree (i.e. IntMap). The Clojure community has also managed to create a fast persistent map, based on Bagwell's hash array mapped trie. I'd like to propose that we get the necessary infrastructure in place for creating fast and easy to use hash-based data structure. Here's what I think is needed: * A type class for things that can be hashed, Hashable. class Hashable a where hash :: a -> Word An alternative would be to have `hash` return an Int, which would make the method easier to use with Haskell data structures, which are indexed using Ints, but Word feels like the more correct type to me. The Hashable type class would have to go in either base, containers, or a new package (there's currently a hashable package on Hackage.) * Instances of Hashable for common types, in particular ByteString and Text. I already have fast instances for Text and ByteString, based on MurmurHash2 [2]. * Two new data types in the containers package, Data.HashMap and Data.HashSet. These would be persistent data structures, initially based on IntMap, and have the same or similar API to Data.Map and Data.Set.. In the future we could also add mutable HashTables, operating in the ST monad, or other hash-based data types (persistent or mutable). Any comments? Johan 1. http://research.microsoft.com/~simonpj/papers/containers/containers.pdf 2. http://sites.google.com/site/murmurhash/

I like this idea. As I mentioned on IRC, I'd call the class Hash rather than
Hashable. I'm also with you on the Word return type. It may be less
convenient but maybe this will be a tiny step towards the "great Word
revolt" (in which all conceptually unsigned things in the prelude and
standard libraries actually become unsigned) that I hope will occur sometime
in the near future.
On Thu, Nov 18, 2010 at 5:05 PM, Johan Tibell
Hi all,
This is a request for early feedback on something that I'd like to propose for the next major HP release.
Milan Straka showed [1] that it's possible to design a persistent set (and thus also map) for string-like keys that's about twice as fast as the current Data.Set using hashing and a Patricia tree (i.e. IntMap). The Clojure community has also managed to create a fast persistent map, based on Bagwell's hash array mapped trie.
I'd like to propose that we get the necessary infrastructure in place for creating fast and easy to use hash-based data structure. Here's what I think is needed:
* A type class for things that can be hashed, Hashable.
class Hashable a where hash :: a -> Word
An alternative would be to have `hash` return an Int, which would make the method easier to use with Haskell data structures, which are indexed using Ints, but Word feels like the more correct type to me.
The Hashable type class would have to go in either base, containers, or a new package (there's currently a hashable package on Hackage.)
* Instances of Hashable for common types, in particular ByteString and Text.
I already have fast instances for Text and ByteString, based on MurmurHash2 [2].
* Two new data types in the containers package, Data.HashMap and Data.HashSet. These would be persistent data structures, initially based on IntMap, and have the same or similar API to Data.Map and Data.Set..
In the future we could also add mutable HashTables, operating in the ST monad, or other hash-based data types (persistent or mutable).
Any comments?
Johan
1. http://research.microsoft.com/~simonpj/papers/containers/containers.pdf 2. http://sites.google.com/site/murmurhash/ _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

On 11/18/10 18:39, Daniel Peebles wrote:
I like this idea. As I mentioned on IRC, I'd call the class Hash rather than Hashable. I'm also with you on the Word return type. It may be less convenient but maybe this will be a tiny step towards the "great Word revolt" (in which all conceptually unsigned things in the prelude and standard libraries actually become unsigned) that I hope will occur sometime in the near future.
"great Word revolt"? Don't do it. Not that way. It's a risky proposition in C, and the Haskell types work the same way, except they're called "Word" instead of "unsigned". For example, on my machine,
(1 - 2) :: Word 18446744073709551615
However, I'm mildly in favor of hash results being Words, because they're not supposed to be used as meaningful numbers anyway. Int is good for numbers that are moderately meaningful. These hashes are more like bit-sequences, where -- if arithmetic is used at all -- Word's modular nature is what we *want*, and its equal treatment of the top bit is what we want (vs. being related to the number's sign). -Isaac

Daniel Peebles
I like this idea. As I mentioned on IRC, I'd call the class Hash rather than Hashable
The crypto-api already uses the class "Hash" for cryptographic hashes. If you are talking about data to be hashed, and not the algorithm to do the hashing, then "Hashable" is both correct and unique.
I'm also with you on the Word return type. It may be less convenient but maybe this will be a tiny step towards the "great Word revolt" (in which all conceptually unsigned things in the prelude and standard libraries actually become unsigned) that I hope will occur sometime in the near future.
Whats next, making sure all bounded things don't flow over/under? Pfft. Data.Word.NonFlowing Cheers, Thomas

On Thu, Nov 18, 2010 at 11:05 PM, Johan Tibell
* A type class for things that can be hashed, Hashable.
class Hashable a where hash :: a -> Word
BTW,
An issue that needs to be resolved is: "Where should this change go?" We have a
couple of options:
a) Bring http://hackage.haskell.org/package/hashable into the platform
b) Put the module into base -- probably a -1 for political/logistical reasons
c) Put it into containers -- probably a -1 because it would force a
containers dependency where there was no containers dependency before.
Any other ideas?
G
--
Gregory Collins

On Fri, Nov 19, 2010 at 10:48 AM, Gregory Collins
On Thu, Nov 18, 2010 at 11:05 PM, Johan Tibell
wrote: * A type class for things that can be hashed, Hashable.
class Hashable a where hash :: a -> Word
BTW,
An issue that needs to be resolved is: "Where should this change go?" We have a couple of options:
a) Bring http://hackage.haskell.org/package/hashable into the platform
b) Put the module into base -- probably a -1 for political/logistical reasons
c) Put it into containers -- probably a -1 because it would force a containers dependency where there was no containers dependency before.
Right. Putting it into containers would force text/bytestring to depend on containers or vice versa, in order to provide instances. Base or a separate package are the only options I think. Johan

On Fri, 2010-11-19 at 10:48 +0100, Gregory Collins wrote:
On Thu, Nov 18, 2010 at 11:05 PM, Johan Tibell
wrote: * A type class for things that can be hashed, Hashable.
class Hashable a where hash :: a -> Word
Shouldn't it be: class Eq a => Hashable a where hash :: a -> Word Or is there useful case where we want to hash something but it cannot be member of Eq?
BTW,
An issue that needs to be resolved is: "Where should this change go?" We have a couple of options:
a) Bring http://hackage.haskell.org/package/hashable into the platform
b) Put the module into base -- probably a -1 for political/logistical reasons
I'd say base - there is Data.HashTable in base so adding function: new' :: Hashable key => IO (HashTable key val) would belong there
c) Put it into containers -- probably a -1 because it would force a containers dependency where there was no containers dependency before.
Any other ideas?
G
Regards

On Sat, Nov 20, 2010 at 3:53 PM, Maciej Piechotka
On Fri, 2010-11-19 at 10:48 +0100, Gregory Collins wrote:
On Thu, Nov 18, 2010 at 11:05 PM, Johan Tibell
wrote: * A type class for things that can be hashed, Hashable.
class Hashable a where hash :: a -> Word
Shouldn't it be:
class Eq a => Hashable a where hash :: a -> Word
Or is there useful case where we want to hash something but it cannot be member of Eq?
I can think of a few. Regardless, constraining the type class in this way doesn't really buy us anything. Johan

On 11/20/10 1:13 PM, Johan Tibell wrote:
On Sat, Nov 20, 2010 at 3:53 PM, Maciej Piechotka
wrote: Shouldn't it be:
class Eq a => Hashable a where hash :: a -> Word
Or is there useful case where we want to hash something but it cannot be member of Eq?
I can think of a few. Regardless, constraining the type class in this way doesn't really buy us anything.
Unless there's a really good reason why a class should require a superclass, it's best to leave them off. Even if one can't think of good examples at the time of designing the class. True inheritance/subsumption hierarchies like Functor/Applicative/Monad are an example of a good reason to have them. Functor/Foldable/Traversable is another. Eq/Ord is another, though some consider it questionable (other than the omission of a PartialOrd class, I'm not sure why). On the gripping hand, Eq/Num and Show/Num are excellent examples of where adding unnecessary constraints just makes things harder for everyone. I see no reason why the ability to hash a value should entail decidable equality checking for those values, nor how 'hash' would allow us to flawlessly reconstruct definitions of (==)/(/=). -- Live well, ~wren

On Sat, Nov 20, 2010 at 02:53:03PM +0000, Maciej Piechotka wrote:
On Fri, 2010-11-19 at 10:48 +0100, Gregory Collins wrote:
On Thu, Nov 18, 2010 at 11:05 PM, Johan Tibell
wrote: An issue that needs to be resolved is: "Where should this change go?" We have a couple of options:
a) Bring http://hackage.haskell.org/package/hashable into the platform
b) Put the module into base -- probably a -1 for political/logistical reasons
I'd say base
I would suggest not putting it into base if it isn't necessary to do so. It's much easier to merge things later (if desired) than to split them apart, and base is already far too big. Thanks Ian

On Sat, Nov 20, 2010 at 06:32:15PM +0000, Ian Lynagh wrote:
I would suggest not putting it into base if it isn't necessary to do so. It's much easier to merge things later (if desired) than to split them apart, and base is already far too big.
In that case it should be a small extra package in the GHC bootlibs, since it will be a prerequisite for both bytestring and containers.

On 11/20/10 1:32 PM, Ian Lynagh wrote:
On Sat, Nov 20, 2010 at 02:53:03PM +0000, Maciej Piechotka wrote:
On Fri, 2010-11-19 at 10:48 +0100, Gregory Collins wrote:
On Thu, Nov 18, 2010 at 11:05 PM, Johan Tibell
wrote: An issue that needs to be resolved is: "Where should this change go?" We have a couple of options:
a) Bring http://hackage.haskell.org/package/hashable into the platform
b) Put the module into base -- probably a -1 for political/logistical reasons
I'd say base
I would suggest not putting it into base if it isn't necessary to do so. It's much easier to merge things later (if desired) than to split them apart, and base is already far too big.
+1. Especially given the recent trend towards breaking things out of base, moving this in is suspect. As folks have been moving towards base-4 we've been able to finally start getting rid of the slew of Cabal flags for figuring out when base does or does not include various libraries. Don't start inventing new ones :) -- Live well, ~wren

Johan Tibell wrote:
* Instances of Hashable for common types, in particular ByteString and Text.
And of course, all of the numeric types, unit, and Bool, Maybe, and Either. More interesting are Maybe, Either, lists, tuples, Data.Map, Data.Set, etc. These are far from trivial; I did some work on it for Python. If you think you can just do something fairly simple and reasonable and hope that it will have the right statistical properties, I promise you that it won't work. Unless someone knows of a better methodology, you need to develop a battery of QuickCheck-like tests to build complexity into data structures in various ways, then use trial and error to come up with an algorithm that gives results within some tolerance for all of the tests while still being very fast.
I already have fast instances for Text and ByteString, based on MurmurHash2 [2].
Ah, I see from that site that they used pretty much the same kinds of techniques. They actually used chi-square, I didn't get that fancy. Theoretically, if the hashes for Maybe and Either are good enough, all the other container hashes could be built out of those. But I doubt that would be very easy. Anyway, the main work is building all the tests, which you would have to do in any case. Regards, Yitz

On 20 November 2010 19:26, Yitzchak Gale
Johan Tibell wrote:
* Instances of Hashable for common types, in particular ByteString and Text.
And of course, all of the numeric types, unit, and Bool, Maybe, and Either.
More interesting are Maybe, Either, lists, tuples, Data.Map, Data.Set, etc. These are far from trivial; I did some work on it for Python. If you think you can just do something fairly simple and reasonable and hope that it will have the right statistical properties, I promise you that it won't work.
Are you referring to the hash function or to the fact that you want structurally different types to map to different hashes? I.e., hash (Data.Binary.encode x) probably would lead to unintended collisions. E.g., isomorphic values might get the same hash, right?

I wrote:
More interesting are Maybe, Either, lists, tuples, Data.Map, Data.Set, etc. These are far from trivial; I did some work on it for Python. If you think you can just do something fairly simple and reasonable and hope that it will have the right statistical properties, I promise you that it won't work.
Thomas Schilling wrote:
Are you referring to the hash function or to the fact that you want structurally different types to map to different hashes?
The former. In Python, where objects of varying types will be inserted into the same hash table, the latter is also important. But why should we care about that in Haskell? If people circumvent the type system by serializing before hashing, I don't think we can help them very much. Or perhaps I am misunderstanding you. Thanks, Yitz

On 25 November 2010 13:10, Yitzchak Gale
I wrote:
More interesting are Maybe, Either, lists, tuples, Data.Map, Data.Set, etc. These are far from trivial; I did some work on it for Python. If you think you can just do something fairly simple and reasonable and hope that it will have the right statistical properties, I promise you that it won't work.
Thomas Schilling wrote:
Are you referring to the hash function or to the fact that you want structurally different types to map to different hashes?
The former. In Python, where objects of varying types will be inserted into the same hash table, the latter is also important. But why should we care about that in Haskell? If people circumvent the type system by serializing before hashing, I don't think we can help them very much. Or perhaps I am misunderstanding you.
Right, murmurhash2 is a very good and well-tested hash function. There is currently a beta version version 3 because a weakness has been found for some cases ("repetitions of 4-byte patterns have a much higher chance of collision than expected"). I think if we base our hash function on Murmur3, we don't need to worry much about the rest. Regarding hashing of ADTs, I agree with you that hash () == hash False should be no problem. The general principles would be: 1. prop_hashEq x y = x == y ==> hash x == hash y and for a data type with constructors C_1 .. C_n 2. hash (C_i x_i1 ... x_iN) = hash i <> hash x_i1 <> ... <> hash x_iN where <> is the hash combine function (probably (.)) Principle (1) may require to use some form of normalisation before applying principle (2), e.g., for Set, differently balanced trees should still produce the same hash. Does this look reasonable? Also, I'm fine with the return of hash being an architecture-specific Word type, but we also need variants with a known bit-width of the resulting hash. For example, a Word32 hash has way more collisions than a Word64 hash, and this is important for some applications. / Thomas -- Push the envelope. Watch it bend.

On Thu, Nov 25, 2010 at 5:15 PM, Thomas Schilling
Regarding hashing of ADTs, I agree with you that hash () == hash False should be no problem. The general principles would be:
1. prop_hashEq x y = x == y ==> hash x == hash y
and for a data type with constructors C_1 .. C_n
2. hash (C_i x_i1 ... x_iN) = hash i <> hash x_i1 <> ... <> hash x_iN
where <> is the hash combine function (probably (.))
Principle (1) may require to use some form of normalisation before applying principle (2), e.g., for Set, differently balanced trees should still produce the same hash.
Does this look reasonable?
(1) is a property that most data structures would require. I don't think it needs to be enforced in the Hashable type class.
Also, I'm fine with the return of hash being an architecture-specific Word type, but we also need variants with a known bit-width of the resulting hash. For example, a Word32 hash has way more collisions than a Word64 hash, and this is important for some applications.
I agree. I don't know if the specific sized hashes should be in the Hashable type class though. Typically you only need these for a few string like types. Johan

I wrote:
More interesting are Maybe, Either, lists, tuples, Data.Map, Data.Set, etc. These are far from trivial; I did some work on it for Python. If you think you can just do something fairly simple and reasonable and hope that it will have the right statistical properties, I promise you that it won't work.
Thomas Schilling wrote:
Right, murmurhash2 is a very good and well-tested hash function.
It's good for strings. So we can use it for String and Data.Text. But how about [Data.Map Data.Text (String, [(Int, Bool)])] or whatever. Do you have a hash function for that? When you have containers that can be used to build structures of arbitrary shape, writing a hash function that works in general is a hard problem. Were you thinking that you could serialize those things and then just use murmurhash on them? Try it, but my guess is it won't work well. Murmurhash is designed for strings, not for those things. It's not a cryptographic hash; it's designed to have good statistical properties for a certain very specific types of input. If you abuse it, you'll likely lose those good properties. The parameters for murmerhash were tweaked using keys that were English words, and very short strings of random bytes. There is no evidence it will work well with other kinds of keys. I'd be very happy to be proven wrong though. Regards, Yitz

On 25 November 2010 17:59, Yitzchak Gale
Were you thinking that you could serialize those things and then just use murmurhash on them?
Pretty much, yes.
Try it, but my guess is it won't work well. Murmurhash is designed for strings, not for those things. It's not a cryptographic hash; it's designed to have good statistical properties for a certain very specific types of input. If you abuse it, you'll likely lose those good properties. The parameters for murmerhash were tweaked using keys that were English words, and very short strings of random bytes. There is no evidence it will work well with other kinds of keys.
Where did you get this from? The criteria listed at http://code.google.com/p/smhasher/wiki/SMHasher all talk about arbitrary bytes. MurmurHash is designed for arbitrary byte strings, not English words. It is not a cryptographic hash, sure, but I can't see any evidence that it only works with text. / Thomas -- Push the envelope. Watch it bend.

I wrote:
Were you thinking that you could serialize those things and then just use murmurhash on them? Try it, but my guess is it won't work well. Murmurhash is designed for strings, not for those things. It's not a cryptographic hash; it's designed to have good statistical properties for a certain very specific types of input. If you abuse it, you'll likely lose those good properties. The parameters for murmerhash were tweaked using keys that were English words, and very short strings of random bytes. There is no evidence it will work well with other kinds of keys.
Thomas Schilling wrote:
Where did you get this from?
From experience developing hashes for container types.
The criteria listed at http://code.google.com/p/smhasher/wiki/SMHasher all talk about arbitrary bytes. MurmurHash is designed for arbitrary byte strings, not English words. It is not a cryptographic hash, sure, but I can't see any evidence that it only works with text.
Never mind their claims, look at their test cases. What they tested ought to work well. What they didn't test may or may not work well, and almost certainly will behave badly at least some of the time. Any fast non-cryptographic hash will by necessarily have many kinds of input that will behave badly. The idea is to test across a large sampling of the kinds of keys that typify the input you expect and tweak the constants in the hash to avoid too many collisions in those cases. As far as I could see, they tested against English words, and against fairly short strings of random bytes. And for "random bytes", the technique they used to generate them is important; changing to a different pseudorandom generator sometimes changes the results significantly. Furthermore, a fast serialization of containers is non-random, so you will be exercising the hash in a specific way that has not been tested. It could be it will be good enough, or it could be you can tweak your serialization or add another thin layer of hashing to make it all work. Thanks, Yitz

On Thu, Nov 25, 2010 at 6:59 PM, Yitzchak Gale
But how about [Data.Map Data.Text (String, [(Int, Bool)])] or whatever. Do you have a hash function for that? When you have containers that can be used to build structures of arbitrary shape, writing a hash function that works in general is a hard problem.
Is this an interesting problem to optimize for? Johan

On Thu, Nov 25, 2010 at 2:23 PM, Johan Tibell
Is this an interesting problem to optimize for?
I think so. The purpose of having the Hashable typeclass is so that the universe of typed values that we can hash is open. It thus behooves us to figure out how to do that safely. For instance, someone using the Hashable class should be able to figure out how to hash this: data HttpUrl = HttpUrl { urlScheme :: ByteString, urlHost :: ByteString, urlPort :: Int, urlPath :: ByteString, urlQuery :: Maybe ByteString } Using the Hashable class, I can see how to hash the individual elements of this, but not how to safely glue them together into a hash for the whole value. Over in the Java world, where they depend to an arguably ridiculous degree on hashing, we often see horror stories of some hash function having bad dispersal properties (or worse, equality and hashing not matching) and causing nasty knock-on performance effects due to too many values hashing to the same few numbers in the range. I notice that the Hashable class doesn't have a way to supply a seed. If it did, we could possibly chain the result of one hash as the seed of the next, allowing us to construct a complete hash in an obvious way (although introducing a possibly undesirable data dependency, making it difficult or impossible to hash a large structure in parallel).

On Fri, Nov 26, 2010 at 1:35 PM, Bryan O'Sullivan
On Thu, Nov 25, 2010 at 2:23 PM, Johan Tibell
wrote: Is this an interesting problem to optimize for?
I think so. The purpose of having the Hashable typeclass is so that the universe of typed values that we can hash is open. It thus behooves us to figure out how to do that safely. For instance, someone using the Hashable class should be able to figure out how to hash this: data HttpUrl = HttpUrl { urlScheme :: ByteString, urlHost :: ByteString, urlPort :: Int, urlPath :: ByteString, urlQuery :: Maybe ByteString } Using the Hashable class, I can see how to hash the individual elements of this, but not how to safely glue them together into a hash for the whole value.
I hope I'm not just stating the obvious - My reflections on this thread, as someone who might have to write an instance of 'Hashable' someday: 1. The design discussed in this thread puts the burden of choosing the hashing algorithm onto the instance writer, every single time. This is probably the correct choice, as it makes migration to a new algorithm incremental and possible. And it means that data for which murmurhash is inappropriate don't use it. 2. The above issue with the HttpUrl issue becomes much easier for me, as an implementor, if solid instances for tuples are given. This doesn't solve the underlying technical issue, but it does mean that we only need to solve it once, and that library and application implementors don't need to care about the solution (unless it doesn't work well for some case). My instances become mechanical translations into (tagged) tuples. So I think I like the proposed interface, assuming that a good instance for tuples is possible. Antoine

On Fri, Nov 26, 2010 at 8:35 PM, Bryan O'Sullivan
I think so. The purpose of having the Hashable typeclass is so that the universe of typed values that we can hash is open.
I don't disagree that being able to hash everything is nice, but I don't think it's crucial. My main interest in having a Hashable type class is so we can have containers that can be keyed by hashable things, for the types where this make sense (e.g. string like types where comparison is expensive.) If all that a Hashable type class would give me is the ability to store ByteStrings and Texts in a HashMap, that alone would be enough motivation for having one in my opinion.
I notice that the Hashable class doesn't have a way to supply a seed. If it did, we could possibly chain the result of one hash as the seed of the next, allowing us to construct a complete hash in an obvious way (although introducing a possibly undesirable data dependency, making it difficult or impossible to hash a large structure in parallel).
Having a hashWith method that takes a seed sounds reasonable. I guess a hash algorithm that doesn't make use of a seed (are there any?) could just throw away the seed in the implementation of hashWith. Johan

Johan Tibell wrote:
I don't disagree that being able to hash everything is nice, but I don't think it's crucial. My main interest in having a Hashable type class is so we can have containers that can be keyed by hashable things, for the types where this make sense (e.g. string like types where comparison is expensive.) If all that a Hashable type class would give me is the ability to store ByteStrings and Texts in a HashMap, that alone would be enough motivation for having one in my opinion.
Yes I agree. But on the other hand, it would be a shame to provide good built-in hashes only for those, even if we leave the type class open. There is a lot of space between "just strings" and "everything". Thanks, Yitz

On 26 Nov 2010, at 23:14, Johan Tibell wrote:
On Fri, Nov 26, 2010 at 8:35 PM, Bryan O'Sullivan
wrote: I think so. The purpose of having the Hashable typeclass is so that the universe of typed values that we can hash is open.
I don't disagree that being able to hash everything is nice, but I don't think it's crucial. My main interest in having a Hashable type class is so we can have containers that can be keyed by hashable things, for the types where this make sense (e.g. string like types where comparison is expensive.) If all that a Hashable type class would give me is the ability to store ByteStrings and Texts in a HashMap, that alone would be enough motivation for having one in my opinion.
I am probably showing my ignorance here, but I am puzzled as to how a hash value of type Word32 helps anyone. When I learnt CS a long time ago (and I even implemented a Hashable class for Haskell in the early 1990s), a hash value was always used as the index into a constant-time lookup table. I certainly don't want tables of size 2^32 in any programs I write, even if I do have 12G of memory in my machine. Way back when, my Hashable method needed to know what size of table you wanted in order to generate the hash. Of course, you could resize tables upwards if required too. So I'm imagining you have a different idea in mind for how to store a HashMap? Will it be a rebalancing tree structure, using the ordering on Hashes as keys? This changes the computational complexity of both storage and lookup, and I begin to wonder whether you will gain any performance ultimately. Regards, Malcolm

Hashes are almost always word-sized these days. If you want to use
them as an index into a fixed-size table, you mod or otherwise
truncate the hash.
If instead you want to use the hash value to store into a patricia
trie like IntMap, you would like to use the entire word to avoid
collisions. The reason the CPU likes word-sized hash codes is that you
can compare them using register instructions. On 64-bit machines we
should try to use Word64!
G
On Thu, Dec 2, 2010 at 9:40 AM, Malcolm Wallace
I am probably showing my ignorance here, but I am puzzled as to how a hash value of type Word32 helps anyone. When I learnt CS a long time ago (and I even implemented a Hashable class for Haskell in the early 1990s), a hash value was always used as the index into a constant-time lookup table. I certainly don't want tables of size 2^32 in any programs I write, even if I do have 12G of memory in my machine.
Way back when, my Hashable method needed to know what size of table you wanted in order to generate the hash. Of course, you could resize tables upwards if required too.
So I'm imagining you have a different idea in mind for how to store a HashMap? Will it be a rebalancing tree structure, using the ordering on Hashes as keys? This changes the computational complexity of both storage and lookup, and I begin to wonder whether you will gain any performance ultimately.
Regards, Malcolm
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
--
Gregory Collins

On Thu, 2010-11-25 at 16:15 +0000, Thomas Schilling wrote:
and for a data type with constructors C_1 .. C_n
2. hash (C_i x_i1 ... x_iN) = hash i <> hash x_i1 <> ... <> hash x_iN
where <> is the hash combine function (probably (.))
(.) seems to fail type-checking.
Principle (1) may require to use some form of normalisation before applying principle (2), e.g., for Set, differently balanced trees should still produce the same hash.
Does this look reasonable?
For autogeneration - yes. However it may happen that for optimization & special cases it is not correct. For example: data ByteString = BS (ForeignPtr Word8) Int Int I would expect that: instance Hashable ByteString where hash = foldr' (\w h -> h <> hash w) (hash []) rather then instance Hashable ByteString where hash (BS f o l) = hash f <> hash f <> hash o <> hash l I'm not sure what benefits have second principle - except forcing 'normalisation' which is not possible - either ForeignPtr is not hashable and ByteString is not hashable as it requires IO or ForeignPtr is hashable by pointer [not contents] so image of hash on bytestings depends only on length alone if normalisation is applied (as ptr & offset may vary). Regards

On 26 November 2010 17:18, Maciej Piechotka
On Thu, 2010-11-25 at 16:15 +0000, Thomas Schilling wrote:
and for a data type with constructors C_1 .. C_n
2. hash (C_i x_i1 ... x_iN) = hash i <> hash x_i1 <> ... <> hash x_iN
where <> is the hash combine function (probably (.))
(.) seems to fail type-checking.
Right, sorry, I was actually having a different type in mind. Something like: newtype HashBuilder = HashBuilder { unHB :: Word -> Word } class Hashable a where addHash :: a -> HashBuilder The argument would be the seed. We'd then have: hash :: Hashable a => a -> Word hash a = finaliseHash $ unHB (addHash a) seed where seed = 0xdeadbeef finaliseHash x = ... -- some final mixing operation As Bryan mentioned, though, this does not allow building hashes in parallel. It would also hard-code the hash function. -- Push the envelope. Watch it bend.

On Fri, 2010-11-26 at 23:05 +0000, Thomas Schilling wrote:
On 26 November 2010 17:18, Maciej Piechotka
wrote: On Thu, 2010-11-25 at 16:15 +0000, Thomas Schilling wrote:
and for a data type with constructors C_1 .. C_n
2. hash (C_i x_i1 ... x_iN) = hash i <> hash x_i1 <> ... <> hash x_iN
where <> is the hash combine function (probably (.))
(.) seems to fail type-checking.
Right, sorry, I was actually having a different type in mind. Something like:
newtype HashBuilder = HashBuilder { unHB :: Word -> Word } class Hashable a where addHash :: a -> HashBuilder
The argument would be the seed. We'd then have:
hash :: Hashable a => a -> Word hash a = finaliseHash $ unHB (addHash a) seed where seed = 0xdeadbeef finaliseHash x = ... -- some final mixing operation
As Bryan mentioned, though, this does not allow building hashes in parallel. It would also hard-code the hash function.
Possibly the autogenerated function would be: hash (C_n x_1 x_2 ... x_n) = hash "C_n" <> hash x_1 <> hash x_2 <> ... <> hash x_n where <> is function Word -> Word -> Word (like xor). Regards PS. As with list there is problem with hash "[]" and hash ":" possibly: instance Hashable a => Hashable [a] where hash = foldl' (<>) 0xdeadbeef . map hash

This is what I had in mind: https://gist.github.com/717475
On 27 November 2010 00:28, Maciej Piechotka
On Fri, 2010-11-26 at 23:05 +0000, Thomas Schilling wrote:
On 26 November 2010 17:18, Maciej Piechotka
wrote: On Thu, 2010-11-25 at 16:15 +0000, Thomas Schilling wrote:
and for a data type with constructors C_1 .. C_n
2. hash (C_i x_i1 ... x_iN) = hash i <> hash x_i1 <> ... <> hash x_iN
where <> is the hash combine function (probably (.))
(.) seems to fail type-checking.
Right, sorry, I was actually having a different type in mind. Something like:
newtype HashBuilder = HashBuilder { unHB :: Word -> Word } class Hashable a where addHash :: a -> HashBuilder
The argument would be the seed. We'd then have:
hash :: Hashable a => a -> Word hash a = finaliseHash $ unHB (addHash a) seed where seed = 0xdeadbeef finaliseHash x = ... -- some final mixing operation
As Bryan mentioned, though, this does not allow building hashes in parallel. It would also hard-code the hash function.
Possibly the autogenerated function would be:
hash (C_n x_1 x_2 ... x_n) = hash "C_n" <> hash x_1 <> hash x_2 <> ... <> hash x_n
where <> is function Word -> Word -> Word (like xor).
Regards
PS. As with list there is problem with hash "[]" and hash ":" possibly:
instance Hashable a => Hashable [a] where hash = foldl' (<>) 0xdeadbeef . map hash
-- Push the envelope. Watch it bend.

On Fri, Nov 26, 2010 at 5:59 PM, Thomas Schilling
This is what I had in mind: https://gist.github.com/717475
Cute. Of course (<>) suggests that you use a Monoid instance instead and drop the special operator :-)
participants (14)
-
Antoine Latter
-
Bryan O'Sullivan
-
Daniel Peebles
-
Gregory Collins
-
Ian Lynagh
-
Isaac Dupree
-
Johan Tibell
-
Maciej Piechotka
-
Malcolm Wallace
-
Ross Paterson
-
Thomas DuBuisson
-
Thomas Schilling
-
wren ng thornton
-
Yitzchak Gale