Proposal: Add the unordered-containers package and the hashable package to the Haskell Platform

Johan Tibell

19 Mar 2013 19 Mar '13

3:01 p.m.

Hi all, As requested by Mark, here's a proposal to include unordered-containers and hashable in the platform: http://trac.haskell.org/haskell-platform/wiki/Proposals/unordered-containers The timeline is pretty short, so lets see if we can agree on this before Monday March 25th, 2013. -- Johan

Attachments:

attachment.html (text/html — 506 bytes)

Show replies by date

Gregory Collins

19 Mar 19 Mar

3:15 p.m.

Enthusiastic +1. An even more enthusiastic +1 (+10?) if some of the ongoing performance issues caused by 1.2 could get fixed. :) On Tue, Mar 19, 2013 at 4:01 PM, Johan Tibell wrote:

...

Hi all,

As requested by Mark, here's a proposal to include unordered-containers and hashable in the platform:

http://trac.haskell.org/haskell-platform/wiki/Proposals/unordered-containers

The timeline is pretty short, so lets see if we can agree on this before Monday March 25th, 2013.

-- Johan

_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

-- Gregory Collins

Milan Straka

3:30 p.m.

New subject: Proposal: Add the unordered-containers package and the hashable package to the Haskell Platform

Hi all,

...

-----Original message----- From: Johan Tibell Sent: 19 Mar 2013, 08:01

Hi all,

As requested by Mark, here's a proposal to include unordered-containers and hashable in the platform:

http://trac.haskell.org/haskell-platform/wiki/Proposals/unordered-containers

The timeline is pretty short, so lets see if we can agree on this before Monday March 25th, 2013.

+1 here :) Cheers, Milan

Felipe Almeida Lessa

3:41 p.m.

+1, of course =). On Tue, Mar 19, 2013 at 12:30 PM, Milan Straka wrote:

...

Hi all,

...
-----Original message----- From: Johan Tibell Sent: 19 Mar 2013, 08:01

Hi all,

As requested by Mark, here's a proposal to include unordered-containers and hashable in the platform:

http://trac.haskell.org/haskell-platform/wiki/Proposals/unordered-containers

The timeline is pretty short, so lets see if we can agree on this before Monday March 25th, 2013.

+1 here :)

Cheers, Milan

_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

-- Felipe.

Thomas Schilling

3:41 p.m.

On 19 March 2013 16:01, Johan Tibell wrote:

...

http://trac.haskell.org/haskell-platform/wiki/Proposals/unordered-containers

The links to the repos are wrong. It should be "tibbe" instead of "tibbel". I am +1 to both. Bryan's recent change to change "hashable" to use SipHash is certainly the right default. There were some complaints about performance for use cases where security is not an issue. What are the options for users that wish to use a different hash function? According to the paper, SipHash is about 2x slower than CityHash.

Johan Tibell

3:51 p.m.

Hi Thomas, On Tue, Mar 19, 2013 at 8:41 AM, Thomas Schilling wrote:

...

On 19 March 2013 16:01, Johan Tibell wrote:

...
http://trac.haskell.org/haskell-platform/wiki/Proposals/unordered-containers

The links to the repos are wrong. It should be "tibbe" instead of "tibbel".

Fixed.

...

Bryan's recent change to change "hashable" to use SipHash is certainly the right default. There were some complaints about performance for use cases where security is not an issue. What are the options for users that wish to use a different hash function? According to the paper, SipHash is about 2x slower than CityHash.

2x is *a lot*. 2x is about the performance difference between Map and HashMap. Since the raison d'etre for HashMap is that it's faster than Map, if we'd see a 2x slowdown in HashMap there would be little reason to use it. For example, 'delete' for HashMap ByteString got almost 2x slower with hashable-1.2. Since 'delete' does more than just hashing, that means that SipHash is quite a bit slower than the current (insecure) hash function. Another example: with GHC 7.6.2 HashMap String is almost unusable slow (5x slower than before). This is likely due to a GHC bug, but it's something we need to investigate. At the moment I don't encourage people to upgrade to hashable-1.2.0.5. The right way to go is probably to make this a user decision. Many applications (e.g. data processing) has no need for the security guarantee so paying for it makes little sense. Cheers, Johan

Thomas Schilling

4:19 p.m.

Oh, I just realised that this proposal is to include the older version of hashable. In principle, I'm not against that, but I do wonder what the upgrade path is. I don't think the performance problems can be fixed in general -- that's just the price of security. So it becomes critical what the upgrade path looks like. Do users get a slowdown of 2x by default and then have to manually make it faster again if something is not security sensitive? Do users have to explicitly opt in for security (a bad default, IMO)? Do we have any idea how that switch may affect the API? On 19 March 2013 16:51, Johan Tibell wrote:

...

Hi Thomas,

On Tue, Mar 19, 2013 at 8:41 AM, Thomas Schilling wrote:

...
On 19 March 2013 16:01, Johan Tibell wrote:

...
http://trac.haskell.org/haskell-platform/wiki/Proposals/unordered-containers

The links to the repos are wrong. It should be "tibbe" instead of "tibbel".

Fixed.

...
Bryan's recent change to change "hashable" to use SipHash is certainly the right default. There were some complaints about performance for use cases where security is not an issue. What are the options for users that wish to use a different hash function? According to the paper, SipHash is about 2x slower than CityHash.

2x is *a lot*. 2x is about the performance difference between Map and HashMap. Since the raison d'etre for HashMap is that it's faster than Map, if we'd see a 2x slowdown in HashMap there would be little reason to use it.

For example, 'delete' for HashMap ByteString got almost 2x slower with hashable-1.2. Since 'delete' does more than just hashing, that means that SipHash is quite a bit slower than the current (insecure) hash function. Another example: with GHC 7.6.2 HashMap String is almost unusable slow (5x slower than before). This is likely due to a GHC bug, but it's something we need to investigate. At the moment I don't encourage people to upgrade to hashable-1.2.0.5.

The right way to go is probably to make this a user decision. Many applications (e.g. data processing) has no need for the security guarantee so paying for it makes little sense.

Cheers, Johan

Johan Tibell

4:43 p.m.

On Tue, Mar 19, 2013 at 9:19 AM, Thomas Schilling wrote:

...

Oh, I just realised that this proposal is to include the older version of hashable. In principle, I'm not against that, but I do wonder what the upgrade path is. I don't think the performance problems can be fixed in general -- that's just the price of security. So it becomes critical what the upgrade path looks like. Do users get a slowdown of 2x by default and then have to manually make it faster again if something is not security sensitive? Do users have to explicitly opt in for security (a bad default, IMO)? Do we have any idea how that switch may affect the API?

...

From an API standpoints, users of unordered-containers will be unaffected by any changes to hashable and people who write Hashable instances are unlikely to be affected either, as long as they write their instances in terms of hashWithSalt.

...

From a performance standpoint, it depends on what we decide to make the default behavior of hashable be and for what types (e.g. we could use SIpHash for string types but a simple hash for Int-like types). This is the issue that's not already settled and the reason I'm holding hashable back to 1.1 for the platform. I'm leaning towards fast by-default as hashing is just one of many security issues web frameworks already consciously have to deal with. It's also an issue which have many alternative fixes, which don't require a stronger hash function.

-- Johan

Gábor Lehel

20 Mar 20 Mar

10:02 a.m.

Compatibility issues aside, is there any reason newtypes aren't a good solution here? You could get away with just one: {-# LANGUAGE FlexibleInstances #-} newtype FasterInsecureHashing a = FIH { unFIH :: a } instance Hashable Int where hashWithSalt = ...stuff... instance Hashable (FasterInsecureHashing Int) where hash = unFIH instance Hashable ByteString where hashWithSalt = ...SipHash... instance Hashable (FasterInsecureHashing ByteString) where hashWithSalt = ...CityHash... On Tue, Mar 19, 2013 at 5:43 PM, Johan Tibell wrote:

...

On Tue, Mar 19, 2013 at 9:19 AM, Thomas Schilling
...
wrote:

...
Oh, I just realised that this proposal is to include the older version of hashable. In principle, I'm not against that, but I do wonder what the upgrade path is. I don't think the performance problems can be fixed in general -- that's just the price of security. So it becomes critical what the upgrade path looks like. Do users get a slowdown of 2x by default and then have to manually make it faster again if something is not security sensitive? Do users have to explicitly opt in for security (a bad default, IMO)? Do we have any idea how that switch may affect the API?

From an API standpoints, users of unordered-containers will be unaffected by any changes to hashable and people who write Hashable instances are unlikely to be affected either, as long as they write their instances in terms of hashWithSalt.

From a performance standpoint, it depends on what we decide to make the default behavior of hashable be and for what types (e.g. we could use SIpHash for string types but a simple hash for Int-like types). This is the issue that's not already settled and the reason I'm holding hashable back to 1.1 for the platform. I'm leaning towards fast by-default as hashing is just one of many security issues web frameworks already consciously have to deal with. It's also an issue which have many alternative fixes, which don't require a stronger hash function.

-- Johan

-- Your ship was destroyed in a monadic eruption.

Herbert Valerio Riedel

12:25 p.m.

Gábor Lehel writes:

...

Compatibility issues aside, is there any reason newtypes aren't a good solution here? You could get away with just one:

this may be a bit off-topic, but I've been wondering for some time now, how to compose newtype-based typeclass instances? for instance, now we have a special newtype for 'Int's,

...

instance Hashable (FasterInsecureHashing Int) where hash = unFIH

then for some reson we have a another package, which uses newtypes to provide alternative instances for newtypes, let's say the binary package starts defining a 'Binary' newtype-wrapped instance for serializing to PDP-byteordering, .i.e. instance Binary (PdpSerialization Int) where put i = ... get = ... How do would I combine those two newtypes, if I wanted to have a nested data-structure such as IntMap Int (Int,[(Int,Int)]) hashed with the FasterInsecureHashing variant, as well as serialized with the PdpSerialization instances? cheers, hvr

Gábor Lehel

2:37 p.m.

On Wed, Mar 20, 2013 at 1:25 PM, Herbert Valerio Riedel wrote:

...

Gábor Lehel writes:

...
Compatibility issues aside, is there any reason newtypes aren't a good solution here? You could get away with just one:

this may be a bit off-topic, but I've been wondering for some time now, how to compose newtype-based typeclass instances?

for instance, now we have a special newtype for 'Int's,

...
instance Hashable (FasterInsecureHashing Int) where hash = unFIH

then for some reson we have a another package, which uses newtypes to provide alternative instances for newtypes, let's say the binary package starts defining a 'Binary' newtype-wrapped instance for serializing to PDP-byteordering, .i.e.

instance Binary (PdpSerialization Int) where put i = ... get = ...

How do would I combine those two newtypes, if I wanted to have a nested data-structure such as

IntMap Int (Int,[(Int,Int)])

hashed with the FasterInsecureHashing variant, as well as serialized with the PdpSerialization instances?

cheers, hvr

instance Hashable a => Hashable (PdpSerialization a) where hash = hash . unPDP hashWithSalt s = hashWithSalt s . unPDP instance Binary a => Binary (FasterInsecureHashing a) where put = put . unFIH get = fmap FIH get Something like that? For classes orthogonal to their intended purpose, instances for the newtypes could just forward to the base impl (of course raising the usual dependency versus orphan instance issues, but that's orthogonal ;). And then just nest the newtypes as you would expect. Alternately, if the libraries don't provide these instances, and you don't want orphans, you could use something like this solution in your own code: {-# LANGUAGE FlexibleContexts #-} newtype FIHPDP a = FIHPDP { unFIHPDP :: a } instance Hashable (FasterInsecureHashing a) => Hashable (FIHPDP a) where hashWithSalt s = hashWithSalt s . FasterInsecureHashing . unFIHPDP instance Binary (PdpSerialization a) => Binary (FIHPDP a) where put = put . PdpSerialization . unFIHPDP get = fmap (FIHPDP . unPDP) get FWIW it's also possible to avoid FlexibleInstances and FlexibleContexts: class Hashable a where hash :: a -> Int hashWithSalt :: Int -> a -> Int -- only for impl purposes class FasterInsecureHashable a where fasterInsecureHash :: a -> Int fasterInsecureHashWithSalt :: Int -> a -> Int newtype FasterInsecureHashing a = FIH { unFIH :: a } instance FasterInsecureHashable a => Hashable (FasterInsecureHashing a) where hash = fasterInsecureHash hashWithSalt = fasterInsecureHashWithSalt instance Hashable ByteString where hashWithSalt = ...SipHash... instance FasterInsecureHashable ByteString where fasterInsecureHashWithSalt = ...CityHash... instance FasterInsecureHashable a => Hashable FIHPDP a where ... And now let's stop side-tracking. -- Your ship was destroyed in a monadic eruption.

Jan-Willem Maessen

1:42 p.m.

On Wed, Mar 20, 2013 at 6:02 AM, Gábor Lehel wrote:

...

Compatibility issues aside, is there any reason newtypes aren't a good solution here?

Well, the tricky bit is that there are really two distinct uses of hashing in the field in general and under discussion here in particular. The first (which is the one that's important for unordered-containers and almost all practical uses of hashing *within* a program) is to provide fast keys for hash tables, where the hash function is backed by equality checks and the like and is thus a question of performance rather than correctness. For this a reasonably good, but not necessarily cryptographically secure, hashing method is more than sufficient. The second is to summarize large quantities of data compactly in a way that can't efficiently be forged. This generally requires a lot more bits than an Int will provide. My feeling is that Hashable is pretty good for the first purpose, but not actually great for the second one. HashWithSalt is the right primitive for Hashable primarily because it avoids a ton of problems with hash mixing when you combine hashes for sub-pieces of your data, and because good hash algorithms of both kinds are almost always specified over streams of bytes. Indeed, cryptographic hashing arguably could exist comfortably just on serialized byte sequences, though I personally find that view a little bit limiting. So I'd favor a view where Hashable is explicitly specified to be used for the first purpose, and other libraries provide crypto-quality hashing. There's an unfortunate tendency on this and related threads to pretend that we can get good cryptographic hashing out of Hashable – whereas it just slows hashing down to the point where UnorderedSet no longer outperforms Set while providing no actual security. -Jan-Willem Maessen

Johan Tibell

3:29 p.m.

On Wed, Mar 20, 2013 at 6:42 AM, Jan-Willem Maessen wrote:

...

On Wed, Mar 20, 2013 at 6:02 AM, Gábor Lehel wrote:

...
Compatibility issues aside, is there any reason newtypes aren't a good solution here?

Well, the tricky bit is that there are really two distinct uses of hashing in the field in general and under discussion here in particular. The first (which is the one that's important for unordered-containers and almost all practical uses of hashing *within* a program) is to provide fast keys for hash tables, where the hash function is backed by equality checks and the like and is thus a question of performance rather than correctness. For this a reasonably good, but not necessarily cryptographically secure, hashing method is more than sufficient.

hashable is definitely defined for this use case. Unfortunately there's a class of DsS attacks (hash flooding) that work like this: 1. The application (e.g. a webserver) stores some user input in a hash table (e.g. the HTTP headers received in the request). 2. The application uses a weak hash function (i.e. a non-cryptographic hash function) in the hash table. 3. The user feeds the application a set of keys that all collide, making the hash table behave as O(n) or even O(n^2), thereby DoS:ing the server. SipHash is one way to address these kinds of attacks. There are other means as well. For example, many general DoS protection mechanisms (timeouts, IP banning, etc) also work on these kind of attacks. -- Johan

Jason Dusek

4:22 p.m.

2013/3/19 Thomas Schilling :

...

Oh, I just realised that this proposal is to include the older version of hashable. In principle, I'm not against that, but I do wonder what the upgrade path is. I don't think the performance problems can be fixed in general -- that's just the price of security. So it becomes critical what the upgrade path looks like. Do users get a slowdown of 2x by default and then have to manually make it faster again if something is not security sensitive? Do users have to explicitly opt in for security (a bad default, IMO)? Do we have any idea how that switch may affect the API?

It seems reasonable for the secure algorithm to be handled with something explicit -- a newtype, a `secureHash' function -- so that a developer has real confidence that it's being used. What if the default changes back? -- Jason Dusek pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B

Thomas Schilling

3:47 p.m.

OK. I think a reasonable approach would be the following: - add hashable-1.1 (ie., without SipHash) to the platform - later create a new release of hashable, that is fast by default and provides SipHash functionality via a newtype wrapper (or it could be a new package that defines the newtype and all the standard instances) What's important is that the default behaviour (fast vs. secure) won't change in a later version of the platform. We should also make it clear that hashable (even with siphash) does not aim to implement secure hashing. I.e., no replacement for proper HMACs, SHA1, etc. On 19 March 2013 16:51, Johan Tibell wrote:

...

Hi Thomas,

On Tue, Mar 19, 2013 at 8:41 AM, Thomas Schilling wrote:

...
On 19 March 2013 16:01, Johan Tibell wrote:

...
http://trac.haskell.org/haskell-platform/wiki/Proposals/unordered-containers

The links to the repos are wrong. It should be "tibbe" instead of "tibbel".

Fixed.

...
Bryan's recent change to change "hashable" to use SipHash is certainly the right default. There were some complaints about performance for use cases where security is not an issue. What are the options for users that wish to use a different hash function? According to the paper, SipHash is about 2x slower than CityHash.

2x is *a lot*. 2x is about the performance difference between Map and HashMap. Since the raison d'etre for HashMap is that it's faster than Map, if we'd see a 2x slowdown in HashMap there would be little reason to use it.

For example, 'delete' for HashMap ByteString got almost 2x slower with hashable-1.2. Since 'delete' does more than just hashing, that means that SipHash is quite a bit slower than the current (insecure) hash function. Another example: with GHC 7.6.2 HashMap String is almost unusable slow (5x slower than before). This is likely due to a GHC bug, but it's something we need to investigate. At the moment I don't encourage people to upgrade to hashable-1.2.0.5.

The right way to go is probably to make this a user decision. Many applications (e.g. data processing) has no need for the security guarantee so paying for it makes little sense.

Cheers, Johan

Thomas Schilling

5:02 p.m.

To make this more precise the next version of hashable (say, 1.3) would include this: newtype SipHashed a = SipHashed a class SipHashable a where sipHashWithSalt :: Int -> a -> Int instance SipHashable a => Hashable (SipHashed a) where hashWithSalt salt (SipHashed x) = sipHashWithSalt salt x Then all Hashable instances are taken from hashable-1.1. All Hashable instances from hashable-1.2 are renamed to become instances of SipHashable. Alternatively, hashable-1.2 could be renamed to hashable-sip-1.0 or so. Note that I do not propose to make this change now. We include hashable-1.1 now, and the above will be the upgrade path to include secure hashing. On 20 March 2013 16:47, Thomas Schilling wrote:

...

OK. I think a reasonable approach would be the following:

- add hashable-1.1 (ie., without SipHash) to the platform - later create a new release of hashable, that is fast by default and provides SipHash functionality via a newtype wrapper (or it could be a new package that defines the newtype and all the standard instances)

What's important is that the default behaviour (fast vs. secure) won't change in a later version of the platform.

We should also make it clear that hashable (even with siphash) does not aim to implement secure hashing. I.e., no replacement for proper HMACs, SHA1, etc.

On 19 March 2013 16:51, Johan Tibell wrote:

...
Hi Thomas,

On Tue, Mar 19, 2013 at 8:41 AM, Thomas Schilling wrote:

...
On 19 March 2013 16:01, Johan Tibell wrote:

...
http://trac.haskell.org/haskell-platform/wiki/Proposals/unordered-containers

The links to the repos are wrong. It should be "tibbe" instead of "tibbel".

Fixed.

...
Bryan's recent change to change "hashable" to use SipHash is certainly the right default. There were some complaints about performance for use cases where security is not an issue. What are the options for users that wish to use a different hash function? According to the paper, SipHash is about 2x slower than CityHash.

2x is *a lot*. 2x is about the performance difference between Map and HashMap. Since the raison d'etre for HashMap is that it's faster than Map, if we'd see a 2x slowdown in HashMap there would be little reason to use it.

For example, 'delete' for HashMap ByteString got almost 2x slower with hashable-1.2. Since 'delete' does more than just hashing, that means that SipHash is quite a bit slower than the current (insecure) hash function. Another example: with GHC 7.6.2 HashMap String is almost unusable slow (5x slower than before). This is likely due to a GHC bug, but it's something we need to investigate. At the moment I don't encourage people to upgrade to hashable-1.2.0.5.

The right way to go is probably to make this a user decision. Many applications (e.g. data processing) has no need for the security guarantee so paying for it makes little sense.

Cheers, Johan

Felipe Almeida Lessa

5:08 p.m.

Although I'd prefer a secure-by-default implementation, this is a good compromise. Then unordered-containers' docs should reference SipHashed in a prominent. Cheers, On Wed, Mar 20, 2013 at 2:02 PM, Thomas Schilling wrote:

...

To make this more precise the next version of hashable (say, 1.3) would include this:

newtype SipHashed a = SipHashed a

class SipHashable a where sipHashWithSalt :: Int -> a -> Int

instance SipHashable a => Hashable (SipHashed a) where hashWithSalt salt (SipHashed x) = sipHashWithSalt salt x

Then all Hashable instances are taken from hashable-1.1. All Hashable instances from hashable-1.2 are renamed to become instances of SipHashable.

Alternatively, hashable-1.2 could be renamed to hashable-sip-1.0 or so.

Note that I do not propose to make this change now. We include hashable-1.1 now, and the above will be the upgrade path to include secure hashing.

On 20 March 2013 16:47, Thomas Schilling wrote:

...
OK. I think a reasonable approach would be the following:

- add hashable-1.1 (ie., without SipHash) to the platform - later create a new release of hashable, that is fast by default and provides SipHash functionality via a newtype wrapper (or it could be a new package that defines the newtype and all the standard instances)

What's important is that the default behaviour (fast vs. secure) won't change in a later version of the platform.

We should also make it clear that hashable (even with siphash) does not aim to implement secure hashing. I.e., no replacement for proper HMACs, SHA1, etc.

On 19 March 2013 16:51, Johan Tibell wrote:

...
Hi Thomas,

On Tue, Mar 19, 2013 at 8:41 AM, Thomas Schilling wrote:

...
On 19 March 2013 16:01, Johan Tibell wrote:

...
http://trac.haskell.org/haskell-platform/wiki/Proposals/unordered-containers

The links to the repos are wrong. It should be "tibbe" instead of "tibbel".

Fixed.

...
Bryan's recent change to change "hashable" to use SipHash is certainly the right default. There were some complaints about performance for use cases where security is not an issue. What are the options for users that wish to use a different hash function? According to the paper, SipHash is about 2x slower than CityHash.

2x is *a lot*. 2x is about the performance difference between Map and HashMap. Since the raison d'etre for HashMap is that it's faster than Map, if we'd see a 2x slowdown in HashMap there would be little reason to use it.

For example, 'delete' for HashMap ByteString got almost 2x slower with hashable-1.2. Since 'delete' does more than just hashing, that means that SipHash is quite a bit slower than the current (insecure) hash function. Another example: with GHC 7.6.2 HashMap String is almost unusable slow (5x slower than before). This is likely due to a GHC bug, but it's something we need to investigate. At the moment I don't encourage people to upgrade to hashable-1.2.0.5.

The right way to go is probably to make this a user decision. Many applications (e.g. data processing) has no need for the security guarantee so paying for it makes little sense.

Cheers, Johan

_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

-- Felipe.

Isaac Dupree

21 Mar 21 Mar

9:26 p.m.

Regarding hashWithSalt determinism: hashable 1.1: "The general contract of hash is: * This integer need not remain consistent from one execution of an application to another execution of the same application. [...] The contract for hashWithSalt is the same as for hash, with the additional requirement that any instance that defines hashWithSalt must make use of the salt in its implementation." [1] hashable 1.2: "The general contract of hashWithSalt is: * If a value is hashed using the same salt during distinct runs of an application, the result must remain the same. (This is necessary to make it possible to store hashes on persistent media.) [...]" Which contract do we want? The size of Int varies across platforms, so it is difficult to make an instance of these Hashable classes that returns the same hash on e.g. x86 and x86_64. [3] (I imagine it can be done by making sure the high bits of the returned Int are always 0 or sign-extended... depending on how unordered-containers interprets hashes [if unordered-containers is the relevant hash consumer]. This seems fragile.) Thankfully, if Data.Map is only twice as slow as Data.HashMap, I wouldn't feel too bad about using Data.Map for its determinism and good worst-case asymptotics. The type of hashes could be Word32 or Word64, or if it's better for performance+security and we don't care about determinism, Word. Thoughts? -Isaac [1] http://hackage.haskell.org/packages/archive/hashable/1.1.2.5/doc/html/Data-H... [2] http://hackage.haskell.org/packages/archive/hashable/1.2.0.4/doc/html/Data-H... [3] C++11's std::hash interface has the same issue, except that at least it returns an *unsigned* inconsistent-width type, size_t. I know this because I'm making a cross-platform-deterministic C++11 program... I think that if I use hash-tables there, I'll probably have to make a local copy of a hash-table implementation. Or audit my code to ensure that it never enumerates a hash-table in a way that is affected by ordering. If it were Haskell I could ensure enumeration-order-invariance using a class CommutativeMonoid or CommutativeSemigroup... ;-)

Johan Tibell

9:32 p.m.

On Thu, Mar 21, 2013 at 2:26 PM, Isaac Dupree < ml@isaac.cedarswampstudios.org> wrote:

...

Regarding hashWithSalt determinism:

hashable 1.1: "The general contract of hash is: * This integer need not remain consistent from one execution of an application to another execution of the same application. [...] The contract for hashWithSalt is the same as for hash, with the additional requirement that any instance that defines hashWithSalt must make use of the salt in its implementation." [1]

hashable 1.2: "The general contract of hashWithSalt is: * If a value is hashed using the same salt during distinct runs of an application, the result must remain the same. (This is necessary to make it possible to store hashes on persistent media.) [...]"

Which contract do we want?

When I wrote the 1.1 contract I gave me lots of leeway to change the implementation in the future. The actual implementation was that the hash was always the same between run to run for a given architecture. I'm not terribly happy with the "This is necessary to make it possible to store hashes on persistent media." part of the 1.2 contract. I should probably not have let that go in. If you're persisting your hashes you should use a hash function that guarantees exactly which algorithm is used. I think the contract should be: the hash function is guaranteed to return the same hash code for a given value as long as the code is compiled with the same version of hashable, unless the user explicit turns on hash randomization (i.e. random seed read from /dev/urandom). I don't think we should make any guarantees that a new version of hashable won't change the hash function used. As for word sizes the only practical thing is to use the native word size, as anything else is much too slow (i.e. Int64 is terribly slow on 32-bit platforms). -- Johan

Conrad Parker

22 Mar 22 Mar

12:43 a.m.

On 22 March 2013 05:32, Johan Tibell wrote:

...

On Thu, Mar 21, 2013 at 2:26 PM, Isaac Dupree wrote:

...
Regarding hashWithSalt determinism:

hashable 1.1: "The general contract of hash is: * This integer need not remain consistent from one execution of an application to another execution of the same application. [...] The contract for hashWithSalt is the same as for hash, with the additional requirement that any instance that defines hashWithSalt must make use of the salt in its implementation." [1]

hashable 1.2: "The general contract of hashWithSalt is: * If a value is hashed using the same salt during distinct runs of an application, the result must remain the same. (This is necessary to make it possible to store hashes on persistent media.) [...]"

Which contract do we want?

When I wrote the 1.1 contract I gave me lots of leeway to change the implementation in the future. The actual implementation was that the hash was always the same between run to run for a given architecture.

I'm not terribly happy with the "This is necessary to make it possible to store hashes on persistent media." part of the 1.2 contract. I should probably not have let that go in. If you're persisting your hashes you should use a hash function that guarantees exactly which algorithm is used.

I think the contract should be: the hash function is guaranteed to return the same hash code for a given value as long as the code is compiled with the same version of hashable, unless the user explicit turns on hash randomization (i.e. random seed read from /dev/urandom). I don't think we should make any guarantees that a new version of hashable won't change the hash function used.

It seems like there's two common uses for hashes: a) secure, one-way digests b) fast keys for balanced containers It's easy to think of programs which would make use of both kinds of hash for different things. The existing API with the nice simple short-named "hash" function will continue to: * annoy people who prefer one or the other, but read the docs and find out they're not getting what they want * infuriate people who expect one or the other, but don't read the docs then complain that Haskell is slow or insecure * burden people who would make use of either kind of hash appropriately, but now need to read the docs or source to find out what to do I'd suggest to annoy everyone equally by deprecating the existing API and introducing separately named functions like: secureHash insecureHash / fastHash As it stands, anyone who cares about any of this has to pay attention to the particular version of hashable they are using, which sounds like a recipe for failure. This thread puts it perhaps around level 9 on: http://sourcefrog.net/weblog/software/aesthetics/interface-levels.html Why not just throw it out and introduce a version 2.0 with an unambiguous API? Conrad.

Isaac Dupree

2:40 p.m.

On 03/21/2013 05:32 PM, Johan Tibell wrote:

...

I think the contract should be: the hash function is guaranteed to return the same hash code for a given value as long as the code is compiled with the same version of hashable, unless the user explicit turns on hash randomization (i.e. random seed read from /dev/urandom). I don't think we should make any guarantees that a new version of hashable won't change the hash function used.

That sounds like the right contract for a hashtable hashing function to have.

...

As for word sizes the only practical thing is to use the native word size, as anything else is much too slow (i.e. Int64 is terribly slow on 32-bit platforms).

Yeah. And Int32/Word32 hashes don't work for containers with more than 2^32 items. Alright.

...

If you're persisting your hashes you should use a hash function that guarantees exactly which algorithm is used.

Yep. But it won't/can't use this Hashable API (e.g. via newtype CrossPlatformDeterministicallySipHashed), because of the variable word size. I accept this. -Isaac

4497

Age (days ago)

4500

Last active (days ago)

List overview

Download

20 comments

11 participants

participants (11)

Conrad Parker
Felipe Almeida Lessa
Gregory Collins
Gábor Lehel
Herbert Valerio Riedel
Isaac Dupree
Jan-Willem Maessen
Jason Dusek
Johan Tibell
Milan Straka
Thomas Schilling