
Hi All, I'm trying to figure out how to maximum performance out of one of my inner loops which involves string hashing. Consider the following hash function, which is a transliteration of a good one written in C: --8x--8x--8x--8x--8x--8x--8x--8x--8x module HashStr where import Data.Bits import Data.ByteString as BLOB import Data.Word data Triple = Triple !Word64 !Word64 !Word64 hashStr :: ByteString -> Word64 hashStr str = hashBlock (Triple gold gold gold) str where gold = 0x9e3779b97f4a7c13 hashBlock (Triple a b c) str | BLOB.length str > 0 = hashBlock (mix (Triple a' b' c')) rest | otherwise = c where a' = a + BLOB.foldl' make 0 (slice 0) b' = b + BLOB.foldl' make 0 (slice 1) c' = c + BLOB.foldl' make 0 (slice 2) make x w = (x `shiftL` 8) + fromIntegral w slice n = BLOB.take 8 $ BLOB.drop (8 * n) str rest = BLOB.drop 24 str mix :: Triple -> Triple mix = (\(Triple a b c) -> Triple (a - c) b c) . (\(Triple a b c) -> Triple (a `xor` (c `shiftR` 43)) b c) . (\(Triple a b c) -> Triple a (b - c) c) . (\(Triple a b c) -> Triple a (b - a) c) . (\(Triple a b c) -> Triple a (b `xor` (a `shiftL` 9)) c) . (\(Triple a b c) -> Triple a b (c - a)) . (\(Triple a b c) -> Triple a b (c - b)) . (\(Triple a b c) -> Triple a b (c `xor` (b `shiftR` 8))) . (\(Triple a b c) -> Triple (a - b) b c) . (\(Triple a b c) -> Triple (a - c) b c) . (\(Triple a b c) -> Triple (a `xor` (c `shiftR` 38)) b c) . (\(Triple a b c) -> Triple a (b - c) c) . (\(Triple a b c) -> Triple a (b - a) c) . (\(Triple a b c) -> Triple a (b `xor` (a `shiftL` 23)) c) . (\(Triple a b c) -> Triple a b (c - a)) . (\(Triple a b c) -> Triple a b (c - b)) . (\(Triple a b c) -> Triple a b (c `xor` (b `shiftR` 5))) . (\(Triple a b c) -> Triple (a - b) b c) . (\(Triple a b c) -> Triple (a - c) b c) . (\(Triple a b c) -> Triple (a `xor` (c `shiftR` 35)) b c) . (\(Triple a b c) -> Triple a (b - c) c) . (\(Triple a b c) -> Triple a (b - a) c) . (\(Triple a b c) -> Triple a (b `xor` (a `shiftL` 49)) c) . (\(Triple a b c) -> Triple a b (c - a)) . (\(Triple a b c) -> Triple a b (c - b)) . (\(Triple a b c) -> Triple a b (c `xor` (b `shiftR` 11))) . (\(Triple a b c) -> Triple (a - b) b c) . (\(Triple a b c) -> Triple (a - c) b c) . (\(Triple a b c) -> Triple (a `xor` (c `shiftR` 12)) b c) . (\(Triple a b c) -> Triple a (b - c) c) . (\(Triple a b c) -> Triple a (b - a) c) . (\(Triple a b c) -> Triple a (b `xor` (a `shiftL` 18)) c) . (\(Triple a b c) -> Triple a b (c - a)) . (\(Triple a b c) -> Triple a b (c - b)) . (\(Triple a b c) -> Triple a b (c `xor` (b `shiftR` 22))) --8x--8x--8x--8x--8x--8x--8x--8x--8x Obviously, we'd like all those lambdas and composes to be inlined, along with all the intermediate Triple structures. So, how do you convince ghc to do this? Alternatively, how would you *translate* rather than transliterate, the mix function? -- Dr Thomas Conway drtomc@gmail.com Silence is the perfectest herald of joy: I were but little happy, if I could say how much.

On Mon, Jun 18, 2007 at 11:55:05AM +1000, Thomas Conway wrote:
Hi All,
I'm trying to figure out how to maximum performance out of one of my inner loops which involves string hashing.
Consider the following hash function, which is a transliteration of a good one written in C: [ Code elided ] Obviously, we'd like all those lambdas and composes to be inlined, along with all the intermediate Triple structures. So, how do you convince ghc to do this? Alternatively, how would you *translate* rather than transliterate, the mix function?
Just pass the -O option to GHC. (-O2 for better results). On my system there are no lambdas or (.) left, as confirmed with -ddump-simpl. However, GHC's 64 bit type is implemented using full foreign calls (and thus rather expensive...) on 32 bit systems! So if possible don't do that. Also, GHC produces this type signature for $wmix: HashStr.$wmix :: GHC.Word.Word64 -> GHC.Prim.Word64# -> GHC.Prim.Word64# -> HashStr.Triple That is, the unboxing is incomplete. This is very very fishy; I'm submitting a bug. Stefan

FWIW, here's a link to the original c code: http://www.burtleburtle.net/bob/hash/evahash.html -- Dr Thomas Conway drtomc@gmail.com Silence is the perfectest herald of joy: I were but little happy, if I could say how much.

Thomas Conway wrote:
I'm trying to figure out how to maximum performance out of one of my inner loops which involves string hashing.
Consider the following hash function, which is a transliteration of a good one written in C:
Do you need the hash function for a hash table or for fingerprints/signatures? In the former case, Tries are a much better choice. For launching your own trie, see also Ralf Hinze. Generalizing generalized tries. Journal of Functional Programming, 10(4):327-351, July 2000 http://www.informatik.uni-bonn.de/~ralf/publications/GGTries.ps.gz Currently, there's no standard Data.Trie library but it's already under consideration http://hackage.haskell.org/trac/ghc/ticket/721 Regards, apfelmus

On 6/18/07, apfelmus
Do you need the hash function for a hash table or for fingerprints/signatures? In the former case, Tries are a much better choice. For launching your own trie, see also
I'm actually using them for bucket addressing for external indexing with a linear hash table. (Yes, the hashing does count, because buckets are cached in memory.) Actually, if one wants a concurrent dictionary, using something in the vein of type HashTable k v = TVar (Array Int (TVar [(k,v)])) has very good performance. It always seems something of a shame that if you want all the benefits of lazy functional programming, you too often have to settle for O(n log n) data structures. <non-haskell-slight-rant> Incidentally, while I've got your attention, I note that if you use a good quality hash function like the one I ripped off, you don't need to use [mostly] prime numbers for sizing your hash tables, and you can use powers of two instead, which simplifies a bunch of things. This is kind of obvious when you think about it, but every hash function I came across as an undergraduate or even as a post-grad, with the exception of md5 et al, was not good. The dogma was that *good* hash functions are too expensive for everyday use. So a word of advice to all you worthy tertiary educators - this is not the 1970s any more - good, cheap hash functions do exist, so update your course notes. :-) </non-haskell-slight-rant> We return you now to your normal haskell programming.... cheers, T. -- Dr Thomas Conway drtomc@gmail.com Silence is the perfectest herald of joy: I were but little happy, if I could say how much.

On Mon, 2007-06-18 at 19:35 +1000, Thomas Conway wrote:
On 6/18/07, apfelmus
wrote: Do you need the hash function for a hash table or for fingerprints/signatures? In the former case, Tries are a much better choice. For launching your own trie, see also
I'm actually using them for bucket addressing for external indexing with a linear hash table. (Yes, the hashing does count, because buckets are cached in memory.)
Actually, if one wants a concurrent dictionary, using something in the vein of
type HashTable k v = TVar (Array Int (TVar [(k,v)]))
has very good performance. It always seems something of a shame that if you want all the benefits of lazy functional programming, you too often have to settle for O(n log n) data structures.
<non-haskell-slight-rant> Incidentally, while I've got your attention, I note that if you use a good quality hash function like the one I ripped off, you don't need to use [mostly] prime numbers for sizing your hash tables, and you can use powers of two instead, which simplifies a bunch of things. This is kind of obvious when you think about it, but every hash function I came across as an undergraduate or even as a post-grad, with the exception of md5 et al, was not good. The dogma was that *good* hash functions are too expensive for everyday use. So a word of advice to all you worthy tertiary educators - this is not the 1970s any more - good, cheap hash functions do exist, so update your course notes. :-) </non-haskell-slight-rant>
Indeed, "Performance in Practice of String Hashing Functions" http://citeseer.ist.psu.edu/530453.html

Thomas Conway wrote:
On 6/18/07, apfelmus
wrote: Do you need the hash function for a hash table or for fingerprints/signatures? In the former case, Tries are a much better choice. For launching your own trie, see also
I'm actually using them for bucket addressing for external indexing with a linear hash table. (Yes, the hashing does count, because buckets are cached in memory.)
It always seems something of a shame that if you want all the benefits of lazy functional programming, you too often have to settle for O(n log n) data structures.
Trie it is, not balanced tree. A logarithm in this would be new to me. :) As a side node, Mr. Exp says: 64 is large enough for the size needs of any logarithm.
type HashTable k v = TVar (Array Int (TVar [(k,v)]))
Don't you want a TArray Int [(k,v)]? In any case, you could be able to set up an infinite trie and have lazy evaluation allocate space as needed: type Trie a = Branch (TVar a) (Trie a) (Trie a) -- an infinite tree with different TVars at every branch {-# NOINLINE tree -#} tree :: a -> Trie a tree x = Binary (unsafePerformIO $ newTVarIO x) (tree x) (tree x) The intention is that the different threads concurrently evaluate the suspension as far as they need to lookup/insert a key. The associated value can be put into the fresh TVar they find there. The tree structure itself is left untouched. Of course, it is imperative that all threads see the same TVars. I don't know how thunk updates are handled in a concurrent setting, but as they are write-once only and referentially transparent, i see no major problem with them. Regards, apfelmus PS: Hm, it's probably safer to code the lazy evaluation in the trie manually. As a side effect, you are then able to garbage collect unused but expanded parts of the trie from time to time.

On 6/19/07, apfelmus
Trie it is, not balanced tree. A logarithm in this would be new to me. :)
True enough, my braino.
As a side node, Mr. Exp says: 64 is large enough for the size needs of any logarithm.
Que?
type HashTable k v = TVar (Array Int (TVar [(k,v)]))
Don't you want a TArray Int [(k,v)]?
Essentially the same.
In any case, you could be able to set up an infinite trie and have lazy evaluation allocate space as needed:
type Trie a = Branch (TVar a) (Trie a) (Trie a)
Tree-like structure's are quite hostile to highly concurrent manipulation. It helps to introduce TVar indirections at each level: data Trie a = Branch (TVar a) (TVar (Trie a)) (TVar (Trie a)) Then you can update a subtree without having to modify the spine of the tree. There is some very fine work on this by Kim Larsen (and others), see for example http://citeseer.ist.psu.edu/2986.html T. -- Dr Thomas Conway drtomc@gmail.com Silence is the perfectest herald of joy: I were but little happy, if I could say how much.

Thomas Conway wrote:
On 6/19/07, apfelmus
wrote: Trie it is, not balanced tree. A logarithm in this would be new to me. :)
True enough, my braino.
So, accessing a key in a trie is O(key size in bits), not much different from a hash table.
As a side node, Mr. Exp says: 64 is large enough for the size needs of any logarithm.
Que?
A pun(y) formulation of the fact that (log n <= 64) for (almost) any finite map situation you'll ever encounter because 2^64 = 16 Exabyte.
In any case, you could be able to set up an infinite trie and have lazy evaluation allocate space as needed:
type Trie a = Branch (TVar a) (Trie a) (Trie a)
Tree-like structure's are quite hostile to highly concurrent manipulation.
It helps to introduce TVar indirections at each level:
data Trie a = Branch (TVar a) (TVar (Trie a)) (TVar (Trie a))
Then you can update a subtree without having to modify the spine of the tree.
What I wanted to point out is that the spine simply doesn't need to be modified at all. In other words, you have an infinite tree residing in memory and insertion/deletion happens by updating the TVars that hold the values (and probably should be of type TVar (Maybe a) then). Of course, there can't be infinite trees in memory :) but lazy evaluation makes it appear as if. In fact, every branch will be modified at most once before the first read and subsequent accesses are read-only. I don't know whether or how this is implemented in concurrent GHC at the moment, but I think that this can safely be implemented with a lock whereas dead-lock means that the program denotes _|_ anyway. Regard, apfelmus

On 6/19/07, apfelmus

On Jun 17, 2007, at 9:55 PM, Thomas Conway wrote:
Hi All,
I'm trying to figure out how to maximum performance out of one of my inner loops which involves string hashing. ... mix :: Triple -> Triple
This looks like a version of the Bob Jenkins hash function from burtleburtle.net. I implemented the 32-bit version of this as follows: mix :: Int32 -> Int32 -> Int32 -> (Int32 -> Int32 -> Int32 -> a) -> a mix a0 b0 c0 k0 = let mixR k a b c = (a-b-c) `xor` (c `shiftR` k) mixL k b c a = (b-c-a) `xor` (a `shiftL` k) mix3 k1 k2 k3 k a b c = let a' = mixR k1 a b c b' = mixL k2 b c a' c' = mixR k3 c a' b' in k a' b' c' in (mix3 13 8 13 $ mix3 12 16 5 $ mix3 3 10 15 $ k0) a0 b0 c0 I mention this because your code writes the whole thing out longhand---which might be faster, or might not, but certainly misses the highest-level structural patterns in the original. Your use of a data type to represent triples is probably better nowadays than my rather quirky use of CPS (in other words, this could have been a function Triple -> Triple instead of the rather odd type you see above). That said, I assume you instrumented your code and determined that hash collisions are actually a bottleneck for you, and that a hash table is the right structure to begin with? I fell back on much- simpler multiplicative hashing schemes for Data.HashTable. A multiply is much faster than vast amounts of bit-fiddling---but of course its collision behavior isn't nearly as good and this can be a problem with large data sets. And note that the multiplicative hashing currently used in Data.HashTable doesn't require prime table sizes; in fact we use powers of two and table doubling. When last I checked the result was faster than Data.Map, but not by much, and using strings probably wipes out that advantage vs. tries. -Jan-Willem Maessen

On 6/19/07, Jan-Willem Maessen
This looks like a version of the Bob Jenkins hash function from burtleburtle.net. I implemented the 32-bit version of this as follows:
Indeed. It's the 64-bit version. 32 bits is oh-so-last-century. ;-)
mix :: Int32 -> Int32 -> Int32 -> (Int32 -> Int32 -> Int32 -> a) -> a [deletia] I mention this because your code writes the whole thing out longhand---which might be faster, or might not, but certainly misses the highest-level structural patterns in the original. Your use of a data type to represent triples is probably better nowadays than my rather quirky use of CPS (in other words, this could have been a function Triple -> Triple instead of the rather odd type you see above).
Well, the main difference, is the CPS version just folds the uses of (.) into the individual groups of arithmetic. Actually, without knowing what GHC *actually* does, it is conceivable that a compiler could do better with the CPS version, precisely because there's one less layer of abstraction to inline/fold away. I'll have to give it a go if I get a chance (this is code for my Real Job (TM), and tuning the life out of the code isn't necessary right now, but I thought I'd float this, as much because I might learn something, as anything. The thinkon flux in this list is pretty favourable).
That said, I assume you instrumented your code and determined that hash collisions are actually a bottleneck for you, and that a hash table is the right structure to begin with?
I'm implementing a species of database (trade secrets, blah, blah, blah), which needs to handle *large* data sets, and actually, an external B-tree is probably better than an external hash table. I decided to do a hash table first though to iron out some of the issues to do with concurrent external structures. A linear hash table is pretty simple compared to a B-tree. The Jenkins' hash function comes into it because you really want to avoid overfull buckets. It's also one of those cases, where you'd like the compiler to do a good job. If the compiler can't do a good job of straight line operations on [essentially] built in data types, then what hope have we of convincing anyone, including ourselves, that Haskell is fit for Real Programs (TM).
When last I checked the result was faster than Data.Map, but not by much, and using strings probably wipes out that advantage vs. tries.
<edna-e-mode-voice> No Strings, darling! No Strings. </edna-e-mode-voice> cheers, T. -- Dr Thomas Conway drtomc@gmail.com Silence is the perfectest herald of joy: I were but little happy, if I could say how much.

Hello Jan-Willem, Tuesday, June 19, 2007, 1:17:25 AM, you wrote:
table is the right structure to begin with? I fell back on much- simpler multiplicative hashing schemes for Data.HashTable. A
btw, are you seen http://isthe.com/chongo/tech/comp/fnv/ ? he suggest to use non-zero value as starting hash value and provides concrete constants for base and multiplier -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
participants (6)
-
apfelmus
-
Bulat Ziganshin
-
Derek Elkins
-
Jan-Willem Maessen
-
Stefan O'Rear
-
Thomas Conway