
Hello haskell-cafe,

While solving one more task that uses an English dictionary, I wondered: why don't we have a pure hashtable library yet? There are imperative hashtables, pretty complex as they need to rebuild the entire table as it grows. There are also simple assoc lists and tree/trie implementations, but there are no simple non-modifiable hashes.

Here is how it should look (a minimal sketch follows below):

* the hashtable is represented as an array of assoc lists: Array Int [(a,b)]
* the interface to extract data from the table is the same as for an assoc list: lookup :: HT a b -> a -> Maybe b
* the table may be built from an assoc list. We just need to know its size beforehand in order to create an Array of reasonable size, and the constructor also needs a hashing function: create :: [(a,b)] -> Int -> (a->Int) -> HT a b

Given these constraints, it should be just 10-20 lines of code, and provide much better efficiency than any tree/trie implementation.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
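A minimal sketch of the interface described above (the mod-reduction of the hash and the use of accumArray are assumptions, not part of the proposal):

import Prelude hiding (lookup)
import Data.Array (Array, accumArray, (!))
import qualified Data.List as List

data HT a b = HT (a -> Int)            -- hash function, reduced to a bucket index
                 (Array Int [(a, b)])  -- one assoc list per bucket

-- Build a table with 'size' buckets from an assoc list.
create :: [(a, b)] -> Int -> (a -> Int) -> HT a b
create kvs size hash = HT bucket arr
  where
    bucket k = hash k `mod` size
    arr      = accumArray (flip (:)) [] (0, size - 1)
                          [ (bucket k, (k, v)) | (k, v) <- kvs ]

-- Same interface as lookup on a plain assoc list.
lookup :: Eq a => HT a b -> a -> Maybe b
lookup (HT bucket arr) k = List.lookup k (arr ! bucket k)

This fits comfortably within the 10-20 lines estimated above.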

Hello Jason, Wednesday, August 27, 2008, 11:55:31 AM, you wrote:
Given these constraints, it should be just 10-20 lines of code, and provide much better efficiency than any tree/trie implementation
Much better efficiency in what way?
Instead of going through many levels of a tree/trie, the lookup function will just select an array element by hash value and look through a few elements in the assoc list:

data HT a b = HT (a -> Int)            -- hash function
                 (Array Int [(a, b)])  -- buckets: one assoc list per hash value

HT.lookup (HT hash arr) a = List.lookup a (arr ! hash a)

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
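A hypothetical usage of the earlier create/lookup sketch (assuming those definitions are in scope, with Prelude's lookup hidden; the character-sum hash is illustrative only, not a serious hash function):

-- Build an 8-bucket table over a tiny dictionary, then look one word up.
example :: Maybe Int
example = lookup ht "cafe"       -- == Just 2
  where
    ht = create [("haskell", 1), ("cafe", 2), ("hash", 3)] 8
                (sum . map fromEnum)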

On 27 Aug 2008, at 10:09, Bulat Ziganshin wrote:
Hello Jason,
Wednesday, August 27, 2008, 11:55:31 AM, you wrote:
Given these constraints, it should be just 10-20 lines of code, and provide much better efficiency than any tree/trie implementation
Much better efficiency in what way?
Instead of going through many levels of a tree/trie, the lookup function will just select an array element by hash value and look through a few elements in the assoc list:

data HT a b = HT (a -> Int)            -- hash function
                 (Array Int [(a, b)])  -- buckets: one assoc list per hash value

HT.lookup (HT hash arr) a = List.lookup a (arr ! hash a)

Which makes two assumptions. One is that your array is big enough (believable), and the other, that your font is big enough.
Bob

From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Thomas Davie
Much better efficiency in what way?
Instead of going through many levels of a tree/trie, the lookup function will just select an array element by hash value and look through a few elements in the assoc list:

data HT a b = HT (a -> Int)            -- hash function
                 (Array Int [(a, b)])  -- buckets: one assoc list per hash value

HT.lookup (HT hash arr) a = List.lookup a (arr ! hash a)

Which makes two assumptions. One is that your array is big enough (believable), and the other, that your font is big enough.
... and the other, that your font is big enough.
Que? This is lost on me. Care to explain?

Alistair

On 27 Aug 2008, at 10:39, Bayley, Alistair wrote:
From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Thomas Davie
Much better efficiency in what way?
Instead of going through many levels of a tree/trie, the lookup function will just select an array element by hash value and look through a few elements in the assoc list:

data HT a b = HT (a -> Int)            -- hash function
                 (Array Int [(a, b)])  -- buckets: one assoc list per hash value

HT.lookup (HT hash arr) a = List.lookup a (arr ! hash a)

Which makes two assumptions. One is that your array is big enough (believable), and the other, that your font is big enough.
... and the other, that your font is big enough.
Que? This is lost on me. Care to explain?
Sorry, I probably shouldn't have sent that; it was a dig at the fact that the message was sent with all the text in font size 930 or so.

Bob

Bulat Ziganshin wrote:
While solving one more task that uses an English dictionary, I wondered: why don't we have a pure hashtable library yet? There are imperative hashtables, [...] Here is how it should look:
* the hashtable is represented as an array of assoc lists: Array Int [(a,b)]
Don't immutable arrays get rather inefficient when modified?
[...]
Just a thought: you might want to have a look at the bloom filter implementation. On the one hand, it is an alternative to your dictionary, and on the other hand, its implementation uses hash functions and arrays as well. IIRC it does that in a state monad that keeps the array mutable and finally freezes it before use, which might be a good idea for a pure hash table as well.

//Stephan

-- It used to be said: I think, therefore I am. Today we know: it works without that, too. - Dieter Nuhr
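For illustration, a minimal sketch of the build-mutable-then-freeze idea (the function name and the mod-reduction of the hash are assumptions):

import Data.Array (Array)
import Data.Array.ST (newArray, readArray, writeArray, runSTArray)

-- Fill mutable buckets inside ST; runSTArray freezes the result into a pure Array.
buildBuckets :: Int -> (k -> Int) -> [(k, v)] -> Array Int [(k, v)]
buildBuckets size hash kvs = runSTArray $ do
    arr <- newArray (0, size - 1) []
    mapM_ (\(k, v) -> do
              let i = hash k `mod` size
              bucket <- readArray arr i
              writeArray arr i ((k, v) : bucket))
          kvs
    return arr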

* the hashtable is represented as an array of assoc lists: Array Int [(a,b)]
Don't immutable arrays get rather inefficient when modified?
Bulat was specifically asking for "simple *non-modifiable* hashes" http://article.gmane.org/gmane.comp.lang.haskell.cafe/43612 J.W.

Hello Stephan, Wednesday, August 27, 2008, 1:52:23 PM, you wrote:
and on the other and, its implementation uses hash functions and arrays as well. IIRC it does that in a state monad that keeps the array mutable and finally freezes it before usage, which might be a good idea for pure hash table as well.
Actually, any building of an immutable array is internally done this way; we just don't need to write this low-level function ourselves, since the Array library provides a pure function to build an array from a list/assoc list.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
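The pure builder in question is presumably Data.Array.accumArray, which folds a list of (index, element) pairs into an array in a single pass; bucket construction with it is a one-liner:

import Data.Array (Array, accumArray)

-- Each (bucketIndex, entry) pair is consed onto its bucket.
buckets :: Int -> [(Int, (k, v))] -> Array Int [(k, v)]
buckets size = accumArray (flip (:)) [] (0, size - 1)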

On Aug 27, 2008, at 3:41 AM, Bulat Ziganshin wrote:
Hello haskell-cafe,
While solving one more task that uses an English dictionary, I wondered: why don't we have a pure hashtable library yet? There are imperative hashtables, pretty complex as they need to rebuild the entire table as it grows. There are also simple assoc lists and tree/trie implementations, but there are no simple non-modifiable hashes.
I know that Lennart had such a hashtable implementation as part of the hbcc source tree (so dating back to the late stream age or the very, very early monad age), though I think it relied upon hbc's LazyArray.

One obvious way to make non-modifiable hash tables useful is to "eat your own tail" non-strictly--- aggregate a set of hash table entries, construct a hash table from them, and plumb the resulting hash table into the original computation by tying the knot. This works really well if you can construct the bucket lists lazily and if you specify the table size up front.

You can't make this trick work at all for tree-based maps in general, because the structure of the tree depends upon all the keys inserted. You also can't make this trick work if you base the size of the hash table on the number of keys inserted, maximum bucket load, etc. Finally, it doesn't work with strict arrays at all. So a nice niche for a small and clever static hash table.

-Jan

Hello Jan-Willem, Wednesday, August 27, 2008, 4:06:11 PM, you wrote:
One obvious way to make non-modifiable hash tables useful is to "eat your own tail" non-strictly--- aggregate a set of hash table entries, construct a hash table from them, and plumb the resulting hash table into the original computation by tying the knot. This works really well if you can construct the bucket lists lazily and if you specify the table size up front.
I think it's impossible, since you need to scan the whole assoc list to build the list of entries for any given value of the hash function. Actually, the same is true for any array-building code: while the *values* of array elements may be calculated lazily, the *positions* cannot be.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Aug 27, 2008, at 4:31 PM, Bulat Ziganshin wrote:
Hello Jan-Willem,
Wednesday, August 27, 2008, 4:06:11 PM, you wrote:
One obvious way to make non-modifiable hash tables useful is to "eat your own tail" non-strictly--- aggregate a set of hash table entries, construct a hash table from them, and plumb the resulting hash table into the original computation by tying the knot. This works really well if you can construct the bucket lists lazily and if you specify the table size up front.
I think it's impossible, since you need to scan the whole assoc list to build the list of entries for any given value of the hash function.
I think this is because I wasn't quite clear enough about the problem I was solving. I think you'll agree that we can use an assoc list non-strictly in the following sense:

* We can do lookups that succeed so long as we can compute all keys up to and including the key that matches.
* We never do non-strict lookups that fail, as that would require examining the entire assoc list.

Now I can build a hashtable with the same property from an assoc list, albeit very inefficiently, using code like this (untested; the table size n is taken as an argument):

lazyArray :: (Ix i) => (i, i) -> [(i, v)] -> Array i [v]
lazyArray bnds kvs =
    array bnds [ (k, map snd . filter ((k ==) . fst) $ kvs) | k <- range bnds ]

makeHash :: (Eq k, Hashable k) => Int -> [(k, v)] -> Array Int [(k, v)]
makeHash n assocList = lazyArray (0, n - 1) labeledAssocList
  where
    labeledAssocList = [ (hashToSize n k, (k, v)) | (k, v) <- assocList ]

We label each assoc-list element with its corresponding hash bucket (labeledAssocList); each bucket then contains exactly the elements of the assoc list that map to that bucket, in exactly the order in which they occurred in the assoc list.

The LazyArray library in hbc essentially did exactly what the lazyArray function I've shown above does, only the input list is traversed once rather than having a separate traversal for each bucket. This can actually be implemented using the ST monad augmented by unsafeFreezeSTArray and unsafeInterleaveST; indeed the "State in Haskell" paper by Peyton Jones and Launchbury gives the implementation of a very similar function. I have code for LazyArray based on the above paper that works with GHC, but I haven't needed it in a while.

-Jan
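A hedged illustration of the knot-tying use described above, assuming the lazyArray sketch from this message is in scope (with Data.Array imported) and substituting a toy character-sum hash for hashToSize:

toyHash :: Int -> String -> Int
toyHash n = (`mod` n) . sum . map fromEnum

-- Node depths in a tiny tree: each value is looked up in the very table being
-- built. This terminates because forcing a bucket forces only the hashes of
-- the keys, never the values.
depths :: Array Int [(String, Int)]
depths = lazyArray (0, n - 1) [ (toyHash n k, (k, depth p)) | (k, p) <- edges ]
  where
    n        = 4
    edges    = [("root", ""), ("a", "root"), ("b", "a")]
    depth "" = 0
    depth p  = maybe 0 (+ 1) (lookup p (depths ! toyHash n p))  -- ties the knot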

Bulat Ziganshin wrote:
Hello haskell-cafe,
While solving one more task that uses an English dictionary, I wondered: why don't we have a pure hashtable library yet? There are imperative hashtables, pretty complex as they need to rebuild the entire table as it grows. There are also simple assoc lists and tree/trie implementations, but there are no simple non-modifiable hashes.
Here is how it should look:
* the hashtable is represented as an array of assoc lists: Array Int [(a,b)]
* the interface to extract data from the table is the same as for an assoc list: lookup :: HT a b -> a -> Maybe b
* the table may be built from an assoc list. We just need to know its size beforehand in order to create an Array of reasonable size, and the constructor also needs a hashing function:
create :: [(a,b)] -> Int -> (a->Int) -> HT a b
Given these constraints, it should be just 10-20 lines of code, and provide much better efficiency than any tree/trie implementation
Prove it.

To get "much better efficient" than a trie, the hash function has to be so fast that it is faster than following (log n) pointers, and yet also so "perfect" that it doesn't generate too many collisions.

As you correctly say, a simple implementation is easy to do, so why not do it and see how it performs? :)

Jules

Hello Jules, Wednesday, August 27, 2008, 7:21:46 PM, you wrote:
Given these constraints, it should be just 10-20 lines of code, and provide much better efficiency than any tree/trie implementation
Prove it.
To get "much better efficient" than a trie, the hash function has to be so fast that it is faster than following (log n) pointers
AFAIU, trie search follows n pointers, i.e. every char in the string forms a new trie level. Add to this that every search needs to scan a list of all possible chars at each level, or you need to use the same hashing. So we have the following lookup times:

trie: O(len) * O(width)
hashed trie: O(len)
hash: O(len)

where len is the length of the string searched and width is the average number of chars at each trie level

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
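As a concrete reference point for the comparison above, a minimal (purely illustrative) list-based trie lookup: each character descends one level, and each level scans its list of children, giving the O(len) * O(width) cost:

data Trie v = Trie (Maybe v) [(Char, Trie v)]

-- One list scan per character of the key.
trieLookup :: String -> Trie v -> Maybe v
trieLookup []       (Trie v _)        = v
trieLookup (c : cs) (Trie _ children) = lookup c children >>= trieLookup cs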

Bulat Ziganshin wrote:
Hello Jules,
Wednesday, August 27, 2008, 7:21:46 PM, you wrote:
Given these constraints, it should be just 10-20 lines of code, and provide much better efficiency than any tree/trie implementation
Prove it.
To get "much better efficient" than a trie, the hash function has to be so fast that it is faster than following (log n) pointers
AFAIU, trie search follows n pointers
No. "n" is the number of strings in my data set (dictionary). If I have "n" strings the average string length is asymptotically, in some sense, "log n". Of course for particular data sets it's may be more . But "log n" is the length of the shortest unique coding, it's also the number of characters you typically need to traverse before you have reached a unique prefix, at which point your trie can short-circuit. I appreciate that I didn't define my terminology ;) You might have a different "n". I repeat my challenge "Prove it". I will be interested to see how much a good immutable hash outperforms Data.Map. I would then also be interested to see how much it outperforms a decent Data.Map (such as your own AVL one) and a decent trie. Jules

Hello Jules, Wednesday, August 27, 2008, 7:59:55 PM, you wrote:
To get "much better efficient" than a trie, the hash function has to be so fast that it is faster than following (log n) pointers
AFAIU, trie search follows n pointers
No.
"n" is the number of strings in my data set (dictionary).
If I have "n" strings the average string length is asymptotically, in some sense, "log n". Of course for particular data sets it's may be more .
But "log n" is the length of the shortest unique coding, it's also the number of characters you typically need to traverse before you have reached a unique prefix, at which point your trie can short-circuit.
I appreciate that I didn't define my terminology ;) You might have a different "n".
I repeat my challenge "Prove it". I will be interested to see how much a good immutable hash outperforms Data.Map.
I would then also be interested to see how much it outperforms a decent Data.Map (such as your own AVL one) and a decent trie.
1) I don't have the time/interest to do it, so I just pushed the idea.

2) What you have proved there is that for a set of n randomly chosen strings, trie lookup needs O(log n) time; the same is true for a hash.

3) The practical performance of a trie will suffer from the need to follow several trie levels, each providing several elements to check. OTOH, a hash lookup will usually need only 1-2 checks, including one check of the full string.

4) Using ByteString for hash indexes would make lookups much faster.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
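One plausible reading of point 4: hashing a ByteString touches each byte once, with no per-character boxing or pointer chasing. A minimal sketch (FNV-1a is chosen arbitrarily here):

import qualified Data.ByteString as B
import Data.Bits (xor)
import Data.Word (Word64)

-- 64-bit FNV-1a over the raw bytes, as a single strict left fold.
hashBS :: B.ByteString -> Word64
hashBS = B.foldl' step 14695981039346656037
  where
    step h b = (h `xor` fromIntegral b) * 1099511628211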

On Wednesday, 27 August 2008 at 17:36, Bulat Ziganshin wrote:
Hello Jules,
Wednesday, August 27, 2008, 7:21:46 PM, you wrote:
Given these constraints, it should be just 10-20 lines of code, and provide much better efficiency than any tree/trie implementation
Prove it.
To get "much better efficient" than a trie, the hash function has to be so fast that it is faster than following (log n) pointers
AFAIU, trie search follows n pointers, i.e. every char in the string forms a new trie level. Add to this that every search needs to scan a list of all possible chars at each level, or you need to use the same hashing. So we have the following lookup times:
trie: O(len) * O(width)
hashed trie: O(len)
hash: O(len)
Wouldn't the hashtable have lookup time O(len+bucketsize)?
where len is the length of the string searched and width is the average number of chars at each trie level

Hello Daniel, Wednesday, August 27, 2008, 8:01:24 PM, you wrote:
trie: O(len) * O(width)
hashed trie: O(len)
hash: O(len)
Wouldn't the hashtable have lookup time O(len+bucketsize)?
I suppose that bucketsize = O(1): the constructor should get the [approximate] size of the hashed assoc list and will select an appropriate Array size accordingly.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
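A minimal sketch of one such sizing rule (an assumption, not something specified in the thread): allocate roughly one bucket per entry, rounded up to a power of two, so the expected bucket length stays constant:

-- Smallest power of two that is at least the number of entries.
tableSize :: Int -> Int
tableSize n = head [ s | s <- iterate (* 2) 1, s >= max 1 n ]

-- e.g. tableSize 40000 == 65536, i.e. under one entry per bucket on average.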

On Wed 2008-08-27 16:21, Jules Bean wrote:
Bulat Ziganshin wrote:
Hello haskell-cafe,
While solving one more task that uses an English dictionary, I wondered: why don't we have a pure hashtable library yet? There are imperative hashtables, pretty complex as they need to rebuild the entire table as it grows. There are also simple assoc lists and tree/trie implementations, but there are no simple non-modifiable hashes.
Here is how it should look:
* the hashtable is represented as an array of assoc lists: Array Int [(a,b)]
* the interface to extract data from the table is the same as for an assoc list: lookup :: HT a b -> a -> Maybe b
* the table may be built from an assoc list. We just need to know its size beforehand in order to create an Array of reasonable size, and the constructor also needs a hashing function:
create :: [(a,b)] -> Int -> (a->Int) -> HT a b
Given these constraints, it should be just 10-20 lines of code, and provide much better efficiency than any tree/trie implementation
Prove it.
To get "much better efficient" than a trie, the hash function has to be so fast that it is faster than following (log n) pointers, and yet also so "perfect" that it doesn't generate too many collisions.
Many people have probably seen this and it has nothing to do with Haskell, but it is a good performance comparison of a simple hash against an optimized trie:

http://www.nothings.org/computer/judy/

The conclusion (we're only interested in lookup times) is that the trie is preferable for sequential lookups, slower for random-access hits, and about the same for random-access misses. Also, the hash was uniformly better for small sizes (< 10k).

BTW, a single cache miss has a latency of around 250 cycles; you can compute a hell of a hash for that.

Jed
participants (10)

- Bayley, Alistair
- Bulat Ziganshin
- Daniel Fischer
- Jan-Willem Maessen
- Jason Dusek
- Jed Brown
- Johannes Waldmann
- Jules Bean
- Stephan Friedrichs
- Thomas Davie