
Hello.

When I insert 50 values into an empty hashtable, I get longest chains of length 10. Like this:

import Data.HashTable ( HashTable )
import qualified Data.HashTable as H
import System.Random

main = do
    t <- H.new ( == ) H.hashInt
    sequence $ replicate 50 $ do
        x <- randomRIO ( 0 , 1000 )
        H.update t x ()
    lc <- H.longestChain t
    print lc

output:

[(831,()),(346,()),(773,()),(475,()),(812,()),(307,()),(113,()),(637,()),(100,())]

This is not what I expected (quite the contrary). Did I miss something here? Access time is linear in the length of the longest chain, isn't it? Perhaps the initial table (produced by "new") is too small, so the table is always overloaded?

Looking at http://java.sun.com/javase/6/docs/api/java/util/HashMap.html they start out with a table size of 16 by default *and* (more importantly) they have a constructor that lets you set the initial table size and the load factor.

Best regards, Johannes Waldmann.

You probably want to install HsJudy if you need a good hashtable: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/HsJudy Cheers, Chris

On Mar 6, 2008, at 9:28 AM, Johannes Waldmann wrote:
Hello.
When I insert 50 values in an empty hashtable, I get longest chains of length 10. Like this:
import Data.HashTable ( HashTable )
import qualified Data.HashTable as H
import System.Random

main = do
    t <- H.new ( == ) H.hashInt
    sequence $ replicate 50 $ do
        x <- randomRIO ( 0 , 1000 )
        H.update t x ()
    lc <- H.longestChain t
    print lc

output: [(831,()),(346,()),(773,()),(475,()),(812,()),(307,()),(113,()),(637,()),(100,())]
The Data.HashTable code was tuned to favor relatively high bucket loads. Is that a good idea (I ask, having done some of this tuning myself)? It's been a long time since the code was re-tuned, and it's likely some of the decisions ought to change in light of changes in GHC's compiler, GC, and so forth. At the moment:

-- -----------------------------------------------------------------------------
-- Parameters

tABLE_MAX :: Int32
tABLE_MAX = 32 * 1024 * 1024   -- Maximum size of hash table

tABLE_MIN :: Int32
tABLE_MIN = 8

hLOAD :: Int32
hLOAD = 7                      -- Maximum average load of a single hash bucket

hYSTERESIS :: Int32
hYSTERESIS = 64                -- entries to ignore in load computation

{- Hysteresis favors long association-list-like behavior for small tables. -}

It's likely modern versions of GHC would fare better with smaller settings of tABLE_MIN, hLOAD, and hYSTERESIS. I seem to recall that doubling the table used to be rather hard on the GC.

-Jan
This is not what I expected (quite the contrary). Did I miss something here? The access time is linear in the length of the longest chain? Perhaps the initial table (produced by "new") is too small, so the table is always overloaded?
Looking at http://java.sun.com/javase/6/docs/api/java/util/HashMap.html they start out with table size 16 by default *and* (more importantly) they have a constructor that allows to set the initial table size and the load factor.
Best regards, Johannes Waldmann.
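A quick sanity check on those numbers (a sketch, assuming the table really does stay at its minimum of 8 buckets for only 50 entries, and using a plain mod in place of hashInt and the real bucket-index computation):

import System.Random (randomRIO)
import Data.List (group, sort)
import Control.Monad (replicateM)

main :: IO ()
main = do
    -- 50 random keys, as in the original example
    keys <- replicateM 50 (randomRIO (0, 1000) :: IO Int)
    -- stand-in for the real bucket index; duplicate keys would overwrite in
    -- the real table, so real chains could be slightly shorter than this
    let buckets = map (`mod` 8) keys
        longest = maximum (map length (group (sort buckets)))
    putStrLn ("longest bucket with 8 buckets: " ++ show longest)

Run a few times, this typically reports a longest bucket of around 9 to 12, so a chain of length 10 for 50 inserts is roughly what the configured load parameters predict, not a sign of a bad hash function.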

jmaessen:
The Data.HashTable code was tuned to favor relatively high bucket loads. Is that a good idea (I ask, having done some of this tuning myself)? It's been a long time since the code was re-tuned, and it's likely some of the decisions ought to change in light of changes in GHC's compiler, GC, and so forth. At the moment:
After modifying it for the shootout, I noticed it's missing a lot of inline stuff (i.e. it tends not to get inlined or specialised). Needs a big overhaul.

On Mar 6, 2008, at 2:15 PM, Don Stewart wrote:
After modifying it for the shootout, I noticed it's missing a lot of inline stuff (i.e. it tends not to get inlined or specialised).
Agreed. I'd be curious to see which is more important in practice: picking the right bucket and loading parameters, or doing better specialization. Because the hash function is wrapped by the table itself (as opposed to, e.g., using a Hashable class; I think HBC did this), it's a bit harder to get meaningful specialization: the table itself acts as the dictionary for the hash function, which is the thing one wants inlined. A bit of hand-rolled worker/wrapper might do wonders, though. -Jan
Needs a big overhaul.
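To make the class-based alternative concrete, here is a sketch (the Hashable class and names below are invented for illustration, not HBC's or any existing API); the point is that the hash function arrives through a dictionary GHC can resolve and inline at monomorphic call sites, rather than living inside the table as an unknown closure:

import qualified Data.HashTable as H
import Data.Int (Int32)

-- Hypothetical class: the instance supplies the hash function,
-- instead of the table carrying it around as a plain closure.
class Hashable k where
    hashKey :: k -> Int32

instance Hashable Int where
    hashKey = H.hashInt

newtype Table k v = Table (H.HashTable k v)

-- At a monomorphic call site (k ~ Int here) the hashKey dictionary can be
-- specialised away; Data.HashTable's own 'new' only ever sees a function value.
newTable :: (Eq k, Hashable k) => IO (Table k v)
newTable = fmap Table (H.new (==) hashKey)

insert :: Table k v -> k -> v -> IO ()
insert (Table t) k v = H.update t k v >> return ()

main :: IO ()
main = do
    t <- newTable :: IO (Table Int ())
    mapM_ (\k -> insert t k ()) [1 .. 50]
    let Table h = t
    print . length =<< H.longestChain h

The INLINE/SPECIALISE pragmas and the worker/wrapper split Jan mentions would go on top of this; the sketch only shows where the dictionary ends up.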

On 06/03/2008, Johannes Waldmann wrote:
Hello.
When I insert 50 values in an empty hashtable, I get longest chains of length 10. Like this:
The Data.HashTable included with GHC probably shouldn't be. In practice, Data.Map outperforms it in essentially all cases (Data.HashTable stops working beyond a certain size, so any asymptotic benefits, if they exist at all, don't have time to kick in). Moreover, Data.Map has a nice pure interface. The finite maps provided by Data.Map are not actually hash tables but balanced binary trees, but who really cares? :) - Cale
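For comparison, the original 50-insert example translated to Data.Map looks roughly like this (a sketch along the lines of Cale's suggestion, not code from the thread):

import qualified Data.Map as M
import System.Random (randomRIO)
import Control.Monad (replicateM)

main :: IO ()
main = do
    keys <- replicateM 50 (randomRIO (0, 1000) :: IO Int)
    -- a pure, persistent map: no buckets, no chains, O(log n) operations
    let m = foldr (\k acc -> M.insert k () acc) M.empty keys
    print (M.size m)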

In practice, Data.Map outperforms it in essentially all cases (Data.HashTable stops working beyond a certain size and so any asymptotic benefits, if they exist at all, don't have time to kick in).
What! I learned at school, and I teach my students:

* balanced binary tree: costs are log(size)
* hashtable: costs are essentially constant

Therefore, the hashtable should be preferred in almost all cases (if you know a reasonable hash function, and except where you need persistence, of course), and the difference should be visible even for moderate sizes (e.g. 1000). With a reasonable implementation I expect log(1000) = 10 comparisons (and dereferencings) for the tree, while the hashtable should need the computation of the hash value (ideally, in registers), one memory access, and one comparison.

... but indeed some experiments with Data.Map and Data.HashTable support the point you made. That's either good for Data.Map or bad for Data.HashTable.

PS: I did not manage to compile HsJudy-1.0 with ghc-6.8.2 (some hsc file missing; perhaps that should be auto-generated? how?)

Best regards, Johannes.
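The kind of experiment mentioned here might look roughly like the following (a sketch only: 100000 sequential Int keys, CPU time via getCPUTime, so the numbers are indicative at best):

import qualified Data.Map as M
import qualified Data.HashTable as H
import Control.Monad (forM_)
import System.CPUTime (getCPUTime)

main :: IO ()
main = do
    let keys = [1 .. 100000] :: [Int]

    t0 <- getCPUTime
    let m = M.fromList [ (k, ()) | k <- keys ]
    print (M.size m)                    -- force the map
    t1 <- getCPUTime

    h <- H.new (==) H.hashInt
    forM_ keys (\k -> H.update h k ())
    xs <- H.toList h
    print (length xs)                   -- force the table
    t2 <- getCPUTime

    putStrLn ("Data.Map:       " ++ show (t1 - t0) ++ " ps")
    putStrLn ("Data.HashTable: " ++ show (t2 - t1) ++ " ps")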

Hi
I learned at school, and I teach my students:
* balanced binary tree: costs are log(size)
* hashtable: costs are essentially constant
All true, but log(1000) = 10 [nitpick: it's nearly always log base 2, not base 10]. 10 ~= constant.
... but indeed some experiments with Data.Map and Data.Hashtable support the point you made. That's either good for Data.Map or bad for Data.Hashtable.
I think both of those are true. A hashtable isn't a natural functional data structure, so it gets ignored, and so it never gets improved. I suspect a trie would be a much better data structure for most of the uses of a hashtable, but alas there isn't one in the standard libraries. Thanks Neil
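To make the trie suggestion concrete, here is a toy bit-trie over Int keys (hand-rolled for illustration; Data.IntMap is essentially a polished, big-endian Patricia-tree version of the same idea):

import Data.Bits (testBit)

data Trie a = Empty
            | Leaf Int a
            | Branch (Trie a) (Trie a)    -- left = bit clear, right = bit set

insert :: Int -> a -> Trie a -> Trie a
insert k v = go 0
  where
    go _ Empty = Leaf k v
    go b leaf@(Leaf k' _)
        | k == k'   = Leaf k v                    -- overwrite existing key
        | otherwise = go b (push b leaf)          -- split the old leaf first
    go b (Branch l r)
        | testBit k b = Branch l (go (b + 1) r)
        | otherwise   = Branch (go (b + 1) l) r

    -- push an existing leaf one level down, according to its own bit b
    push b leaf@(Leaf k' _)
        | testBit k' b = Branch Empty leaf
        | otherwise    = Branch leaf Empty
    push _ t = t

lookupT :: Int -> Trie a -> Maybe a
lookupT k = go 0
  where
    go _ Empty = Nothing
    go _ (Leaf k' v)
        | k == k'   = Just v
        | otherwise = Nothing
    go b (Branch l r)
        | testBit k b = go (b + 1) r
        | otherwise   = go (b + 1) l

main :: IO ()
main = do
    let t = foldr (\k -> insert k (show k)) Empty [1, 5, 9, 42, 1000 :: Int]
    print (lookupT 42 t, lookupT 7 t)    -- (Just "42", Nothing)

A lookup inspects at most word-size-many bits and ends with a single key comparison, so it behaves much like the "essentially constant" cost claimed for hash tables while staying purely functional.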

On Thu, Mar 6, 2008 at 9:48 PM, Neil Mitchell wrote:
Hi
I learned at school, and I teach my students, * balanced binary tree: costs are log(size) * hashtable: cost are essentially constant
All true, but log(1000) = 10 [nitpick, its nearly always log base 2, not base 10]. 10 ~= constant.
... but indeed some experiments with Data.Map and Data.Hashtable support the point you made. That's either good for Data.Map or bad for Data.Hashtable.
I think both of those are true. Hashtable isn't a natural functional data structure, so is ignored, so is not improved. I suspect a Trie would be a much better data structure for most of the uses of Hashtable, but alas there is not one in the standard libraries.
How about this one? http://www.haskell.org/ghc/docs/latest/html/libraries/containers/Data-IntMap...

You can use that to build a purely functional hash table quite easily, I suspect (giving you worst-case constant-time lookup, no fixed-size nonsense, no re-hashing, etc.).

-- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862
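One possible shape for that construction (a sketch, not an existing library): key the IntMap on the hash of the key and keep a small association list per bucket to handle collisions.

import qualified Data.IntMap as IM
import qualified Data.HashTable as H    -- only for its hashString

type HashMap k v = IM.IntMap [(k, v)]

emptyHM :: HashMap k v
emptyHM = IM.empty

insertHM :: Eq k => (k -> Int) -> k -> v -> HashMap k v -> HashMap k v
insertHM hash k v =
    IM.insertWith (\_new old -> (k, v) : filter ((/= k) . fst) old)
                  (hash k) [(k, v)]

lookupHM :: Eq k => (k -> Int) -> k -> HashMap k v -> Maybe v
lookupHM hash k m = IM.lookup (hash k) m >>= lookup k

main :: IO ()
main = do
    let hash = fromIntegral . H.hashString
        m    = foldr (\w -> insertHM hash w (length w)) emptyHM
                     (words "a purely functional hash table")
    print (lookupHM hash "hash" m, lookupHM hash "missing" m)

A lookup is then one hash computation, one IntMap descent (bounded by the word size), and a short list scan; and because the structure is persistent there is no fixed table size and no rehashing step.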

On Mar 6, 2008, at 5:16 PM, Sebastian Sylvan wrote:
How about this one? http://www.haskell.org/ghc/docs/latest/html/libraries/containers/Data-IntMap...
You can use that to build a purely functional hash table quite easily I suspect (giving you worst case constant time lookup, and no fixed size nonsense, or re-hashing etc.).
Indeed, during my last go-round on Data.HashTable I used an IntMap to verify several different implementations. But I only ended up putting back some small changes to hashing due to lack of time. -Jan-Willem Maessen

Johannes Waldmann wrote:
I learned at school, and I teach my students, * balanced binary tree: costs are log(size) * hashtable: costs are essentially constant
Correction:

* hashtable: costs are the key size in bits, which is always >= log(size) (because of the pigeonhole principle)

(The size being the number of elements stored, not a preallocated table size. But those two should be proportional anyway.)

So, hash tables have no asymptotic benefits; it's all in the constant factors.

Regards, apfelmus
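Spelled out, the pigeonhole step is: to distinguish n stored keys, each key must carry at least b bits where

    2^b >= n,  i.e.  b >= log2(n),

so even computing the hash of (or comparing) one full key already takes on the order of log(size) steps; the balanced tree and the hash table share the same asymptotic bound and differ only in constant factors.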
participants (8)

- apfelmus
- Cale Gibbard
- Chris Kuklewicz
- Don Stewart
- Jan-Willem Maessen
- Johannes Waldmann
- Neil Mitchell
- Sebastian Sylvan