Re: [Haskell-cafe] Re: Hashtable woes

Ketil Malde wrote:
Chris Kuklewicz writes:
Is Jan-Willem Maessen's Hash available anywhere? I could benchmark it.
Did you ever get around to running the benchmark? I browsed around a bit, and found that knucleotide is probably the worst GHC benchmark in the shootout (even TCL beats GHC by a factor of two!) - which is disheartening, because I rely a lot on associative data structures (usually Data.Map) in my programs.
Or have Adrian Hey's AVL-trees been tried?
-k
No, I did not try it. This message from Simon Marlow
Jan-Willem's HashTable is attached. It uses unsafeThaw/unsafeFreeze tricks to avoid the GC overheads; for this you need an up-to-date GHC due to a bug in the garbage collector, so grab a STABLE snapshot (6.4.1 won't work). Or remove the unsafeThaw/unsafeFreeze to use it with 6.4.1, and be prepared to bump the heap size.
In GHC 6.6 the unsafeThaw/unsafeFreeze tricks aren't required, because the GC is essentially doing it for you - we put a write barrier in the IOArray implementation.
indicates that it triggers a bug in 6.4.1, which is what the shootout is using. And I suspected bumping the heap size just wouldn't cut it for the amount of data we are processing, though I did not test that suspicion. We never pounded on Data.Map, but I suspect it cannot be as bad as Data.Hashtable. -- Chris
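A minimal, hypothetical sketch of the unsafeThaw/unsafeFreeze trick Simon describes above - keeping the table as an immutable Array of buckets between operations, and thawing it in place only for the duration of a write - might look like this. It is not Jan-Willem's actual code, and the bucket representation is assumed for illustration:

import Data.Array (Array)
import Data.Array.IO (IOArray)
import Data.Array.MArray (readArray, writeArray)
import Data.Array.Unsafe (unsafeThaw, unsafeFreeze)  -- exported from Data.Array.MArray on older GHCs

-- Insert an entry into one bucket of an association-list hash table.
-- The table stays frozen between operations, so the pre-6.6 GC does not
-- keep rescanning it; unsafeThaw aliases it briefly while we write.
insertBucket :: Int -> (String, Int) -> Array Int [(String, Int)] -> IO (Array Int [(String, Int)])
insertBucket slot entry table = do
  mutable <- unsafeThaw table :: IO (IOArray Int [(String, Int)])
  old     <- readArray mutable slot
  writeArray mutable slot (entry : old)
  unsafeFreeze mutable

Note that unsafeThaw/unsafeFreeze only work here because the frozen array is never used again after it has been thawed and updated.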

indicates that it triggers a bug in 6.4.1
Ah, I missed that. For my word-counting indexes, I've settled on Data.Map, calculating an Int or Integer hash for each word (depending on word length, which is fixed). I haven't given it nearly the effort the shootout programs have seen, though, so I'm not sure how optimal it is. Other experiences with FiniteMap/Data.Map etc. seem to indicate that they are in the same ballpark as Python's hashes.
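A rough sketch of that shape - packing each fixed-length word into an Int key and counting occurrences with Data.Map - assuming the words are over the DNA alphabet as in the benchmark (the encoding is illustrative, not Ketil's actual code):

import qualified Data.Map as M
import Data.List (foldl')

-- Pack a fixed-length DNA word into an Int, two bits per base.
base :: Char -> Int
base 'A' = 0
base 'C' = 1
base 'G' = 2
base 'T' = 3
base _   = 0

hashWord :: String -> Int
hashWord = foldl' (\h c -> h * 4 + base c) 0

-- Count word occurrences keyed by the packed Int.
countWords :: [String] -> M.Map Int Int
countWords = foldl' (\m w -> M.insertWith (+) (hashWord w) 1 m) M.empty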
We never pounded on Data.Map, but I suspect it cannot be as bad as Data.Hashtable.
Hmm...perhaps it is worth it, then? The benchmark may specify "hash table", but I think it is fair to interpret it as "associative data structure" - after all, people are using "associative arrays" that (presumably) don't guarantee a hash table underneath, and it can be argued that Data.Map is the canonical way to achieve that in Haskell. -k -- If I haven't seen further, it is by standing in the footprints of giants

On 2/10/06, Ketil Malde wrote:
Hmm...perhaps it is worth it, then? The benchmark may specify "hash table", but I think it is fair to interpret it as "associative data structure" - after all, people are using "associative arrays" that (presumably) don't guarantee a hash table underneath, and it can be argued that Data.Map is the canonical way to achieve that in Haskell.
Based on this advice, I wrote a k-nucleotide entry using the rough structure of the OCaml entry, but with the manual IO from Chris and Don's "Haskell #2" entry. It runs in under 4 seconds on my machine, more than ten times the speed of the fastest Data.HashTable entry. -- Brian T. Sniffen bts@alum.mit.edu or brian.sniffen@gmail.com http://www.evenmere.org/~bts
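The core of such an approach - counting every k-length substring of the sequence with Data.Map - might look roughly like the sketch below. This is not Brian's actual entry, which also handles the manual IO and the frequency output:

import qualified Data.Map as M
import Data.List (foldl', tails)

-- Count the occurrences of every k-length substring of a sequence.
kmerCounts :: Int -> String -> M.Map String Int
kmerCounts k s =
  foldl' (\m w -> M.insertWith (+) w 1 m) M.empty
         [ w | t <- tails s, let w = take k t, length w == k ]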

Brian Sniffen wrote:
On 2/10/06, Ketil Malde
wrote: Hmm...perhaps it is worth it, then? The benchmark may specify "hash table", but I think it is fair to interpret it as "associative data structure" - after all, people are using "associative arrays" that (presumably) don't guarantee a hash table underneath, and it can be argued that Data.Map is the canonical way to achieve that in Haskell.
Based on this advice, I wrote a k-nucleotide entry using the rough structure of the OCaml entry, but with the manual IO from Chris and Don's "Haskell #2" entry. It runs in under 4 seconds on my machine, more than ten times the speed of the fastest Data.HashTable entry.
I haven't been following this too closely, but could someone provide me with (or point me to) the badly performing Data.HashTable example, so we can measure our improvements? Cheers, Simon

Simon Marlow wrote:
Brian Sniffen wrote:
On 2/10/06, Ketil Malde
wrote: Hmm...perhaps it is worth it, then? The benchmark may specify "hash table", but I think it is fair to interpret it as "associative data structure" - after all, people are using "associative arrays" that (presumably) don't guarantee a hash table underneath, and it can be argued that Data.Map is the canonical way to achieve that in Haskell.
Based on this advice, I wrote a k-nucleotide entry using the rough structure of the OCaml entry, but with the manual IO from Chris and Don's "Haskell #2" entry. It runs in under 4 seconds on my machine, more than ten times the speed of the fastest Data.HashTable entry.
I haven't been following this too closely, but could someone provide me with (or point me to) the badly performing Data.HashTable example, so we can measure our improvements?
Cheers, Simon
From the shootout itself:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=knucleotide&lang=ghc&id=3
and
http://shootout.alioth.debian.org/gp4/benchmark.php?test=knucleotide&lang=ghc&id=2
(I forget the exact difference between them)
From the wiki (the Current Entry):
http://haskell.org/hawiki/KnucleotideEntry#head-dfcdad61d34153143175bb9f8237...

Not sure how relevant this is, but I see there is a recently released hash library here that might be a candidate for FFIing? https://sourceforge.net/projects/goog-sparsehash/ | An extremely memory-efficient hash_map implementation. 2 bits/entry | overhead! The SparseHash library contains several hash-map | implementations, including implementations that optimize for space | or speed. -k -- If I haven't seen further, it is by standing in the footprints of giants

On Wed, Feb 15, 2006 at 09:42:10AM +0100, Ketil Malde wrote:
Not sure how relevant this is, but I see there is a recently released hash library here that might be a candidate for FFIing?
https://sourceforge.net/projects/goog-sparsehash/
| An extremely memory-efficient hash_map implementation. 2 bits/entry | overhead! The SparseHash library contains several hash-map | implementations, including implementations that optimize for space | or speed.
If we want really fast maps, we should be using this; it beats the competition by far: http://judy.sourceforge.net/ John -- John Meacham - ⑆repetae.net⑆john⑈

On Feb 15, 2006, at 3:42 AM, Ketil Malde wrote:
Not sure how relevant this is, but I see there is a recently released hash library here that might be a candidate for FFIing?
The real issue isn't the algorithms involved; I saw the best performance from the stupidest hash algorithm (well, and from switching to multiplicative hashing rather than mod-k). The problem is GC of hash table elements. FFI-ing this library would give us really good algorithms, but the GC would then have to indirect through the FFI, and I'd expect that to make things *worse*, not better. -Jan
| An extremely memory-efficient hash_map implementation. 2 bits/entry | overhead! The SparseHash library contains several hash-map | implementations, including implementations that optimize for space | or speed.
-k -- If I haven't seen further, it is by standing in the footprints of giants
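Regarding the multiplicative hashing Jan mentions above, a generic sketch of the idea - multiply by a fixed constant and keep the high-order bits, rather than taking the key modulo the table size - looks roughly like this. The golden-ratio multiplier and power-of-two table size are the textbook choices, not the constants from Jan-Willem's HashTable:

import Data.Bits (shiftR)
import Data.Word (Word32)

tableBits :: Int
tableBits = 16                          -- table of 2^16 buckets

-- Multiplicative hashing: the wrap-around multiply mixes the key, and the
-- top tableBits bits of the product select the bucket.
multHash :: Word32 -> Int
multHash key = fromIntegral ((key * 2654435769) `shiftR` (32 - tableBits))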
participants (6)
- Brian Sniffen
- Chris Kuklewicz
- Jan-Willem Maessen
- John Meacham
- Ketil Malde
- Simon Marlow