lookup tables & style guidelines

Dear Haskell Devs ^_^,

1) what is the most performant lookup table/hashtable/dictionary solution for Haskell?
1.1) should I use size-balanced binary trees for that or is there a more common way?
2) are there any established style guidelines for Haskell code?

Best Regards,
Cetin Sert

cetin.sert:
Dear Haskell Devs ^_^,
1) what is the most performant lookup table/hashtable/dictionary solution for Haskell?
Data.IntMap is awfully good.
1.1) should I use size-balanced binary trees for that or is there a more common way?
I would. Data.Map/Data.IntMap are the standard choices.
2) are there any established style guidelines for Haskell code?
http://haskell.org/haskellwiki/Category:Style
Or pick an author you like, and look at their code on hackage.haskell.org
-- Don
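
For a concrete picture of what that looks like in practice, here is a minimal sketch of Data.IntMap used as a lookup table. The example itself is illustrative, not from the thread; the functions are the standard containers API.

import qualified Data.IntMap as IM

-- Build a small lookup table keyed on Int.
table :: IM.IntMap String
table = IM.fromList [(1, "one"), (2, "two"), (3, "three")]

main :: IO ()
main = do
  print (IM.lookup 2 table)                      -- Just "two"
  print (IM.lookup 9 table)                      -- Nothing
  print (IM.toList (IM.insert 4 "four" table))   -- persistent insert; original table unchanged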

Thanks Don...
You are amazing... o_O always so quick with replies...
I was using GraphViz to generate some directed graphs*; knowing what to use for a dict/map will help speed things up!
Cetin
* (for analytic tableaux in Okitsune+)
+ (need a better name; maybe I should ask Haskell-Cafe for one in the near future)

Don Stewart writes:
1) what is the most performant lookup table/hashtable/dictionary solution for Haskell?
Data.IntMap is awfully good.
Is it benchmarked anywhere? Compared to the Judy bindings, or Adrian Hey's AVL trees, or Data.Hashtable?

I rewrote (roughly) a Python program in Haskell, and my impression back then was that Python's associative arrays were faster than Haskell maps - but this could well have been back in the FiniteMap days, and I don't think I benchmarked very precisely. Anyway, there's a Google Summer of Code project that will hopefully produce some benchmarks of the different alternatives.

Data.Map tends to consume a lot of memory as well. But Data.(Int)Map is likely to be the easiest option available - I'd try that first, and if things are still too slow, profile, and then look for alternatives.

-k
-- If I haven't seen further, it is by standing in the footprints of giants
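
No benchmark appears in the thread itself, but a rough comparison is easy to sketch. The following uses the criterion library (an assumption on my part; it is not mentioned in the thread, and any timing harness would do) to time lookups over the same keys in Data.Map and Data.IntMap:

import Criterion.Main
import qualified Data.Map    as M
import qualified Data.IntMap as IM

keys :: [Int]
keys = [1 .. 100000]

mMap :: M.Map Int Int
mMap = M.fromList (zip keys keys)

iMap :: IM.IntMap Int
iMap = IM.fromList (zip keys keys)

main :: IO ()
main = defaultMain
  [ bench "Data.Map lookup"    $ nf (map (\k -> M.lookup  k mMap)) keys
  , bench "Data.IntMap lookup" $ nf (map (\k -> IM.lookup k iMap)) keys
  ]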

On Apr 24, 2008, at 2:31 , Ketil Malde wrote:
Don Stewart writes:
1) what is the most performant lookup table/hashtable/dictionary solution for Haskell?
Data.IntMap is awfully good.
Is it benchmarked anywhere? Compared to the Judy bindings, or Adrian Hey's AVL trees, or Data.Hashtable?
I don't know if anyone has benchmarked them recently, but some time back a comparison of Data.HashTable, Data.Map, and Data.IntMap was posted to one of the Haskell lists; HashTable was usually the slowest, and IntMap the fastest, because it can take advantage of optimizations due to the key being a specific type (among other things, IIRC the key is unboxed). This also reduces memory usage compared to the more general Data.Map.

-- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH
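
The "specific key type" point is visible in the types alone: Data.Map.lookup must work through an Ord dictionary for an arbitrary key type, while Data.IntMap.lookup is fixed to Int (IntMap is a big-endian Patricia trie over the key's binary representation) and can branch on the key's bits directly. A small sketch of the contrast, illustrative rather than from the original post:

import qualified Data.Map    as M
import qualified Data.IntMap as IM

-- Polymorphic: every comparison goes through the Ord class dictionary.
lookupGeneric :: Ord k => k -> M.Map k v -> Maybe v
lookupGeneric = M.lookup

-- Monomorphic: the key is known to be an Int, so GHC can unbox it
-- and the structure can index on its bits.
lookupInt :: Int -> IM.IntMap v -> Maybe v
lookupInt = IM.lookup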

Ketil Malde wrote:
Don Stewart writes:
1) what is the most performant lookup table/hashtable/dictionary solution for Haskell?
Data.IntMap is awfully good.
Is it benchmarked anywhere? Compared to the Judy bindings, or Adrian Hey's AVL trees, or Data.Hashtable?
I must get around to putting the AVL clones of Data.Map/Set in Hackage sometime. Meanwhile, anyone wanting to roll their own maps with an API of their choosing could do a lot worse than the raw AVL lib:
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/AvlTree-3.1

Also, if you're likely to be using union/intersection a lot, you should know that Data.Map/Set are very slow for this because they use the inefficient hedge algorithm :-)

There's a really cool demo I found via Wikipedia that shows why AVL trees perform so well in the "pathological" situation of sorted insertions:
http://www.csi.uottawa.ca/~stan/csi2514/applets/avl/BT.html
If you try it, you can see that after 2^n - 1 sorted insertions you always get a perfectly balanced tree. This explains these benchmark results:
http://groups.google.co.uk/group/comp.lang.functional/msg/74a422ea04ff1217
DData is what became Data.Map/Set, and it would appear to be the worst performing of the four tree types tested there :-(

Don is right about Data.IntMap/IntSet. For Ints it will be much faster than Data.Map/Set or even a (polymorphic) AVL tree. But an AVL tree of unboxed Ints gives faster lookup than IntMap/Set and about the same insert/delete times:
http://www.haskell.org/pipermail/libraries/2005-October/004518.html
Union and intersection times for AVL aren't so great, but I think I know what to do about that (especially intersection).

But the real way ahead has to be tries for non-trivial keys and (I suspect) AVL trees of unboxed Ints for simple keys (serialisable as one machine word). This is what that GSoC project is all about. At the moment we have the exact opposite: tries for Ints and balanced trees for non-trivial keys. Oh well..

Regards
-- Adrian Hey
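
As an aside on the sorted-insertion case: when the input is already in ascending order, Data.Map can sidestep repeated insertion entirely with fromDistinctAscList, which builds a balanced tree in O(n). A small sketch (illustrative, not from the original post):

import qualified Data.Map as M

-- 2^10 - 1 keys, already in ascending order.
sortedPairs :: [(Int, String)]
sortedPairs = [ (k, show k) | k <- [1 .. 1023] ]

viaInserts, viaAscList :: M.Map Int String
viaInserts = foldr (uncurry M.insert) M.empty sortedPairs  -- n inserts, O(n log n)
viaAscList = M.fromDistinctAscList sortedPairs             -- one pass, O(n)

main :: IO ()
main = print (viaInserts == viaAscList)                    -- True: same contents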

On Apr 24, 2008, at 11:33 AM, Adrian Hey wrote:
Also, if you're likely to be using union/intersection a lot, you should know that Data.Map/Set are very slow for this because they use the inefficient hedge algorithm :-)
OK, I'm going to bite here: what's the efficient algorithm for union on balanced trees, given that hedge union was chosen as being more efficient than the naive alternatives (split and merge, repeated insertion)? My working hypothesis has been "hedge union is an inefficient algorithm, except that it's better than all those other inefficient algorithms".

For IntSet/IntMap, of course, the split structure of the tree is fixed (we can think of these as compressed versions of a complete binary tree), and union and intersection are quite efficient.

-Jan-Willem Maessen

Jan-Willem Maessen wrote:
On Apr 24, 2008, at 11:33 AM, Adrian Hey wrote:
Also, if you're likely to be using union/intersection a lot, you should know that Data.Map/Set are very slow for this because they use the inefficient hedge algorithm :-)
OK, I'm going to bite here: what's the efficient algorithm for union on balanced trees, given that hedge union was chosen as being more efficient than the naive alternatives (split and merge, repeated insertion)? My working hypothesis has been "hedge union is an inefficient algorithm, except that it's better than all those other inefficient algorithms".
Divide and conquer seems to be the most efficient, though not the algorithm presented in the Adams paper. The hedge algorithm performs many more comparisons than are needed, which is obviously bad if you don't know how expensive those comparisons are going to be. IIRC it was something like 4-5 times as many for two sets of a million or so random Ints. But even in favourable circumstances (tree elements are boxed Ints), divide and conquer on AVL trees seemed much faster than hedge on Data.Set.

Of course, ideally we would want implementations of hedge for AVL and divide and conquer for Data.Set too, but I didn't feel inclined to write them.

Regards
-- Adrian Hey

On Apr 26, 2008, at 7:41 AM, Adrian Hey wrote:
Jan-Willem Maessen wrote:
On Apr 24, 2008, at 11:33 AM, Adrian Hey wrote:
Also, if you're likely to be using union/intersection a lot, you should know that Data.Map/Set are very slow for this because they use the inefficient hedge algorithm :-)
OK, I'm going to bite here: what's the efficient algorithm for union on balanced trees, given that hedge union was chosen as being more efficient than the naive alternatives (split and merge, repeated insertion)? My working hypothesis has been "hedge union is an inefficient algorithm, except that it's better than all those other inefficient algorithms".
Divide and conquer seems to be the most efficient, though not the algorithm presented in the Adams paper.
Just to clarify: divide and conquer splits one tree on the root value of the other (possibly avoiding enforcing the balance metric until after joining trees, though it's not obvious how, or if, that's useful)? The definition of "divide and conquer" on trees without a fixed structure is rather unclear, which is why the question comes up in the first place.
Hedge algorithm performs many more comparisons than are needed, which is obviously bad if you don't know how expensive those comparisons are going to be.
That makes sense. I found myself having the same kinds of thoughts when reading Knuth's analyses of (e.g.) different binary search algorithms in TAOCP vol. 3: if comparison was the dominant cost, which algorithm looked best suddenly changed.

-Jan

Jan-Willem Maessen wrote:
Just to clarify: divide and conquer splits one tree on the root value of the other (possibly avoiding enforcing the balance metric until after joining trees, though it's not obvious how, or if, that's useful)? The definition of "divide and conquer" on trees without a fixed structure is rather unclear, which is why the question comes up in the first place.
The divide and conquer algorithm presented in the Adams paper treats the two trees differently, depending on size. The algorithm used in the AVL lib doesn't do this; it treats them both the same. You split each tree by the root of the other tree, then do the sub-unions on the three resulting ranges, then join these using the two original roots as "bridging" elements. This seems to give better results, and also aesthetically (an important consideration) seems more natural.

With AVL I don't think there's really anything to be gained by not balancing intermediate trees, as balancing costs practically nothing and has obvious advantages.

Still, it's not perfect. If the two trees are about the same size, this still seemed to cost about 20% more comparisons than a naive list merge union would. It's a big win over lists if there's a substantial difference in tree sizes, though (and a big win over hedge in all cases, I think).

It would be nice if someone could do a good theoretical analysis of the performance of these algorithms. I didn't because my maths wasn't good enough. I just chose algorithms empirically to minimise comparison counts (not execution times), which is the right thing to do for polymorphic implementations.

Regards
-- Adrian Hey
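
To make the shape of these algorithms concrete, here is a minimal sketch of split-based divide-and-conquer union on a plain, unbalanced binary search tree. This is a toy model only: it is the simpler asymmetric variant (splitting one tree at the other's root) rather than the symmetric one described above, and the real AVL version also rebalances at each join.

data Tree a = Leaf | Node (Tree a) a (Tree a)

-- Split a tree into the elements strictly below and strictly above k.
split :: Ord a => a -> Tree a -> (Tree a, Tree a)
split _ Leaf = (Leaf, Leaf)
split k (Node l x r)
  | k < x     = let (ll, lr) = split k l in (ll, Node lr x r)
  | k > x     = let (rl, rr) = split k r in (Node l x rl, rr)
  | otherwise = (l, r)

-- Union by divide and conquer: split the second tree at the first
-- tree's root, recurse on the below/above ranges, and rejoin with
-- the root as the bridging element.
union :: Ord a => Tree a -> Tree a -> Tree a
union Leaf t = t
union t Leaf = t
union (Node l x r) t =
  let (tl, tr) = split x t
  in Node (union l tl) x (union r tr)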

What are good options for concurrent dictionaries? A while ago I wrote a concurrent hash table prototype, but there are probably better solutions for Haskell.

Regards,
Felix
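
One simple option, sketched below as an illustration rather than an established recommendation: keep a pure Data.Map inside a TVar and update it atomically with STM. It is coarse-grained (every write contends on one variable), but it is safe under concurrency, and the persistent map makes read snapshots cheap.

import Control.Concurrent.STM
import qualified Data.Map as M

-- A whole-map dictionary guarded by STM: simple, safe, coarse-grained.
type ConcDict k v = TVar (M.Map k v)

newDict :: IO (ConcDict k v)
newDict = newTVarIO M.empty

insertDict :: Ord k => ConcDict k v -> k -> v -> IO ()
insertDict d k v = atomically (modifyTVar' d (M.insert k v))

deleteDict :: Ord k => ConcDict k v -> k -> IO ()
deleteDict d k = atomically (modifyTVar' d (M.delete k))

lookupDict :: Ord k => ConcDict k v -> k -> IO (Maybe v)
lookupDict d k = M.lookup k <$> readTVarIO d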
participants (7)
- Adrian Hey
- Brandon S. Allbery KF8NH
- Cetin Sert
- Don Stewart
- Felix Martini
- Jan-Willem Maessen
- Ketil Malde