Improving containers library

Hi all, I have started an internship in Cambridge and will spend 3 months on Haskell. One of the possible projects is improving the containers library. Things I am thinking about: - measuring the performance of existing implementations (notably Map and Set) and possibly improve them (either without or with API changes), - adding Queue and a PriorityQueue, - maybe adding a generalized trie, - maybe adding a hashtable (more like a trie of the hashes, something in the line of Ideal hash trees; and maybe a 'classic' hash modifiable in the ST monad?) I would be grateful for any comments, thoughts or wishes. Milan Straka

Hi Milan,
That sounds like a great projects. Some thoughts/comment to each of the
points in your list.
On Wed, Mar 3, 2010 at 3:23 PM, Milan Straka
Hi all,
I have started an internship in Cambridge and will spend 3 months on Haskell. One of the possible projects is improving the containers library. Things I am thinking about: - measuring the performance of existing implementations (notably Map and Set) and possibly improve them (either without or with API changes),
In case you didn't know about it already, Criterion is an excellent benchmarking library. We've been using it extensively when benchmarking the event library. You can find some example benchmarks for a priority search queue at: http://github.com/tibbe/event/blob/master/benchmarks/PSQ.hs
- adding Queue and a PriorityQueue, - maybe adding a generalized trie, - maybe adding a hashtable (more like a trie of the hashes, something in the line of Ideal hash trees; and maybe a 'classic' hash modifiable in the ST monad?)
I would be grateful for any comments, thoughts or wishes.
Something I've been thinking about lately is how a containers library using associate data types would look like. Perhaps it would be interesting to investigate that as well. -- Johan

Hi Johan, thanks for the answer. Criterion is excellent, and there is also Progression build on the top of Criterion. Concerning the associated data types, they are definitely worth considering. Cheers, Milan
Hi Milan,
That sounds like a great projects. Some thoughts/comment to each of the points in your list.
On Wed, Mar 3, 2010 at 3:23 PM, Milan Straka
wrote: Hi all,
I have started an internship in Cambridge and will spend 3 months on Haskell. One of the possible projects is improving the containers library. Things I am thinking about: - measuring the performance of existing implementations (notably Map and Set) and possibly improve them (either without or with API changes),
In case you didn't know about it already, Criterion is an excellent benchmarking library. We've been using it extensively when benchmarking the event library. You can find some example benchmarks for a priority search queue at:
http://github.com/tibbe/event/blob/master/benchmarks/PSQ.hs
- adding Queue and a PriorityQueue, - maybe adding a generalized trie, - maybe adding a hashtable (more like a trie of the hashes, something in the line of Ideal hash trees; and maybe a 'classic' hash modifiable in the ST monad?)
I would be grateful for any comments, thoughts or wishes.
Something I've been thinking about lately is how a containers library using associate data types would look like. Perhaps it would be interesting to investigate that as well.
-- Johan

Hi Milan, On 03.03.2010, at 15:23, Milan Straka wrote:
Hi all,
I have started an internship in Cambridge and will spend 3 months on Haskell. One of the possible projects is improving the containers library. Things I am thinking about: - measuring the performance of existing implementations (notably Map and Set) and possibly improve them (either without or with API changes), - adding Queue and a PriorityQueue, - maybe adding a generalized trie, - maybe adding a hashtable (more like a trie of the hashes, something in the line of Ideal hash trees; and maybe a 'classic' hash modifiable in the ST monad?)
I would be grateful for any comments, thoughts or wishes.
In the context of program analysis, one tracks a lot of information about the state of a program at a given program point. This state is propagated around loops and conditionals. The implicit sharing of nodes in functional data structures are very well suited for this. However, whenever two states need to be joined at the loop head or at the end of a conditional, the two trees that represent the state need to be joined by performing some sort of join operation on each element of a tree. For all sensible domains a `join` a = a. Given that a loop or branch of a conditional usually touches only few variables, it is prudent not to perform the join operation on all elements of the tree. Instead, a tree-difference operation is required that traverses the two trees to be joined and calculates the difference between them. In particular, whenever two references are, in fact, the same pointer, the subtree does not need to be considered. This way, a join between two trees of size n reduces to joining only m elements where m is the maximum number of elements in each tree. This is a tremendous win for n >> m. I've implemented such a difference operation on a home-grown splay tree implementation. I would like to see this on standard AVL trees. Maybe other people could indicate if such a difference operation on trees would be useful in their applications. The difference operation in my implementation has the following signature: difference :: (Ord k) => (a -> a -> Bool) -> Map k a -> Map k a -> ([k], [k]) Calculate the difference between the two trees. Returns the elements that were in the first but not in the second and the elements that were in the second but not in the first. Both lists are unsorted. In case two element have the same key the given equality predicate is a applied to them. If the predicate returns False the elements are added to their respective difference lists. The contents of both trees remain unchanged. differenceElems :: (Ord k) => (a -> a -> Bool) -> Map k a -> Map k a -
([a], [a]) As difference, but extracts the elements, not their keys.
Cheers, Axel

Hi Axel, <snip>
In the context of program analysis, one tracks a lot of information about the state of a program at a given program point. This state is propagated around loops and conditionals. The implicit sharing of nodes in functional data structures are very well suited for this.
However, whenever two states need to be joined at the loop head or at the end of a conditional, the two trees that represent the state need to be joined by performing some sort of join operation on each element of a tree. For all sensible domains a `join` a = a. Given that a loop or branch of a conditional usually touches only few variables, it is prudent not to perform the join operation on all elements of the tree. Instead, a tree-difference operation is required that traverses the two trees to be joined and calculates the difference between them. In particular, whenever two references are, in fact, the same pointer, the subtree does not need to be considered. This way, a join between two trees of size n reduces to joining only m elements where m is the maximum number of elements in each tree. This is a tremendous win for n
m.
Any easy way of comparing pointers? I mean, if I have something like Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"? The join would probably take O(m log n), because an insertion of 1 element changes the whole path to the new element, which is probably of length O(log n), so log n pointers change in the structure. You also need this time O(m log n) to construct the new data structure, surely for small m (one could hope for O(m log (n/m)). Still O(m log n) is definitely better than n for m << n :) Milan

On 03.03.2010, at 17:30, Milan Straka wrote:
Any easy way of comparing pointers? I mean, if I have something like Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"?
You can! Despite the names appearing in the following code, the following is safe: -- | Equality on pointers. ptrEqual :: a -> a -> Bool ptrEqual x y = unsafePerformIO $ do nx <- makeStableName x ny <- makeStableName y return (nx==ny) Axel

On Wed, 2010-03-03 at 18:58 +0100, Axel Simon wrote:
On 03.03.2010, at 17:30, Milan Straka wrote:
Any easy way of comparing pointers? I mean, if I have something like Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"?
You can! Despite the names appearing in the following code, the following is safe:
-- | Equality on pointers. ptrEqual :: a -> a -> Bool ptrEqual x y = unsafePerformIO $ do nx <- makeStableName x ny <- makeStableName y return (nx==ny)
Safe but not pure? jcc

You should seq or use a bangpattern on the arguments before taking the stable names to ensure that the result is reproducible if the input thunks may later be forced.
let y = 1 + 1; x = id y in (ptrEqual x y, x `seq` y `seq` ptrEqual x y) (False,True)
a === b = a `seq` b `seq` unsafePerformIO $ (==) <$> makeStableName a <*> makeStableName b
let y = 1 + 1; x = id y in (x === y, x `seq` y `seq` x === y) (True,True)
This may be less of a problem if your tree has bang patterns, but even there
it it sometimes easy to trip up and make a mistake with the unforced root of
a tree.
-Edward Kmett
On Wed, Mar 3, 2010 at 1:08 PM, Jonathan Cast
On Wed, 2010-03-03 at 18:58 +0100, Axel Simon wrote:
On 03.03.2010, at 17:30, Milan Straka wrote:
Any easy way of comparing pointers? I mean, if I have something like Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"?
You can! Despite the names appearing in the following code, the following is safe:
-- | Equality on pointers. ptrEqual :: a -> a -> Bool ptrEqual x y = unsafePerformIO $ do nx <- makeStableName x ny <- makeStableName y return (nx==ny)
Safe but not pure?
jcc
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Hi, thanks for enlightening me :) I wished for something like this some time ago... Milan
You should seq or use a bangpattern on the arguments before taking the stable names to ensure that the result is reproducible if the input thunks may later be forced.
let y = 1 + 1; x = id y in (ptrEqual x y, x `seq` y `seq` ptrEqual x y) (False,True)
a === b = a `seq` b `seq` unsafePerformIO $ (==) <$> makeStableName a <*> makeStableName b
let y = 1 + 1; x = id y in (x === y, x `seq` y `seq` x === y) (True,True)
This may be less of a problem if your tree has bang patterns, but even there it it sometimes easy to trip up and make a mistake with the unforced root of a tree.
-Edward Kmett
On Wed, Mar 3, 2010 at 1:08 PM, Jonathan Cast
wrote: On Wed, 2010-03-03 at 18:58 +0100, Axel Simon wrote:
On 03.03.2010, at 17:30, Milan Straka wrote:
Any easy way of comparing pointers? I mean, if I have something like Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"?
You can! Despite the names appearing in the following code, the following is safe:
-- | Equality on pointers. ptrEqual :: a -> a -> Bool ptrEqual x y = unsafePerformIO $ do nx <- makeStableName x ny <- makeStableName y return (nx==ny)
Safe but not pure?
jcc
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

I've used it in a set of bottom-up applicative parser combinators, so that I
could make them provide an Applicative interface, but be warned, allocation
of a stable name involves a fair bit more than just doing a pointer
comparison. IIRC, Andy Gill has a paper on a related approach, which he uses
in his version of Lava.
Although I haven't benchmarked it, I have a strong suspicion that it is
likely cheaper to just unsafePerformIO to allocate a new IORef (), add that
as a member, and to check those for equality. Of course you then need to be
disciplined about not sharing the refs. This is the approach taken by the
mainstream Lava distribution.
Also, be warned that common subexpression elimination might yield more
positives than you would expect from a naive source reading.
-Edward Kmett
On Wed, Mar 3, 2010 at 1:59 PM, Milan Straka
Hi,
thanks for enlightening me :) I wished for something like this some time ago...
Milan
You should seq or use a bangpattern on the arguments before taking the stable names to ensure that the result is reproducible if the input thunks may later be forced.
let y = 1 + 1; x = id y in (ptrEqual x y, x `seq` y `seq` ptrEqual x y) (False,True)
a === b = a `seq` b `seq` unsafePerformIO $ (==) <$> makeStableName a <*> makeStableName b
let y = 1 + 1; x = id y in (x === y, x `seq` y `seq` x === y) (True,True)
This may be less of a problem if your tree has bang patterns, but even there it it sometimes easy to trip up and make a mistake with the unforced root of a tree.
-Edward Kmett
On Wed, Mar 3, 2010 at 1:08 PM, Jonathan Cast
On 03.03.2010, at 17:30, Milan Straka wrote:
Any easy way of comparing pointers? I mean, if I have something
On Wed, 2010-03-03 at 18:58 +0100, Axel Simon wrote: like
Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"?
You can! Despite the names appearing in the following code, the following is safe:
-- | Equality on pointers. ptrEqual :: a -> a -> Bool ptrEqual x y = unsafePerformIO $ do nx <- makeStableName x ny <- makeStableName y return (nx==ny)
Safe but not pure?
jcc
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Axel Simon wrote:
Any easy way of comparing pointers? I mean, if I have something like Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"?
You can! Despite the names appearing in the following code, the following is safe:
-- | Equality on pointers. ptrEqual :: a -> a -> Bool ptrEqual x y = unsafePerformIO $ do nx <- makeStableName x ny <- makeStableName y return (nx==ny)
{-# LANGUAGE MagicHash #-} import GHC.Exts ptrEqual' :: a -> a -> Bool ptrEqual' x y = case reallyUnsafePtrEquality# x y of 0# -> False 1# -> True This is actually a pointer comparison. I believe it can produce false negatives though, because indirections are not followed. If that's correct, a false negative may be turned into a positive by garbage collection. So use with care. regards, Bertram

On 06.03.2010, at 18:19, Bertram Felgenhauer wrote:
Axel Simon wrote:
Any easy way of comparing pointers? I mean, if I have something like Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"?
You can! Despite the names appearing in the following code, the following is safe:
-- | Equality on pointers. ptrEqual :: a -> a -> Bool ptrEqual x y = unsafePerformIO $ do nx <- makeStableName x ny <- makeStableName y return (nx==ny)
{-# LANGUAGE MagicHash #-} import GHC.Exts
ptrEqual' :: a -> a -> Bool ptrEqual' x y = case reallyUnsafePtrEquality# x y of 0# -> False 1# -> True
This is actually a pointer comparison. I believe it can produce false negatives though, because indirections are not followed. If that's correct, a false negative may be turned into a positive by garbage collection. So use with care.
Interesting. But if reallyUnsafePtrEquality# is a primitive, it is evaluated either before or after GC, so it can't compare a pointer from one generation with the next. Having ptrEqual return False if, in fact, the two values are identical is not a problem for the considered data structure. It merely implies that a structural comparison is performed which is slower. Cheers, Axel.
regards,
Bertram _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Axel Simon wrote:
On 06.03.2010, at 18:19, Bertram Felgenhauer wrote:
ptrEqual' :: a -> a -> Bool ptrEqual' x y = case reallyUnsafePtrEquality# x y of 0# -> False 1# -> True
This is actually a pointer comparison. I believe it can produce false negatives though, because indirections are not followed. If that's correct, a false negative may be turned into a positive by garbage collection. So use with care.
Interesting. But if reallyUnsafePtrEquality# is a primitive, it is evaluated either before or after GC, so it can't compare a pointer from one generation with the next.
That is true. But besides GC, there is another source of indirections: updating a thunk with its final value if it's too large, or with a different thunk, which happens when an asynchronous exception is raised. regards, Bertram

On Sat, Mar 06, 2010 at 06:19:46PM +0100, Bertram Felgenhauer wrote:
This is actually a pointer comparison. I believe it can produce false negatives though, because indirections are not followed. If that's correct, a false negative may be turned into a positive by garbage collection. So use with care.
I am not sure what your intended use is, but a less obvious caveat when using this technique to replace a use of (==) is that (==) is generally strict, forcing a whole data structure, while pointer equality will not. this may lead to a space leak if you arn't careful. It may not be an issue for what you are doing though. John -- John Meacham - ⑆repetae.net⑆john⑈ - http://notanumber.net/

On Wed, Mar 03, 2010 at 03:23:01PM +0100, Milan Straka wrote:
I have started an internship in Cambridge and will spend 3 months on Haskell. One of the possible projects is improving the containers library. Things I am thinking about: - measuring the performance of existing implementations (notably Map and Set) and possibly improve them (either without or with API changes), - adding Queue and a PriorityQueue, - maybe adding a generalized trie, - maybe adding a hashtable (more like a trie of the hashes, something in the line of Ideal hash trees; and maybe a 'classic' hash modifiable in the ST monad?)
The containers package aims to provide general purpose immutable containers in portable Haskell. So things like generalized tries or mutable hash tables, while worthwhile, should be in different packages. Ordinary tries (i.e. ListMaps) would fit though. Data.Sequence already provides a queue. A Banker's queue would be a little faster, but IMO the difference isn't enough to justify the special case. The two-stack implementation would be faster, but it performs poorly in a persistent setting, so fails the "general purpose" criterion. A priority queue and (priority search queue) would be useful. It would be important to have an efficient union operation. Then you have to decide whether this needs to be stable. It's a trade-off between speed and utility. Skew heaps are simple to implement and very fast, but don't support stable union. Implementations that do seem to be a bit slower.

Ross Paterson wrote:
On Wed, Mar 03, 2010 at 03:23:01PM +0100, Milan Straka wrote:
I have started an internship in Cambridge and will spend 3 months on Haskell. One of the possible projects is improving the containers library. Things I am thinking about: - measuring the performance of existing implementations (notably Map and Set) and possibly improve them (either without or with API changes), - adding Queue and a PriorityQueue, - maybe adding a generalized trie, - maybe adding a hashtable (more like a trie of the hashes, something in the line of Ideal hash trees; and maybe a 'classic' hash modifiable in the ST monad?)
Data.Sequence already provides a queue. A Banker's queue would be a little faster, but IMO the difference isn't enough to justify the special case. The two-stack implementation would be faster, but it performs poorly in a persistent setting, so fails the "general purpose" criterion.
How about the real time version of the banker's queue? That would be a definite improvement upon Data.Sequence , at least. Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

On Thu, Mar 04, 2010 at 03:40:56PM +0100, Heinrich Apfelmus wrote:
How about the real time version of the banker's queue? That would be a definite improvement upon Data.Sequence, at least.
We used to have that. It gives the real time guarantee, but the average performance is about twice as slow as Data.Sequence.

Ross Paterson wrote:
On Thu, Mar 04, 2010 at 03:40:56PM +0100, Heinrich Apfelmus wrote:
How about the real time version of the banker's queue? That would be a definite improvement upon Data.Sequence, at least.
We used to have that. It gives the real time guarantee, but the average performance is about twice as slow as Data.Sequence.
Oh? I would have thought that the finger tree has a higher constant factor since every element has to be "carried over" a few times while traveling from back to front. In contrast, it's clear that the real time queue touches each element only 2-3 times (add to rear, force the suspension in the front, remove from front). Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

Hello,
I have a somewhat related project suggestion. As far as I can tell there is
no good Set / Map like data structure which supports multiple keys.
The closest option is happstack-ixset. However, it provides no real type
safety, and I suspect it is not very efficient:
http://hackage.haskell.org/package/happstack-ixset
As an experiment I started a data-type loosely based on kd-trees and the
existing Data.Map:
http://src.seereason.com/haskell-kdmap/
This type supports two keys. You can do looks using one or both keys.
Ideally, though, you would be able to support an arbitrary number of keys,
and do queries that are more complex than a straight lookup.
The aim is to produce a type which supports many of the same operations as a
normal database table. (But with out going so far as to be exactly like a
database table).
- jeremy
On Wed, Mar 3, 2010 at 8:23 AM, Milan Straka
Hi all,
I have started an internship in Cambridge and will spend 3 months on Haskell. One of the possible projects is improving the containers library. Things I am thinking about: - measuring the performance of existing implementations (notably Map and Set) and possibly improve them (either without or with API changes), - adding Queue and a PriorityQueue, - maybe adding a generalized trie, - maybe adding a hashtable (more like a trie of the hashes, something in the line of Ideal hash trees; and maybe a 'classic' hash modifiable in the ST monad?)
I would be grateful for any comments, thoughts or wishes.
Milan Straka _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
participants (10)
-
Axel Simon
-
Bertram Felgenhauer
-
Edward Kmett
-
Heinrich Apfelmus
-
Jeremy Shaw
-
Johan Tibell
-
John Meacham
-
Jonathan Cast
-
Milan Straka
-
Ross Paterson