Improving containers library

older
Progress on priority queues for...

Milan Straka

3 Mar 2010 3 Mar '10

2:23 p.m.

Hi all, I have started an internship in Cambridge and will spend 3 months on Haskell. One of the possible projects is improving the containers library. Things I am thinking about: - measuring the performance of existing implementations (notably Map and Set) and possibly improve them (either without or with API changes), - adding Queue and a PriorityQueue, - maybe adding a generalized trie, - maybe adding a hashtable (more like a trie of the hashes, something in the line of Ideal hash trees; and maybe a 'classic' hash modifiable in the ST monad?) I would be grateful for any comments, thoughts or wishes. Milan Straka

Show replies by date

Johan Tibell

3 Mar 3 Mar

3:13 p.m.

Hi Milan, That sounds like a great projects. Some thoughts/comment to each of the points in your list. On Wed, Mar 3, 2010 at 3:23 PM, Milan Straka wrote:

...

Hi all,

I have started an internship in Cambridge and will spend 3 months on Haskell. One of the possible projects is improving the containers library. Things I am thinking about: - measuring the performance of existing implementations (notably Map and Set) and possibly improve them (either without or with API changes),

In case you didn't know about it already, Criterion is an excellent benchmarking library. We've been using it extensively when benchmarking the event library. You can find some example benchmarks for a priority search queue at: http://github.com/tibbe/event/blob/master/benchmarks/PSQ.hs

...

- adding Queue and a PriorityQueue, - maybe adding a generalized trie, - maybe adding a hashtable (more like a trie of the hashes, something in the line of Ideal hash trees; and maybe a 'classic' hash modifiable in the ST monad?)

I would be grateful for any comments, thoughts or wishes.

...

Something I've been thinking about lately is how a containers library using associate data types would look like. Perhaps it would be interesting to investigate that as well. -- Johan

Milan Straka

4:21 p.m.

Hi Johan, thanks for the answer. Criterion is excellent, and there is also Progression build on the top of Criterion. Concerning the associated data types, they are definitely worth considering. Cheers, Milan

...

Hi Milan,

That sounds like a great projects. Some thoughts/comment to each of the points in your list.

On Wed, Mar 3, 2010 at 3:23 PM, Milan Straka wrote:

...
Hi all,

I have started an internship in Cambridge and will spend 3 months on Haskell. One of the possible projects is improving the containers library. Things I am thinking about: - measuring the performance of existing implementations (notably Map and Set) and possibly improve them (either without or with API changes),

In case you didn't know about it already, Criterion is an excellent benchmarking library. We've been using it extensively when benchmarking the event library. You can find some example benchmarks for a priority search queue at:

http://github.com/tibbe/event/blob/master/benchmarks/PSQ.hs

...
- adding Queue and a PriorityQueue, - maybe adding a generalized trie, - maybe adding a hashtable (more like a trie of the hashes, something in the line of Ideal hash trees; and maybe a 'classic' hash modifiable in the ST monad?)

I would be grateful for any comments, thoughts or wishes.

...
Something I've been thinking about lately is how a containers library using associate data types would look like. Perhaps it would be interesting to investigate that as well.

-- Johan

Axel Simon

3:28 p.m.

Hi Milan, On 03.03.2010, at 15:23, Milan Straka wrote:

...

Hi all,

I have started an internship in Cambridge and will spend 3 months on Haskell. One of the possible projects is improving the containers library. Things I am thinking about: - measuring the performance of existing implementations (notably Map and Set) and possibly improve them (either without or with API changes), - adding Queue and a PriorityQueue, - maybe adding a generalized trie, - maybe adding a hashtable (more like a trie of the hashes, something in the line of Ideal hash trees; and maybe a 'classic' hash modifiable in the ST monad?)

I would be grateful for any comments, thoughts or wishes.

In the context of program analysis, one tracks a lot of information about the state of a program at a given program point. This state is propagated around loops and conditionals. The implicit sharing of nodes in functional data structures are very well suited for this. However, whenever two states need to be joined at the loop head or at the end of a conditional, the two trees that represent the state need to be joined by performing some sort of join operation on each element of a tree. For all sensible domains a `join` a = a. Given that a loop or branch of a conditional usually touches only few variables, it is prudent not to perform the join operation on all elements of the tree. Instead, a tree-difference operation is required that traverses the two trees to be joined and calculates the difference between them. In particular, whenever two references are, in fact, the same pointer, the subtree does not need to be considered. This way, a join between two trees of size n reduces to joining only m elements where m is the maximum number of elements in each tree. This is a tremendous win for n >> m. I've implemented such a difference operation on a home-grown splay tree implementation. I would like to see this on standard AVL trees. Maybe other people could indicate if such a difference operation on trees would be useful in their applications. The difference operation in my implementation has the following signature: difference :: (Ord k) => (a -> a -> Bool) -> Map k a -> Map k a -> ([k], [k]) Calculate the difference between the two trees. Returns the elements that were in the first but not in the second and the elements that were in the second but not in the first. Both lists are unsorted. In case two element have the same key the given equality predicate is a applied to them. If the predicate returns False the elements are added to their respective difference lists. The contents of both trees remain unchanged. differenceElems :: (Ord k) => (a -> a -> Bool) -> Map k a -> Map k a -

...

([a], [a]) As difference, but extracts the elements, not their keys.

Cheers, Axel

Milan Straka

4:30 p.m.

Hi Axel, <snip>

...

In the context of program analysis, one tracks a lot of information about the state of a program at a given program point. This state is propagated around loops and conditionals. The implicit sharing of nodes in functional data structures are very well suited for this.

However, whenever two states need to be joined at the loop head or at the end of a conditional, the two trees that represent the state need to be joined by performing some sort of join operation on each element of a tree. For all sensible domains a `join` a = a. Given that a loop or branch of a conditional usually touches only few variables, it is prudent not to perform the join operation on all elements of the tree. Instead, a tree-difference operation is required that traverses the two trees to be joined and calculates the difference between them. In particular, whenever two references are, in fact, the same pointer, the subtree does not need to be considered. This way, a join between two trees of size n reduces to joining only m elements where m is the maximum number of elements in each tree. This is a tremendous win for n

...
...
m.

Any easy way of comparing pointers? I mean, if I have something like Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"? The join would probably take O(m log n), because an insertion of 1 element changes the whole path to the new element, which is probably of length O(log n), so log n pointers change in the structure. You also need this time O(m log n) to construct the new data structure, surely for small m (one could hope for O(m log (n/m)). Still O(m log n) is definitely better than n for m << n :) Milan

Axel Simon

5:58 p.m.

On 03.03.2010, at 17:30, Milan Straka wrote:

...

Any easy way of comparing pointers? I mean, if I have something like Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"?

You can! Despite the names appearing in the following code, the following is safe: -- | Equality on pointers. ptrEqual :: a -> a -> Bool ptrEqual x y = unsafePerformIO $ do nx <- makeStableName x ny <- makeStableName y return (nx==ny) Axel

Jonathan Cast

6:08 p.m.

On Wed, 2010-03-03 at 18:58 +0100, Axel Simon wrote:

...

On 03.03.2010, at 17:30, Milan Straka wrote:

...
Any easy way of comparing pointers? I mean, if I have something like Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"?

You can! Despite the names appearing in the following code, the following is safe:

-- | Equality on pointers. ptrEqual :: a -> a -> Bool ptrEqual x y = unsafePerformIO $ do nx <- makeStableName x ny <- makeStableName y return (nx==ny)

Safe but not pure? jcc

Edward Kmett

6:38 p.m.

You should seq or use a bangpattern on the arguments before taking the stable names to ensure that the result is reproducible if the input thunks may later be forced.

...

let y = 1 + 1; x = id y in (ptrEqual x y, x `seq` y `seq` ptrEqual x y) (False,True)

a === b = a `seq` b `seq` unsafePerformIO $ (==) <$> makeStableName a <*> makeStableName b

...

let y = 1 + 1; x = id y in (x === y, x `seq` y `seq` x === y) (True,True)

This may be less of a problem if your tree has bang patterns, but even there it it sometimes easy to trip up and make a mistake with the unforced root of a tree. -Edward Kmett On Wed, Mar 3, 2010 at 1:08 PM, Jonathan Cast wrote:

...

On Wed, 2010-03-03 at 18:58 +0100, Axel Simon wrote:

...
On 03.03.2010, at 17:30, Milan Straka wrote:

...
Any easy way of comparing pointers? I mean, if I have something like Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"?

You can! Despite the names appearing in the following code, the following is safe:

-- | Equality on pointers. ptrEqual :: a -> a -> Bool ptrEqual x y = unsafePerformIO $ do nx <- makeStableName x ny <- makeStableName y return (nx==ny)

Safe but not pure?

jcc

_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Milan Straka

6:59 p.m.

Hi, thanks for enlightening me :) I wished for something like this some time ago... Milan

...

You should seq or use a bangpattern on the arguments before taking the stable names to ensure that the result is reproducible if the input thunks may later be forced.

...
let y = 1 + 1; x = id y in (ptrEqual x y, x `seq` y `seq` ptrEqual x y) (False,True)

a === b = a `seq` b `seq` unsafePerformIO $ (==) <$> makeStableName a <*> makeStableName b

...
let y = 1 + 1; x = id y in (x === y, x `seq` y `seq` x === y) (True,True)

This may be less of a problem if your tree has bang patterns, but even there it it sometimes easy to trip up and make a mistake with the unforced root of a tree.

-Edward Kmett

On Wed, Mar 3, 2010 at 1:08 PM, Jonathan Cast wrote:

...
On Wed, 2010-03-03 at 18:58 +0100, Axel Simon wrote:

...
On 03.03.2010, at 17:30, Milan Straka wrote:

...
Any easy way of comparing pointers? I mean, if I have something like Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"?

You can! Despite the names appearing in the following code, the following is safe:

-- | Equality on pointers. ptrEqual :: a -> a -> Bool ptrEqual x y = unsafePerformIO $ do nx <- makeStableName x ny <- makeStableName y return (nx==ny)

Safe but not pure?

jcc

_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

...

_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Edward Kmett

7:11 p.m.

I've used it in a set of bottom-up applicative parser combinators, so that I could make them provide an Applicative interface, but be warned, allocation of a stable name involves a fair bit more than just doing a pointer comparison. IIRC, Andy Gill has a paper on a related approach, which he uses in his version of Lava. Although I haven't benchmarked it, I have a strong suspicion that it is likely cheaper to just unsafePerformIO to allocate a new IORef (), add that as a member, and to check those for equality. Of course you then need to be disciplined about not sharing the refs. This is the approach taken by the mainstream Lava distribution. Also, be warned that common subexpression elimination might yield more positives than you would expect from a naive source reading. -Edward Kmett On Wed, Mar 3, 2010 at 1:59 PM, Milan Straka wrote:

...

Hi,

thanks for enlightening me :) I wished for something like this some time ago...

Milan

...
You should seq or use a bangpattern on the arguments before taking the stable names to ensure that the result is reproducible if the input thunks may later be forced.

...
let y = 1 + 1; x = id y in (ptrEqual x y, x `seq` y `seq` ptrEqual x y) (False,True)

a === b = a `seq` b `seq` unsafePerformIO $ (==) <$> makeStableName a <*> makeStableName b

...
let y = 1 + 1; x = id y in (x === y, x `seq` y `seq` x === y) (True,True)

This may be less of a problem if your tree has bang patterns, but even there it it sometimes easy to trip up and make a mistake with the unforced root of a tree.

-Edward Kmett

On Wed, Mar 3, 2010 at 1:08 PM, Jonathan Cast
...
...
On 03.03.2010, at 17:30, Milan Straka wrote:

...
Any easy way of comparing pointers? I mean, if I have something

On Wed, 2010-03-03 at 18:58 +0100, Axel Simon wrote: like

...
...
Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"?

You can! Despite the names appearing in the following code, the following is safe:

-- | Equality on pointers. ptrEqual :: a -> a -> Bool ptrEqual x y = unsafePerformIO $ do nx <- makeStableName x ny <- makeStableName y return (nx==ny)

Safe but not pure?

jcc

_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

...
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Bertram Felgenhauer

6 Mar 6 Mar

5:19 p.m.

Axel Simon wrote:

...

...
Any easy way of comparing pointers? I mean, if I have something like Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"?

You can! Despite the names appearing in the following code, the following is safe:

-- | Equality on pointers. ptrEqual :: a -> a -> Bool ptrEqual x y = unsafePerformIO $ do nx <- makeStableName x ny <- makeStableName y return (nx==ny)

{-# LANGUAGE MagicHash #-} import GHC.Exts ptrEqual' :: a -> a -> Bool ptrEqual' x y = case reallyUnsafePtrEquality# x y of 0# -> False 1# -> True This is actually a pointer comparison. I believe it can produce false negatives though, because indirections are not followed. If that's correct, a false negative may be turned into a positive by garbage collection. So use with care. regards, Bertram

Axel Simon

7 Mar 7 Mar

3:19 p.m.

On 06.03.2010, at 18:19, Bertram Felgenhauer wrote:

...

Axel Simon wrote:

...
...
Any easy way of comparing pointers? I mean, if I have something like Tree a = N | B (Tree a) a (Tree a) and have (l::Tree a) and (r::Tree a), can I ask about the "physical equality"?

You can! Despite the names appearing in the following code, the following is safe:

-- | Equality on pointers. ptrEqual :: a -> a -> Bool ptrEqual x y = unsafePerformIO $ do nx <- makeStableName x ny <- makeStableName y return (nx==ny)

{-# LANGUAGE MagicHash #-} import GHC.Exts

ptrEqual' :: a -> a -> Bool ptrEqual' x y = case reallyUnsafePtrEquality# x y of 0# -> False 1# -> True

This is actually a pointer comparison. I believe it can produce false negatives though, because indirections are not followed. If that's correct, a false negative may be turned into a positive by garbage collection. So use with care.

Interesting. But if reallyUnsafePtrEquality# is a primitive, it is evaluated either before or after GC, so it can't compare a pointer from one generation with the next. Having ptrEqual return False if, in fact, the two values are identical is not a problem for the considered data structure. It merely implies that a structural comparison is performed which is slower. Cheers, Axel.

...

regards,

Bertram _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Bertram Felgenhauer

8 Mar 8 Mar

10:35 a.m.

Axel Simon wrote:

...

On 06.03.2010, at 18:19, Bertram Felgenhauer wrote:

...
ptrEqual' :: a -> a -> Bool ptrEqual' x y = case reallyUnsafePtrEquality# x y of 0# -> False 1# -> True

This is actually a pointer comparison. I believe it can produce false negatives though, because indirections are not followed. If that's correct, a false negative may be turned into a positive by garbage collection. So use with care.

Interesting. But if reallyUnsafePtrEquality# is a primitive, it is evaluated either before or after GC, so it can't compare a pointer from one generation with the next.

That is true. But besides GC, there is another source of indirections: updating a thunk with its final value if it's too large, or with a different thunk, which happens when an asynchronous exception is raised. regards, Bertram

John Meacham

7:33 p.m.

On Sat, Mar 06, 2010 at 06:19:46PM +0100, Bertram Felgenhauer wrote:

...

This is actually a pointer comparison. I believe it can produce false negatives though, because indirections are not followed. If that's correct, a false negative may be turned into a positive by garbage collection. So use with care.

I am not sure what your intended use is, but a less obvious caveat when using this technique to replace a use of (==) is that (==) is generally strict, forcing a whole data structure, while pointer equality will not. this may lead to a space leak if you arn't careful. It may not be an issue for what you are doing though. John -- John Meacham - ⑆repetae.net⑆john⑈ - http://notanumber.net/

Ross Paterson

4 Mar 4 Mar

11 a.m.

On Wed, Mar 03, 2010 at 03:23:01PM +0100, Milan Straka wrote:

...

I have started an internship in Cambridge and will spend 3 months on Haskell. One of the possible projects is improving the containers library. Things I am thinking about: - measuring the performance of existing implementations (notably Map and Set) and possibly improve them (either without or with API changes), - adding Queue and a PriorityQueue, - maybe adding a generalized trie, - maybe adding a hashtable (more like a trie of the hashes, something in the line of Ideal hash trees; and maybe a 'classic' hash modifiable in the ST monad?)

The containers package aims to provide general purpose immutable containers in portable Haskell. So things like generalized tries or mutable hash tables, while worthwhile, should be in different packages. Ordinary tries (i.e. ListMaps) would fit though. Data.Sequence already provides a queue. A Banker's queue would be a little faster, but IMO the difference isn't enough to justify the special case. The two-stack implementation would be faster, but it performs poorly in a persistent setting, so fails the "general purpose" criterion. A priority queue and (priority search queue) would be useful. It would be important to have an efficient union operation. Then you have to decide whether this needs to be stable. It's a trade-off between speed and utility. Skew heaps are simple to implement and very fast, but don't support stable union. Implementations that do seem to be a bit slower.

Heinrich Apfelmus

2:40 p.m.

Ross Paterson wrote:

...

On Wed, Mar 03, 2010 at 03:23:01PM +0100, Milan Straka wrote:

...
I have started an internship in Cambridge and will spend 3 months on Haskell. One of the possible projects is improving the containers library. Things I am thinking about: - measuring the performance of existing implementations (notably Map and Set) and possibly improve them (either without or with API changes), - adding Queue and a PriorityQueue, - maybe adding a generalized trie, - maybe adding a hashtable (more like a trie of the hashes, something in the line of Ideal hash trees; and maybe a 'classic' hash modifiable in the ST monad?)

Data.Sequence already provides a queue. A Banker's queue would be a little faster, but IMO the difference isn't enough to justify the special case. The two-stack implementation would be faster, but it performs poorly in a persistent setting, so fails the "general purpose" criterion.

How about the real time version of the banker's queue? That would be a definite improvement upon Data.Sequence , at least. Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

Ross Paterson

3:27 p.m.

On Thu, Mar 04, 2010 at 03:40:56PM +0100, Heinrich Apfelmus wrote:

...

How about the real time version of the banker's queue? That would be a definite improvement upon Data.Sequence, at least.

We used to have that. It gives the real time guarantee, but the average performance is about twice as slow as Data.Sequence.

Heinrich Apfelmus

5 Mar 5 Mar

10:04 a.m.

Ross Paterson wrote:

...

On Thu, Mar 04, 2010 at 03:40:56PM +0100, Heinrich Apfelmus wrote:

...
How about the real time version of the banker's queue? That would be a definite improvement upon Data.Sequence, at least.

We used to have that. It gives the real time guarantee, but the average performance is about twice as slow as Data.Sequence.

Oh? I would have thought that the finger tree has a higher constant factor since every element has to be "carried over" a few times while traveling from back to front. In contrast, it's clear that the real time queue touches each element only 2-3 times (add to rear, force the suspension in the front, remove from front). Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

Jeremy Shaw

12:01 a.m.

Hello, I have a somewhat related project suggestion. As far as I can tell there is no good Set / Map like data structure which supports multiple keys. The closest option is happstack-ixset. However, it provides no real type safety, and I suspect it is not very efficient: http://hackage.haskell.org/package/happstack-ixset As an experiment I started a data-type loosely based on kd-trees and the existing Data.Map: http://src.seereason.com/haskell-kdmap/ This type supports two keys. You can do looks using one or both keys. Ideally, though, you would be able to support an arbitrary number of keys, and do queries that are more complex than a straight lookup. The aim is to produce a type which supports many of the same operations as a normal database table. (But with out going so far as to be exactly like a database table). - jeremy On Wed, Mar 3, 2010 at 8:23 AM, Milan Straka wrote:

...

Hi all,

I have started an internship in Cambridge and will spend 3 months on Haskell. One of the possible projects is improving the containers library. Things I am thinking about: - measuring the performance of existing implementations (notably Map and Set) and possibly improve them (either without or with API changes), - adding Queue and a PriorityQueue, - maybe adding a generalized trie, - maybe adding a hashtable (more like a trie of the hashes, something in the line of Ideal hash trees; and maybe a 'classic' hash modifiable in the ST monad?)

I would be grateful for any comments, thoughts or wishes.

Milan Straka _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

5605

Age (days ago)

5610

Last active (days ago)

List overview

Download

18 comments

10 participants

participants (10)

Axel Simon
Bertram Felgenhauer
Edward Kmett
Heinrich Apfelmus
Jeremy Shaw
Johan Tibell
John Meacham
Jonathan Cast
Milan Straka
Ross Paterson