Proposal: containers: add indexing operations for Set

newer
Proposal: Control.Concurrent.Async

Patrick Palka

3 Jul 2012 3 Jul '12

7:10 p.m.

I'd like to add the following indexing operations for Data.Set.Set, to complement the existing functions for Data.Map.Map: findIndex :: Ord a => a -> Set a -> Int lookupIndex :: Ord a => a -> Set a -> Maybe Int elemAt :: Int -> Set a -> a deleteAt :: Int -> Set a -> Set a A tentative patch can be found here https://github.com/haskell/containers/pull/13 Discussion period: 2 weeks

Show replies by date

Milan Straka

3 Jul 3 Jul

7:16 p.m.

Hi,

...

I'd like to add the following indexing operations for Data.Set.Set, to complement the existing functions for Data.Map.Map:

findIndex :: Ord a => a -> Set a -> Int lookupIndex :: Ord a => a -> Set a -> Maybe Int elemAt :: Int -> Set a -> a deleteAt :: Int -> Set a -> Set a

+1 Milan

Johan Tibell

7:30 p.m.

Hi, On Tue, Jul 3, 2012 at 3:10 PM, Patrick Palka wrote:

...

I'd like to add the following indexing operations for Data.Set.Set, to complement the existing functions for Data.Map.Map:

findIndex :: Ord a => a -> Set a -> Int lookupIndex :: Ord a => a -> Set a -> Maybe Int elemAt :: Int -> Set a -> a deleteAt :: Int -> Set a -> Set a

A tentative patch can be found here https://github.com/haskell/containers/pull/13

I'm fine with this API addition, as it's consistent with what we already do for (ordered) maps. I haven't reviewed the actual patch yet. Cheers, Johan

Edward Kmett

4 Jul 4 Jul

12:12 a.m.

+1 On Tue, Jul 3, 2012 at 3:10 PM, Patrick Palka wrote:

...

I'd like to add the following indexing operations for Data.Set.Set, to complement the existing functions for Data.Map.Map:

findIndex :: Ord a => a -> Set a -> Int lookupIndex :: Ord a => a -> Set a -> Maybe Int elemAt :: Int -> Set a -> a deleteAt :: Int -> Set a -> Set a

A tentative patch can be found here https://github.com/haskell/** containers/pull/13 https://github.com/haskell/containers/pull/13

Discussion period: 2 weeks

______________________________**_________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/**mailman/listinfo/libraries http://www.haskell.org/mailman/listinfo/libraries

Jon Fairbairn

8:39 a.m.

Patrick Palka writes:

...

I'd like to add the following indexing operations for Data.Set.Set, to complement the existing functions for Data.Map.Map:

findIndex :: Ord a => a -> Set a -> Int lookupIndex :: Ord a => a -> Set a -> Maybe Int elemAt :: Int -> Set a -> a deleteAt :: Int -> Set a -> Set a

These seem to be abstraction-breaking. And what do we expect from let i = findIndex a s1 in deleteAt i (s1 `union` s2) and similar? -- Jón Fairbairn Jon.Fairbairn@cl.cam.ac.uk

Milan Straka

6:31 p.m.

...

...
I'd like to add the following indexing operations for Data.Set.Set, to complement the existing functions for Data.Map.Map:

findIndex :: Ord a => a -> Set a -> Int lookupIndex :: Ord a => a -> Set a -> Maybe Int elemAt :: Int -> Set a -> a deleteAt :: Int -> Set a -> Set a

These seem to be abstraction-breaking.

There is a natural bijection between the elements of an ordered set and their indexes in this ordering, and the new API works with the elements using the indexes instead of the elements themselves. I believe the set abstraction is not compromised by that.

...

And what do we expect from

let i = findIndex a s1 in deleteAt i (s1 `union` s2)

It removes the i-th least element of (s1 `union` s2). Which element is that is hard to say, depending on s1 and s2. Cheers, Milan

Thomas Schilling

9:02 a.m.

On 3 July 2012 20:10, Patrick Palka wrote:

...

I'd like to add the following indexing operations for Data.Set.Set, to complement the existing functions for Data.Map.Map:

findIndex :: Ord a => a -> Set a -> Int lookupIndex :: Ord a => a -> Set a -> Maybe Int elemAt :: Int -> Set a -> a deleteAt :: Int -> Set a -> Set a

I'm surprised that these functions already exists in Data.Map. What are the use cases? I assume the only sensible semantics is to behave as a slightly more efficient version of the same function working on "Set.toList set" which must be ordered. I don't think that breaks any abstractions, but they do seem like very weird functions to have. Something that I might consider potentially useful function would be a variant of splitAt, so that you could split a set using a fixed ratio (e.g., "splitAt (Set.size set `div` 2) set"). So far, -1 / Thomas

Henning Thielemann

9:26 a.m.

On Wed, 4 Jul 2012, Thomas Schilling wrote:

...

I'm surprised that these functions already exists in Data.Map. What are the use cases?

These index functions should certainly be part of an OrderedMap AP. But since Map functions have "Ord" constraints and not "MapDomain", the abstraction is already broken by the way how Data.Map is designed today. Anyway, I found the index functions useful when working with wxHaskell and Notebook pages. WX manages these pages by numbers, but I want to add and remove pages dynamically while keeping the pages sorted with respect to their titles.

Patrick Palka

1:54 p.m.

On 7/4/2012 5:02 AM, Thomas Schilling wrote:

...

I'm surprised that these functions already exists in Data.Map. What are the use cases?

One can use the indexing functions to retrieve a random element from the container, or to random-shuffle a container in n log n time.

Thomas Schilling

2:44 p.m.

On 4 July 2012 14:54, Patrick Palka wrote:

...

On 7/4/2012 5:02 AM, Thomas Schilling wrote:

...
I'm surprised that these functions already exists in Data.Map. What are the use cases?

One can use the indexing functions to retrieve a random element from the container, or to random-shuffle a container in n log n time.

OK, that seems like a reasonable use case. The existing documentation is lacking, though. For example, the above use case requires that: 0 <= index < Map.size map It's also not specified that the index refers to the index in the sorted sequence. Of course, that problem here is that "Ord" -- which is really an implementation detail -- becomes part of the specified properties of a Map/Set. E.g., the same function for HashMap probably wouldn't make that guarantee. I.e., what are the properties of these functions? Do we have: - findIndex a set `compare` findIndex b set ==> a `compare` b - findIndex a set == findIndex a (Set.toList set) - ... Hoogle is down (along with haskell.org), ATM, so I can't look up the proper list functions.

Milan Straka

5 Jul 5 Jul

12:01 p.m.

...

OK, that seems like a reasonable use case.

The existing documentation is lacking, though. For example, the above use case requires that: 0 <= index < Map.size map

Yes, that is the case. The documentation should definitely improve.

...

It's also not specified that the index refers to the index in the sorted sequence. Of course, that problem here is that "Ord" -- which is really an implementation detail -- becomes part of the specified properties of a Map/Set. E.g., the same function for HashMap probably wouldn't make that guarantee.

Agreed. These indexing methods work only for OrderededSet and OrderedMap, not for HashMap and HashSet. If the Data.Set and Data.Map modules were called Data.OrdSet and Data.OrdMap, then there would be no problem including the indexing API in them. Personally I consider Data.Set to be Data.OrdSet, not a generic Set API. If we every create a Set class, the indexing API will definitely _not_ be part of it. BTW, the indexing API is currently _not_ part of IntSet and IntMap. The reason is that the data structure does not allow efficient implementation of the indexing API.

...

I.e., what are the properties of these functions? Do we have:

- findIndex a set `compare` findIndex b set ==> a `compare` b

it is even true that - findIndex a set `compare` findIndex b set <==> a `compare` b because (flip findIndex set) is a bijection respecting ordering.

...

- findIndex a set == findIndex a (Set.toList set)

This one holds too. It could be considered a definition of Set.findIndex, but the Set.findIndex is more efficient than the list version. Cheers, Milan

Thomas Schilling

12:10 p.m.

Right, given that Data.Set and Data.Map are de facto implementing ordered sets/maps, I'm fine with adding the proposed functions. The documentation should be improved, though. On 5 July 2012 13:01, Milan Straka wrote:

...

...
I.e., what are the properties of these functions? Do we have:

- findIndex a set `compare` findIndex b set ==> a `compare` b

it is even true that

- findIndex a set `compare` findIndex b set <==> a `compare` b

because (flip findIndex set) is a bijection respecting ordering.

...
- findIndex a set == findIndex a (Set.toList set)

This one holds too. It could be considered a definition of Set.findIndex, but the Set.findIndex is more efficient than the list version.

Sure. But I believe it's good to have a simple and precise semantic model.

Johan Tibell

2:18 p.m.

On Thu, Jul 5, 2012 at 8:01 AM, Milan Straka wrote:

...

Agreed. These indexing methods work only for OrderededSet and OrderedMap, not for HashMap and HashSet.

If the Data.Set and Data.Map modules were called Data.OrdSet and Data.OrdMap, then there would be no problem including the indexing API in them. Personally I consider Data.Set to be Data.OrdSet, not a generic Set API. If we every create a Set class, the indexing API will definitely _not_ be part of it.

I've also considered Map and Set to be OrdMap and OrdSet for quite a while. If we ever add a type class to abstract over e.g. HashMap and Map we'd probably call it Data.Map.Class or something. A potential downside to adding these functions is that I don't know if we can support them efficiently with a different implementation of ordered trees than the size-balanced ones. I still think we should add them though. -- Johan

Milan Straka

3:01 p.m.

...

...
Agreed. These indexing methods work only for OrderededSet and OrderedMap, not for HashMap and HashSet.

If the Data.Set and Data.Map modules were called Data.OrdSet and Data.OrdMap, then there would be no problem including the indexing API in them. Personally I consider Data.Set to be Data.OrdSet, not a generic Set API. If we every create a Set class, the indexing API will definitely _not_ be part of it.

I've also considered Map and Set to be OrdMap and OrdSet for quite a while. If we ever add a type class to abstract over e.g. HashMap and Map we'd probably call it Data.Map.Class or something.

A potential downside to adding these functions is that I don't know if we can support them efficiently with a different implementation of ordered trees than the size-balanced ones. I still think we should add them though.

Any balanced tree can support indexing API, as long as every subtree stores its size. That is obviously trivially true for size-balanced trees, as the subtree sizes are used for balancing. If for example AVL trees were used, sizes would have to be stored explicitly in every node for indexing API to work fast. The complexity of 'size' also depends on whether sizes are stored explicitly in every node -- Data.Map.size is constant-time operation. Consider IntMap -- there sizes are not stored in the nodes, so we have no indexing API and 'size' is linear-time operation. BTW, when I last benchmarked it, adding size to every node of a red-black tree increased the time complexity of insert, lookup and delete by 0-5%. Nevertheless, I also think we should add indexing API to Data.Set. Cheers, Milan

Johan Tibell

9 Jul 9 Jul

5:59 p.m.

On Thu, Jul 5, 2012 at 8:01 AM, Milan Straka wrote:

...

Any balanced tree can support indexing API, as long as every subtree stores its size. That is obviously trivially true for size-balanced trees, as the subtree sizes are used for balancing. If for example AVL trees were used, sizes would have to be stored explicitly in every node for indexing API to work fast.

There might be some other benefits to adding the size as well. IIRC union is faster if the first set is smaller than the second set. If we can look up the sizes in O(1) time we can rearrange the arguments to make sure this is the case. -- Johan

Evan Laforge

10 Jul 10 Jul

12:46 a.m.

On Tue, Jul 10, 2012 at 1:59 AM, Johan Tibell wrote:

...

On Thu, Jul 5, 2012 at 8:01 AM, Milan Straka wrote:

...
Any balanced tree can support indexing API, as long as every subtree stores its size. That is obviously trivially true for size-balanced trees, as the subtree sizes are used for balancing. If for example AVL trees were used, sizes would have to be stored explicitly in every node for indexing API to work fast.

There might be some other benefits to adding the size as well. IIRC union is faster if the first set is smaller than the second set. If we can look up the sizes in O(1) time we can rearrange the arguments to make sure this is the case.

I've always wondered why Data.Map doesn't do this by default. E.g. I have: -- | Map.union says it's more efficient with @big `union` small@, so this one -- flips the args to be more efficient. It's still left-biased. union2 :: (Ord k) => Map.Map k v -> Map.Map k v -> Map.Map k v union2 m1 m2 | Map.size m1 < Map.size m2 = m2 `right` m1 | otherwise = m1 `Map.union` m2 where right = Map.unionWith (\_ a -> a)

Milan Straka

3:31 p.m.

Hi,

...

On Tue, Jul 10, 2012 at 1:59 AM, Johan Tibell wrote:

...
On Thu, Jul 5, 2012 at 8:01 AM, Milan Straka wrote:

...
Any balanced tree can support indexing API, as long as every subtree stores its size. That is obviously trivially true for size-balanced trees, as the subtree sizes are used for balancing. If for example AVL trees were used, sizes would have to be stored explicitly in every node for indexing API to work fast.

There might be some other benefits to adding the size as well. IIRC union is faster if the first set is smaller than the second set. If we can look up the sizes in O(1) time we can rearrange the arguments to make sure this is the case.

I've always wondered why Data.Map doesn't do this by default. E.g. I have:

-- | Map.union says it's more efficient with @big `union` small@, so this one -- flips the args to be more efficient. It's still left-biased. union2 :: (Ord k) => Map.Map k v -> Map.Map k v -> Map.Map k v union2 m1 m2 | Map.size m1 < Map.size m2 = m2 `right` m1 | otherwise = m1 `Map.union` m2 where right = Map.unionWith (\_ a -> a)

there are several problems with this. Most notably, it is not true that "set1 < set2" is always better than "set1 > set2" -- if you run the containers/benchmarks/SetOperations/bench-SetOperations-Map benchmark, you can see that for different input distributions, "set1 < set2" is sometimes faster and sometimes slower than "set1 > set2". For various input distributions, difference between "set1 < set2" and "set1 > set2" are -7.5%, -25%, -35%, -17%, +20%, +8%, -13%. Also union is more efficient than (unionWith const), because is preserves sharing of the nodes present in both sets. That could be solved by creating another implementation of right-biased unionR which also preserves sharing. Probably the best idea is just to remove the documentation comment about "bigset `union` smallset" being faster, as it is not true :) Cheers, Milan

Evan Laforge

3:35 p.m.

...

Probably the best idea is just to remove the documentation comment about "bigset `union` smallset" being faster, as it is not true :)

Nice, I'll go ahead and remove that function from my library, thanks for the clarification. And I agree with removing it from the documentation if it's not true!

Milan Straka

4 Jul 4 Jul

6:04 p.m.

...

...
I'd like to add the following indexing operations for Data.Set.Set, to complement the existing functions for Data.Map.Map:

findIndex :: Ord a => a -> Set a -> Int lookupIndex :: Ord a => a -> Set a -> Maybe Int elemAt :: Int -> Set a -> a deleteAt :: Int -> Set a -> Set a

I'm surprised that these functions already exists in Data.Map. What are the use cases?

For an ordered Set, it makes sense to be able to work with the elements using their ordering -- i.e., to work with "i-th least element of s", i.e. the element toAscList s !! i. If you have for example a set of words of some language, you might want to work with those not explicitly, but using "i-th word of the language". Of you might want to find the median of the elements in the set.

...

I assume the only sensible semantics is to behave as a slightly more efficient version of the same function working on "Set.toList set" which must be ordered.

Yes. But the "slightly" means exponentially -- getting the median of the set takes linear time using Set.toList, but only logarithmic time using Set.elemAt.

...

I don't think that breaks any abstractions, but they do seem like very weird functions to have. Something that I might consider potentially useful function would be a variant of splitAt, so that you could split a set using a fixed ratio (e.g., "splitAt (Set.size set `div` 2) set").

splitAt index set = split (elemAt index set) set would work reasonably well. It traverses the tree twice, but the first traversal is very fast. Cheers, Milan

4742

Age (days ago)

4749

Last active (days ago)

List overview

Download

18 comments

8 participants

participants (8)

Edward Kmett
Evan Laforge
Henning Thielemann
Johan Tibell
Jon Fairbairn
Milan Straka
Patrick Palka
Thomas Schilling