Proposal: IntMap.differenceKeysSet for removing an IntSet of keys

Ticket: http://hackage.haskell.org/trac/ghc/ticket/5242 Currently, IntMap.difference ma mb removes all the keys in mb from ma, where the elements of the two IntMaps can be of different types; the elements of mb are not used. There is no efficient way to remove an IntSet of keys, however. These patches adds the IntMap.differenceKeysSet function—essentially a copy/paste of difference—that satisfies the following property: prop_DiffKeysSet :: Map Int -> Map () -> Bool prop_DiffKeysSet t1 t2 = difference t1 t2 == differenceKeysSet t1 (keysSet t2) Cons: Not so happy with the name. Code bloat. Thanks, /Liyang

If we had something like fromIntSet :: IntSet -> IntMap () then at least we could have a RULE to convert "difference m (fromIntSet s)" to "differenceIntSet m s". Then we could just hide differenceIntSet and the name wouldn't matter. Perhaps this "fromIntSet" function should be proposed as well? Cheers, -- Felipe.

On 10 June 2011 10:09, Felipe Almeida Lessa
If we had something like fromIntSet :: IntSet -> IntMap () then at least we could have a RULE to convert "difference m (fromIntSet s)" to "differenceIntSet m s". Then we could just hide differenceIntSet and the name wouldn't matter. Perhaps this "fromIntSet" function should be proposed as well?
Sure, that sounds quite reasonable. Have done that as patch 0005 on the ticket. I've generalised it a bit to: fromSet :: (Key -> a) -> IntSet -> IntMap a Which version does everyone prefer? differenceKeysSet m s, or difference m (fromSet f s) ? Cheers, /Liyang

On 6/9/11 10:17 PM, Liyang HU wrote:
On 10 June 2011 10:09, Felipe Almeida Lessa
wrote: If we had something like fromIntSet :: IntSet -> IntMap () then at least we could have a RULE to convert "difference m (fromIntSet s)" to "differenceIntSet m s". Then we could just hide differenceIntSet and the name wouldn't matter. Perhaps this "fromIntSet" function should be proposed as well?
Sure, that sounds quite reasonable. Have done that as patch 0005 on the ticket. I've generalised it a bit to:
fromSet :: (Key -> a) -> IntSet -> IntMap a
Which version does everyone prefer? differenceKeysSet m s, or difference m (fromSet f s) ?
One downside of using fromSet is that it clones the IntSet before differencing. Since IntSet is spine-strict, this can be quite expensive since it'll clone the whole thing no matter how much is used. I like the idea of having a fromSet function with the (Key->a) argument. But there's also room for differenceKeysSet for efficiency reasons. Perhaps you can come up with some fusion rules to get rid of the overhead of difference m (fromSet f s)? Certainly there should be the rule: map f . fromSet g = fromSet (f . g) and similar for the other mapping functions. -- Live well, ~wren

wren ng thornton
On 6/9/11 10:17 PM, Liyang HU wrote:
fromSet :: (Key -> a) -> IntSet -> IntMap a One downside of using fromSet is that it clones the IntSet before differencing. Since IntSet is spine-strict, this can be quite expensive since it'll clone the whole thing no matter how much is used.
I like the idea of having a fromSet function with the (Key->a) argument. But there's also room for differenceKeysSet for efficiency reasons. Perhaps you can come up with some fusion rules to get rid of the overhead of difference m (fromSet f s)?
There's precisely this rule in patch 0005 on the ticket.
Certainly there should be the rule: map f . fromSet g = fromSet (f . g) and similar for the other mapping functions.
I have a bunch of similar RULES in my own code already… (map, filter &c.) Should probably file them under a separate ticket. Cheers, /Liyang

Hi,
Ticket: http://hackage.haskell.org/trac/ghc/ticket/5242
Currently, IntMap.difference ma mb removes all the keys in mb from ma, where the elements of the two IntMaps can be of different types; the elements of mb are not used.
There is no efficient way to remove an IntSet of keys, however. These patches adds the IntMap.differenceKeysSet function—essentially a copy/paste of difference—that satisfies the following property:
prop_DiffKeysSet :: Map Int -> Map () -> Bool prop_DiffKeysSet t1 t2 = difference t1 t2 == differenceKeysSet t1 (keysSet t2)
Cons: Not so happy with the name. Code bloat.
another problem is that more methods could be overloaded for IntSets -- also union and intersect. Adding the variant only for difference seems quite arbitrary, but adding all is bloating the API. I liked the second idea with the fromSet :: (Key -> a) -> IntSet.IntSet -> IntMap a but I would not add the rewriting rules. The intermediate IntMap has to be constructed, but I think it is reasonable price for doing cross-structure operations and not extending the API. Instead we could make fromSet more efficient -- if the IntMap could access the internal representation of IntSet, the can implement fromSet just by adding values to each node. The strictness annotations are a bit harmful here -- if they were not there, only the parts of the structure needed for the following difference (union, intersect) would be constructed. BTW, seeing the IntSet representation in IntMap would allow us to define keysSet more efficient too. Currently if you want to do IntSet \\ IntMap, you can use IntSet.difference set (IntMap.keysSet map), with fromSet we could do IntMap \\ IntSet by IntMap.difference map (IntMap.fromSet (const ()) set), which seems nicely analogous. Cheers, Milan

Milan Straka
Ticket: http://hackage.haskell.org/trac/ghc/ticket/5242 another problem is that more methods could be overloaded for IntSets -- also union and intersect. Adding the variant only for difference seems quite arbitrary, but adding all is bloating the API.
For intersection, sure. I can see plausible use-cases for this too. Patch 0007. For union, where would it get a value from, if a key was found in the set but not the map? Take a function to generate the value from the key like fromSet? Given union is left-biased in its effect but not in its type, I'm going to need separate versions to handle the two ways a map and a set can be combined. Not sure I want to go down this route, especially as I can't think of a good use-case.
I liked the second idea with the fromSet :: (Key -> a) -> IntSet.IntSet -> IntMap a but I would not add the rewriting rules. The intermediate IntMap has to be constructed, but I think it is reasonable price for doing cross-structure operations and not extending the API.
I don't understand your reasoning behind not adding the rewrite rules. Simply don't export differenceKeysSet, as patch 0005 does.
Instead we could make fromSet more efficient -- if the IntMap could access the internal representation of IntSet, the can implement fromSet just by adding values to each node.
Right. I factored "data IntSet" out to a separate module so that this was possible. This takes its time complexity from O(n*min(n,W)) down to O(n), I believe.
The strictness annotations are a bit harmful here -- if they were not there, only the parts of the structure needed for the following difference (union, intersect) would be constructed. BTW, seeing the IntSet representation in IntMap would allow us to define keysSet more efficient too.
Done both! Patch 0006. Cheers, /Liyang

Milan Straka
writes: Ticket: http://hackage.haskell.org/trac/ghc/ticket/5242 another problem is that more methods could be overloaded for IntSets -- also union and intersect. Adding the variant only for difference seems quite arbitrary, but adding all is bloating the API.
For intersection, sure. I can see plausible use-cases for this too. Patch 0007.
For union, where would it get a value from, if a key was found in the set but not the map? Take a function to generate the value from the key like fromSet? Given union is left-biased in its effect but not in its type, I'm going to need separate versions to handle the two ways a map and a set can be combined. Not sure I want to go down this route, especially as I can't think of a good use-case.
My point was, that there are many combinations you can think of -- IntMap minus IntSet, IntMap minus Map Int, IntMap minus list of Ints, IntMap union Map Int, IntMap intersection IntSet, IntMap intersection Set Int and so on. With fromSet and fromDistinctAscList, you can handle all of these easily, although with some time penalty. For example the union case, union map (fromSet (const 1) set) seems fine.
I liked the second idea with the fromSet :: (Key -> a) -> IntSet.IntSet -> IntMap a but I would not add the rewriting rules. The intermediate IntMap has to be constructed, but I think it is reasonable price for doing cross-structure operations and not extending the API.
I don't understand your reasoning behind not adding the rewrite rules. Simply don't export differenceKeysSet, as patch 0005 does.
(Johan, please read this too.) I am quite unhappy about exporting differenceKeysSet and other hybrid functions working on different data structures. But I do not like depending on rules alone either. If we promise that difference map (fromSet set) works without creating intermediate map, then it - works only for GHC - is nontrivial to use -- which functions do really fuse with fromSet? My biggest problem is "Does difference with IntSet deserves to have a special function implementing it?" I do not believe so. There are many functions that could fuse with fromSet, you could even implement better implementations of e.g. union with Map Int. But how do we measure which combinations are worth it? For me, right now no functions working with different data structures are worth it. The API is complicated as it is and I do not feel adding cross-structure functions is a good thing to do. I am happy with being able to convert between the structures efficiently. Johan, what is your opinion?
Instead we could make fromSet more efficient -- if the IntMap could access the internal representation of IntSet, the can implement fromSet just by adding values to each node.
Right. I factored "data IntSet" out to a separate module so that this was possible. This takes its time complexity from O(n*min(n,W)) down to O(n), I believe.
The strictness annotations are a bit harmful here -- if they were not there, only the parts of the structure needed for the following difference (union, intersect) would be constructed. BTW, seeing the IntSet representation in IntMap would allow us to define keysSet more efficient too.
Done both! Patch 0006.
Thanks for all the pathes! The patch 0006 is unfortunately missing the new module and also the changes to IntSet. Cheers, Milan

Greetings!
On 13 June 2011 21:40, Milan Straka
My point was, that there are many combinations you can think of -- IntMap minus IntSet, IntMap minus Map Int, IntMap minus list of Ints, IntMap union Map Int, IntMap intersection IntSet, IntMap intersection Set Int and so on.
Actually, it's just IntMap and IntSet, since they share the same underlying patricia tree implementation.
I am quite unhappy about exporting differenceKeysSet and other hybrid functions working on different data structures.
Agreed. I don't think we should export differenceKeysSet either.
But I do not like depending on rules alone either. If we promise that difference map (fromSet set) works without creating intermediate map, then it - works only for GHC - is nontrivial to use -- which functions do really fuse with fromSet?
I'm not promising anything performance-wise… it still produces the same result without the rules, just that an intermediate IntMap will be produced and consumed. With the rules, it's the same O(m+n) time complexity, but with a much smaller constant. How is this any more controversial than list fusion?
My biggest problem is "Does difference with IntSet deserves to have a special function implementing it?" I do not believe so.
The point is that IntSet and IntMap are identical internally.
For me, right now no functions working with different data structures are worth it. The API is complicated as it is and I do not feel adding cross-structure functions is a good thing to do. I am happy with being able to convert between the structures efficiently.
For the record, the only API extension I'm proposing is: fromSet :: (Key -> a) -> IntSet -> IntMap a Nothing else changes, as far as users are concerned, except that intersection m (fromSet f s) and difference m (fromSet g s) will perform equally well as intersection/difference between pairs of maps.
Thanks for all the pathes! The patch 0006 is unfortunately missing the new module and also the changes to IntSet.
Those are in patch 0001. Cheers, /Liyang

Hello, thanks for patience with me. <snip>
My point was, that there are many combinations you can think of -- IntMap minus IntSet, IntMap minus Map Int, IntMap minus list of Ints, IntMap union Map Int, IntMap intersection IntSet, IntMap intersection Set Int and so on.
Actually, it's just IntMap and IntSet, since they share the same underlying patricia tree implementation. <snip>
My biggest problem is "Does difference with IntSet deserves to have a special function implementing it?" I do not believe so.
The point is that IntSet and IntMap are identical internally.
The same implementation means we can implement those. But does it also imply a reason for doing so? I am not sure we are doing it just because we can or because it is useful. Does some other people think it is worth having intersection m (fromSet f s) and difference m (fromSet g s) perform equally well as intersection/difference between pairs of maps, instead of a penalty for converting the set to the map?
But I do not like depending on rules alone either. If we promise that difference map (fromSet set) works without creating intermediate map, then it - works only for GHC - is nontrivial to use -- which functions do really fuse with fromSet?
I'm not promising anything performance-wise… it still produces the same result without the rules, just that an intermediate IntMap will be produced and consumed. With the rules, it's the same O(m+n) time complexity, but with a much smaller constant.
How is this any more controversial than list fusion?
Hm, you are right. I was still thinking about deterministically calling differenceKeysSet instead of just invisible optimization.
For me, right now no functions working with different data structures are worth it. The API is complicated as it is and I do not feel adding cross-structure functions is a good thing to do. I am happy with being able to convert between the structures efficiently.
For the record, the only API extension I'm proposing is:
fromSet :: (Key -> a) -> IntSet -> IntMap a
and I am happy with this one.
Nothing else changes, as far as users are concerned, except that
intersection m (fromSet f s) and difference m (fromSet g s)
will perform equally well as intersection/difference between pairs of maps.
The last thing I am worried about is whether the new duplicated code (intersectionKeysSet and differenceKeysSet) is worth the speedup. I will wait for some other opinions on this.
Thanks for all the pathes! The patch 0006 is unfortunately missing the new module and also the changes to IntSet.
Those are in patch 0001.
Oh, I am sorry I did not read the first one. Also, the IntMap is considered to be a replacement of Map -- I would like to add the fromSet method to the Map module too. But I am not sure I want to add the specialization of intersectionKeysSet and differenceKeysSet, at least not now -- I want to tune the representation of Map and Set and less code I need to change the better. Cheers, Milan

Hi,
On 13 June 2011 21:40, Milan Straka
wrote: My point was, that there are many combinations you can think of -- IntMap minus IntSet, IntMap minus Map Int, IntMap minus list of Ints, IntMap union Map Int, IntMap intersection IntSet, IntMap intersection Set Int and so on.
Actually, it's just IntMap and IntSet, since they share the same underlying patricia tree implementation.
I am quite unhappy about exporting differenceKeysSet and other hybrid functions working on different data structures.
Agreed. I don't think we should export differenceKeysSet either.
But I do not like depending on rules alone either. If we promise that difference map (fromSet set) works without creating intermediate map, then it - works only for GHC - is nontrivial to use -- which functions do really fuse with fromSet?
I'm not promising anything performance-wise… it still produces the same result without the rules, just that an intermediate IntMap will be produced and consumed. With the rules, it's the same O(m+n) time complexity, but with a much smaller constant.
How is this any more controversial than list fusion?
My biggest problem is "Does difference with IntSet deserves to have a special function implementing it?" I do not believe so.
The point is that IntSet and IntMap are identical internally.
For me, right now no functions working with different data structures are worth it. The API is complicated as it is and I do not feel adding cross-structure functions is a good thing to do. I am happy with being able to convert between the structures efficiently.
For the record, the only API extension I'm proposing is:
fromSet :: (Key -> a) -> IntSet -> IntMap a
Nothing else changes, as far as users are concerned, except that
intersection m (fromSet f s) and difference m (fromSet g s)
will perform equally well as intersection/difference between pairs of maps.
Thanks for all the pathes! The patch 0006 is unfortunately missing the new module and also the changes to IntSet.
Those are in patch 0001.
Sorry for the long inactivity. Now we have a repo on github, so I am returning to this. I want to add fromSet to Data.Map and Data.IntMap. For the difference specialization, could you please benchmark what is the difference of having differenceKeysSet versus not having and performing difference + fromSet? We should verify that there is reasonable benefit in adding differenceKeysSet. I am now leaving on vacation and won't be reading mail till Mon 17. Cheers, Milan

Hi,
Sorry for the late reply. I've been on vacation and had some catching
up to do the last few days.
On Mon, Jun 13, 2011 at 2:40 PM, Milan Straka
My point was, that there are many combinations you can think of -- IntMap minus IntSet, IntMap minus Map Int, IntMap minus list of Ints, IntMap union Map Int, IntMap intersection IntSet, IntMap intersection Set Int and so on.
With fromSet and fromDistinctAscList, you can handle all of these easily, although with some time penalty. For example the union case, union map (fromSet (const 1) set) seems fine.
The containers library doesn't separate algorithms from data structures, like e.g. STL does. As I wrote in another email it would be nice if we could separate traversal of elements in containers from the algorithms working on the elements (like difference). Unfortunately I don't know how to do this.
I am quite unhappy about exporting differenceKeysSet and other hybrid functions working on different data structures.
But I do not like depending on rules alone either. If we promise that difference map (fromSet set) works without creating intermediate map, then it - works only for GHC - is nontrivial to use -- which functions do really fuse with fromSet?
My biggest problem is "Does difference with IntSet deserves to have a special function implementing it?" I do not believe so. There are many functions that could fuse with fromSet, you could even implement better implementations of e.g. union with Map Int. But how do we measure which combinations are worth it?
We should measure the time difference, if it's small I don't think it's worth it.
For me, right now no functions working with different data structures are worth it. The API is complicated as it is and I do not feel adding cross-structure functions is a good thing to do. I am happy with being able to convert between the structures efficiently.
Johan, what is your opinion?
I'm off the same mind. On a more general note: The containers library has several problems: * Lack of abstraction You cannot write a function that works on different map implementations. You cannot use a Map in a function that expects a Set even though Map could have a O(1) conversion to a set view (most OO languages allow this) We cannot easily change the implementation. Data.Map requires the implementation to be a weight balanced tree due to the presences of functions that work on indexes. If we manage to write a faster map using AVL trees we cannot use it to improve the performance of Data.Map, even though a tiny minority of the users use the index-based functions. * Code duplication There's lots of code duplication in the implementation. I don't know how many times we write the same traversal just to do something slightly different at the lead position. There are algorithms that should work cross data structure (like difference) but doesn't. Unfortunately I don't know how to address either group of problems well. Perhaps with associated data types we can finally write a type class for Map and Set that works well. If we could get foldr/build fusion to fuse left folds we could use lists to implement e.g. difference efficiently by doing: difference s1 s2 = fromAscList . union (toAscList s1) (toAscList s2) which would fused to a "loop" that traverses both sets at the same time, building a new set as we go. Cheers, Johan
participants (5)
-
Felipe Almeida Lessa
-
Johan Tibell
-
Liyang HU
-
Milan Straka
-
wren ng thornton