Proposal: more general unionWith for Data.Map

2012/1/24 Johan Tibell
On Tue, Jan 24, 2012 at 9:35 AM, Christian Sattler
wrote: There are some high-level operations on maps which take two tree traversals only because the interface fails to expose sufficiently general functions. This proposal is concerned with an analogue of unionWithKey of type Ord k => (k -> a -> a -> Maybe a) -> Map k a -> Map k a -> Map k a, with the intended semantics that if a key is present in both maps and the operation applied to the key and corresponding values returns Nothing, the key is deleted from the result.
Is union really an appropriate name here? I expect the following to hold:
forall k ∊ (keys m1) ∪ (keys m2) => k ∊ (m1 ∪ m2)
This means that keys must not be deleted by the union operator. Perhaps 'merge' is a better name.
Cheers, Johan
I don't care much about the naming, but note that the analogous property already fails for the generalized intersectionWithKey in the development version.

On Tue, Jan 24, 2012 at 11:42 AM, Christian Sattler
I don't care much about the naming, but note that the analogous property already fails for the generalized intersectionWithKey in the development version.
I suspected as much. I think the current intersectionWithKey is broken. We should have a single function, mergeWithKey, that allows people to do the things they currently do using Maybe return values in e.g. intersectionWithKey. Cheers, Johan

Hi Johan,
On Tue, Jan 24, 2012 at 11:42 AM, Christian Sattler
wrote: I don't care much about the naming, but note that the analogous property already fails for the generalized intersectionWithKey in the development version.
I suspected as much. I think the current intersectionWithKey is broken. We should have a single function, mergeWithKey, that allows people to do the things they currently do using Maybe return values in e.g. intersectionWithKey.
Funnily, I just also used mergeWithKey for the "ultimate combining function" in this thread :) FYI, the new intersectionWithKey is not released yet -- maybe we could leave intersectionWithKey as it is, and provide mergeWith[Key] instead. Opinions? Cheers, Milan

Hi,
On Tue, Jan 24, 2012 at 11:42 AM, Christian Sattler
wrote: I don't care much about the naming, but note that the analogous property already fails for the generalized intersectionWithKey in the development version.
I suspected as much. I think the current intersectionWithKey is broken. We should have a single function, mergeWithKey, that allows people to do the things they currently do using Maybe return values in e.g. intersectionWithKey.
Funnily, I just also used mergeWithKey for the "ultimate combining function" in this thread :)
FYI, the new intersectionWithKey is not released yet -- maybe we could leave intersectionWithKey as it is, and provide mergeWith[Key] instead. Opinions?
Oh, Christian just noted that the performance of mergeWithKey :: Ord k => (k -> Maybe a -> Maybe b -> Maybe c) -> Map k a -> Map k b -> Map k c is too big -- it is at least O(size_of_first_op + size_of_second_op). But intersectionWithKey and unionWithKey is at most O(size_of_smaller * log(size_of_bigger)). So the proposal stands: should we add mergeWithKey :: Ord k => (k -> a -> a -> Maybe a) -> Map k a -> Map k a -> Map k a ? Cheers, Milan

On 1/24/12 3:39 PM, Milan Straka wrote:
Hi,
On Tue, Jan 24, 2012 at 11:42 AM, Christian Sattler
wrote: I don't care much about the naming, but note that the analogous property already fails for the generalized intersectionWithKey in the development version.
I suspected as much. I think the current intersectionWithKey is broken. We should have a single function, mergeWithKey, that allows people to do the things they currently do using Maybe return values in e.g. intersectionWithKey.
Funnily, I just also used mergeWithKey for the "ultimate combining function" in this thread :)
FYI, the new intersectionWithKey is not released yet -- maybe we could leave intersectionWithKey as it is, and provide mergeWith[Key] instead. Opinions?
Oh, Christian just noted that the performance of mergeWithKey :: Ord k => (k -> Maybe a -> Maybe b -> Maybe c) -> Map k a -> Map k b -> Map k c
If you're going to go for that level of generality, I'd suggest looking into my currently unpublished Data.Or[1] for combining the a and b arguments with the correct semantics. Naturally, this function will be less efficient for union-like operators since it requires traversing non-intersecting branches of the trees as well. [1] http://code.haskell.org/~wren/data-or/dist/doc/html/data-or/ -- Live well, ~wren

On Tue, Jan 24, 2012 at 11:58 AM, Milan Straka
FYI, the new intersectionWithKey is not released yet -- maybe we could leave intersectionWithKey as it is, and provide mergeWith[Key] instead. Opinions?
I think so. We should avoid the words union and intersection if they don't correspond to the mathematical idea, or we risk confusing people.

On Tue, Jan 24, 2012 at 11:58 AM, Milan Straka
wrote: FYI, the new intersectionWithKey is not released yet -- maybe we could leave intersectionWithKey as it is, and provide mergeWith[Key] instead. Opinions?
I think so. We should avoid the words union and intersection if they don't correspond to the mathematical idea, or we risk confusing people.
I think that with intersectionWithKey, we are still holding to the mathematical idea. Imagine I have a Map Hash List. When performing intersection of two such maps, it can happen that for one hash there are different values in the maps and so in the intersection the has is not present. Generally if the Map k a is representing a set using both k and a, the intersection on the same key can still be empty. But this is not so in union -- if key is any of maps, it should be in the result. So personally I would generalize the combining function of intersectionWith to (a -> b -> Maybe c), but not generalize the combining function of unionWith, and instead provide mergeWith with combining function (a -> a -> Maybe a). (Similarly for *Key variants). Cheers, Milan

On Tue, Jan 24, 2012 at 1:25 PM, Milan Straka
I think that with intersectionWithKey, we are still holding to the mathematical idea. Imagine I have a Map Hash List. When performing intersection of two such maps, it can happen that for one hash there are different values in the maps and so in the intersection the has is not present.
Generally if the Map k a is representing a set using both k and a, the intersection on the same key can still be empty.
But Map doesn't model a set using both k and a, if it did this would hold: length (insert 1 'a' (insert 1 'b') empty) == 2 i.e. we would have a multimap. -- Johan

On 1/24/12 2:52 PM, Johan Tibell wrote:
On Tue, Jan 24, 2012 at 11:42 AM, Christian Sattler
wrote: I don't care much about the naming, but note that the analogous property already fails for the generalized intersectionWithKey in the development version.
I suspected as much. I think the current intersectionWithKey is broken. We should have a single function, mergeWithKey, that allows people to do the things they currently do using Maybe return values in e.g. intersectionWithKey.
I'd just like to point out that bytestring-trie has some prior art using the name "merge"[1]. AKA, +1 for that name and variants thereof. [1] http://hackage.haskell.org/packages/archive/bytestring-trie/0.2.3/doc/html/D... -- Live well, ~wren

Is union really an appropriate name here? I expect the following to hold:
forall k ∊ (keys m1) ∪ (keys m2) => k ∊ (m1 ∪ m2)
This means that keys must not be deleted by the union operator. Perhaps 'merge' is a better name.
Cheers, Johan
I don't care much about the naming, but note that the analogous property already fails for the generalized intersectionWithKey in the development version.
It is true that forall k ∊ (keys m1) ∩ (keys m2) => k ∊ (m1 ∩ m2) fails, but this expression is not really correct for intersectionWithKey, as it intersects Map k a and Map k b, so Map k a ∩ Map k b is not well defined. Anyway, we could instead of proposed unionWithKey offer functions mergeWith :: Ord k => (Maybe a -> Maybe b -> Maybe c) -> Map k a -> Map k b -> Map k c mergeWithKey :: Ord k => (k -> Maybe a -> Maybe b -> Maybe c) -> Map k a -> Map k b -> Map k c The combining function is executed for every unique key from both maps. All functions unionWith, intersectionWith, differencseWith, proposed unionWith can be expressed using this function. But I am not sure about the performance (using so many Maybes). Cheers, Milan

Milan Straka wrote:
Anyway, we could instead of proposed unionWithKey offer functions mergeWith :: Ord k => (Maybe a -> Maybe b -> Maybe c) -> Map k a -> Map k b -> Map k c mergeWithKey :: Ord k => (k -> Maybe a -> Maybe b -> Maybe c) -> Map k a -> Map k b -> Map k c The combining function is executed for every unique key from both maps. All functions unionWith, intersectionWith, differencseWith, proposed unionWith can be expressed using this function.
But I am not sure about the performance (using so many Maybes).
There's a bigger problem with performance than this constant factor, namely that it's no longer possible to short-cut evaluation for subtrees of the map that are known to be disjoint. For example, empty `intersect` x currently can be computed in constant time, no matter what x is; this can not be done with `merge`. This reasoning justifies the existence of intersection, union and difference functions in Data.Map in addition to a merge function. Of course, the functions union, intersect and difference could be implemented as a single function that takes two boolean arguments to specify which of the disjoint parts to keep in the result. Bertram

On 1/25/12 10:45 AM, Bertram Felgenhauer wrote:
Milan Straka wrote:
Anyway, we could instead of proposed unionWithKey offer functions mergeWith :: Ord k => (Maybe a -> Maybe b -> Maybe c) -> Map k a -> Map k b -> Map k c mergeWithKey :: Ord k => (k -> Maybe a -> Maybe b -> Maybe c) -> Map k a -> Map k b -> Map k c The combining function is executed for every unique key from both maps. All functions unionWith, intersectionWith, differencseWith, proposed unionWith can be expressed using this function.
But I am not sure about the performance (using so many Maybes).
There's a bigger problem with performance than this constant factor, namely that it's no longer possible to short-cut evaluation for subtrees of the map that are known to be disjoint. For example, empty `intersect` x currently can be computed in constant time, no matter what x is; this can not be done with `merge`.
This reasoning justifies the existence of intersection, union and difference functions in Data.Map in addition to a merge function.
Of course, the functions union, intersect and difference could be implemented as a single function that takes two boolean arguments to specify which of the disjoint parts to keep in the result.
The greatest generality is obtained by starting from the natural representation of what's going on: data Or a b = Fst a | Both a b | Snd b Since we're always interested in (Or a b -> c) or (Or a b -> Maybe c) morphisms, we should use an Or-algebra rather than Or itself. Thus, data Alg a b c = Alg { or_fst :: a -> Maybe c , or_both :: a -> b -> Maybe c , or_snd :: b -> Maybe c } In order to avoid extraneous traversals, rather than using Haskell functions, we can define our own function type which allows case-matching to identify the trivial and vacuous functions into Maybe. data MaybeFun a b where Trivial :: MaybeFun a a Vacuous :: MaybeFun a b Function :: (a -> Maybe b) -> MaybeFun a b maybefun :: MaybeFun a b -> (a -> Maybe b) maybefun Trivial a = Just a maybefun Vacuous a = Nothing maybefun (Function f) a = f a data Alg a b c = Alg { or_fst :: !(MaybeFun a c) , or_both :: !(MaybeFun (a,b) c) , or_snd :: !(MaybeFun b c) } Now, rather than actually using the interpretation function (maybefun), we can perform the case match in the unionIntersectionDifferenceMergeEverythingInOne function and use the knowledge that one of the algebra functions is trivial or vacuous in order to avoid traversing the appropriate subtree; only traversing each of the three regions of interest as necessary. Adding those case matches will introduce some overhead, though some of it may be avoidable via proper use of inlining. However, I don't think it'd be very pretty to expect people to use this sort of interface, so you'd still end up writing a bunch of helper functions to hide the generality by automatically constructing Alg values. -- Live well, ~wren
participants (5)
-
Bertram Felgenhauer
-
Christian Sattler
-
Johan Tibell
-
Milan Straka
-
wren ng thornton