Proposal: more general unionWith for Data.Map

Hi all, There are some high-level operations on maps which take two tree traversals only because the interface fails to expose sufficiently general functions. This proposal is concerned with an analogue of unionWithKey of type Ord k => (k -> a -> a -> Maybe a) -> Map k a -> Map k a -> Map k a, with the intended semantics that if a key is present in both maps and the operation applied to the key and corresponding values returns Nothing, the key is deleted from the result. Example use: sparse vectors are represented as maps from indices to non-zero coordinates. Often, a highly sparse vector u needs to be a added to a rather dense sparse vector v, i.e. we have m << n where m and n are the counts of non-zero elements in u and v, respectively. I need this operation to be O(m + n) as well as O(m * log(n)), and it should be as efficient as reasonable possible. This seems like a straightforward use case for unionWith since its hedge-union algorithm guarantees O(m * log(n)). The only problem occurs when u[i] = x and v[i] = -x for some index i and non-zero x, resulting in a zero entry I would rather have deleted from the result (this is a common case when dealing with matrices with small integer entries). The same problems occurs for coordinate-wise vector multiplication over non-integral domains, this time with intersectionWith. Remarks: - It would align the types of unionWith and intersectionWith with differenceWith. - The concept of allowing operations with return type Maybe a to avoid an additional high-level operation is already expressed in functions like alter or mapMaybeWithKey. - It would enable an efficient implementation of symmetricDifference = unionMaybeWithKey (const Nothing). - This concerns Data.Map, Data.IntMap, Data.Set, Data.IntSet and the With[Key] versions of union/intersection/difference - At least for Data.Map.unionWithKey, modifying the existing implementation seems trivial, requiring a merge instead of a join in the new case. I apologize if some of these points have been brought up before. Christian

Hi,
There are some high-level operations on maps which take two tree traversals only because the interface fails to expose sufficiently general functions. This proposal is concerned with an analogue of unionWithKey of type Ord k => (k -> a -> a -> Maybe a) -> Map k a -> Map k a -> Map k a, with the intended semantics that if a key is present in both maps and the operation applied to the key and corresponding values returns Nothing, the key is deleted from the result.
Example use: sparse vectors are represented as maps from indices to non-zero coordinates. Often, a highly sparse vector u needs to be a added to a rather dense sparse vector v, i.e. we have m << n where m and n are the counts of non-zero elements in u and v, respectively. I need this operation to be O(m + n) as well as O(m * log(n)), and it should be as efficient as reasonable possible. This seems like a straightforward use case for unionWith since its hedge-union algorithm guarantees O(m * log(n)). The only problem occurs when u[i] = x and v[i] = -x for some index i and non-zero x, resulting in a zero entry I would rather have deleted from the result (this is a common case when dealing with matrices with small integer entries). The same problems occurs for coordinate-wise vector multiplication over non-integral domains, this time with intersectionWith.
Remarks: - It would align the types of unionWith and intersectionWith with differenceWith.
FYI, development version of containers has generalized intersectionWith[Key] with type intersectionWithKey :: Ord k => (k -> a -> b -> Maybe c) -> Map k a -> Map k b -> Map k c If others agree, it is indeed simple to generalize type of unionWith[Key] from unionWithKey :: Ord k => (k -> a -> a -> a) -> Map k a -> Map k a -> Map k a to unionWithKey :: Ord k => (k -> a -> a -> Maybe a) -> Map k a -> Map k a -> Map k a These functions are used quite frequently in my opinion, so it will probably result in quite a lot of code breaks (unionWith (++) has to be changed to unionWith (\a b -> Just (a ++ b))). Any opinion, noble almighty others? Cheers, Milan

Milan Straka schrieb:
If others agree, it is indeed simple to generalize type of unionWith[Key] from unionWithKey :: Ord k => (k -> a -> a -> a) -> Map k a -> Map k a -> Map k a to unionWithKey :: Ord k => (k -> a -> a -> Maybe a) -> Map k a -> Map k a -> Map k a These functions are used quite frequently in my opinion, so it will probably result in quite a lot of code breaks (unionWith (++) has to be changed to unionWith (\a b -> Just (a ++ b))).
Just for consistency with other "With" functions, the result of the merger function should be 'a' not 'Maybe a'. Maybe 'unionUpdateWithKey' or so would show the similarity to 'update'.

On Tue, Jan 24, 2012 at 9:35 AM, Christian Sattler
There are some high-level operations on maps which take two tree traversals only because the interface fails to expose sufficiently general functions. This proposal is concerned with an analogue of unionWithKey of type Ord k => (k -> a -> a -> Maybe a) -> Map k a -> Map k a -> Map k a, with the intended semantics that if a key is present in both maps and the operation applied to the key and corresponding values returns Nothing, the key is deleted from the result.
Is union really an appropriate name here? I expect the following to hold: forall k ∊ (keys m1) ∪ (keys m2) => k ∊ (m1 ∪ m2) This means that keys must not be deleted by the union operator. Perhaps 'merge' is a better name. Cheers, Johan

Johan Tibell schrieb:
On Tue, Jan 24, 2012 at 9:35 AM, Christian Sattler
wrote: There are some high-level operations on maps which take two tree traversals only because the interface fails to expose sufficiently general functions. This proposal is concerned with an analogue of unionWithKey of type Ord k => (k -> a -> a -> Maybe a) -> Map k a -> Map k a -> Map k a, with the intended semantics that if a key is present in both maps and the operation applied to the key and corresponding values returns Nothing, the key is deleted from the result.
Is union really an appropriate name here? I expect the following to hold:
forall k ∊ (keys m1) ∪ (keys m2) => k ∊ (m1 ∪ m2)
This means that keys must not be deleted by the union operator. Perhaps 'merge' is a better name.
Sounds better than my 'unionUpdate'.

I have needed this before merge :: (a -> b -> Maybe c) -> (a -> Maybe c) -> (b -> Maybe c) -> Map k a -> Map k b -> Map k c mergeWithKey :: (k -> a -> b -> Maybe c) -> (k -> a -> Maybe c) -> (k-> b -> Maybe c) -> Map k a -> Map k b -> Map k c where the function arguments are what to use when the key appears in both maps what to use when the key appears in just the left map what to use when the key appears in just the right map. It seems like it would be the ultimate fully general fast map merging routine. intersection = merge (Just . const) (const Nothing) (const Nothing) union = merge (Just . const) Just Just xor = merge ( const Nothing . const) Just Just and so forth... John John

Hi,
merge :: (a -> b -> Maybe c) -> (a -> Maybe c) -> (b -> Maybe c) -> Map k a -> Map k b -> Map k c mergeWithKey :: (k -> a -> b -> Maybe c) -> (k -> a -> Maybe c) -> (k-> b -> Maybe c) -> Map k a -> Map k b -> Map k c
where the function arguments are what to use when the key appears in both maps what to use when the key appears in just the left map what to use when the key appears in just the right map.
It seems like it would be the ultimate fully general fast map merging routine.
intersection = merge (Just . const) (const Nothing) (const Nothing) union = merge (Just . const) Just Just xor = merge ( const Nothing . const) Just Just
and so forth...
I agree with the ultimate generality, but the problem is efficiency. For example running intersection (fromList [1..1000]) (fromList [1001..2000]) will end very soon with empty result, but merge (Just . const) (const Nothing) (const Nothing) (fromList [1..1000]) (fromList [1001..2000]) will call the (const Nothing) for 2000 times. Cheers, Milan

On Tue, Jan 24, 2012 at 3:20 PM, Milan Straka
Hi,
merge :: (a -> b -> Maybe c) -> (a -> Maybe c) -> (b -> Maybe c) -> Map k a -> Map k b -> Map k c mergeWithKey :: (k -> a -> b -> Maybe c) -> (k -> a -> Maybe c) -> (k-> b -> Maybe c) -> Map k a -> Map k b -> Map k c
where the function arguments are what to use when the key appears in both maps what to use when the key appears in just the left map what to use when the key appears in just the right map.
It seems like it would be the ultimate fully general fast map merging routine.
intersection = merge (Just . const) (const Nothing) (const Nothing) union = merge (Just . const) Just Just xor = merge ( const Nothing . const) Just Just
and so forth...
I agree with the ultimate generality, but the problem is efficiency. For example running intersection (fromList [1..1000]) (fromList [1001..2000]) will end very soon with empty result, but merge (Just . const) (const Nothing) (const Nothing) (fromList [1..1000]) (fromList [1001..2000]) will call the (const Nothing) for 2000 times.
I have managed to use very general functions inside unordered-container with maintained performance by using this pattern: -- Internal only. Always inlined. merge' :: (a -> b -> Maybe c) -> (a -> Maybe c) -> (b -> Maybe c) -> Map k a -> Map k b -> Map k c {-# INLINE merge' #-} merge :: (a -> b -> Maybe c) -> (a -> Maybe c) -> (b -> Maybe c) -> Map k a -> Map k b -> Map k c merge = merge' {-# INLINABLE merge #-} intersection = merge' (Just . const) (const Nothing) (const Nothing) {-# INLINABLE intersection #-} union = merge' (Just . const) Just Just {-# INLINABLE union #-} xor = merge' (const Nothing . const) Just Just {-# INLINABLE xor #-} By inlining merge the case-of-case, case-of-known-constructor, and dead-code-elimination optimizations should be able to remove all unused code. -- Johan

On 01/24/2012 11:27 PM, Johan Tibell wrote:
On Tue, Jan 24, 2012 at 3:20 PM, Milan Straka
wrote: Hi,
merge :: (a -> b -> Maybe c) -> (a -> Maybe c) -> (b -> Maybe c) -> Map k a -> Map k b -> Map k c mergeWithKey :: (k -> a -> b -> Maybe c) -> (k -> a -> Maybe c) -> (k-> b -> Maybe c) -> Map k a -> Map k b -> Map k c
where the function arguments are what to use when the key appears in both maps what to use when the key appears in just the left map what to use when the key appears in just the right map.
It seems like it would be the ultimate fully general fast map merging routine.
intersection = merge (Just . const) (const Nothing) (const Nothing) union = merge (Just . const) Just Just xor = merge ( const Nothing . const) Just Just
and so forth... I agree with the ultimate generality, but the problem is efficiency. For example running intersection (fromList [1..1000]) (fromList [1001..2000]) will end very soon with empty result, but merge (Just . const) (const Nothing) (const Nothing) (fromList [1..1000]) (fromList [1001..2000]) will call the (const Nothing) for 2000 times. I have managed to use very general functions inside unordered-container with maintained performance by using this pattern:
-- Internal only. Always inlined. merge' :: (a -> b -> Maybe c) -> (a -> Maybe c) -> (b -> Maybe c) -> Map k a -> Map k b -> Map k c {-# INLINE merge' #-}
merge :: (a -> b -> Maybe c) -> (a -> Maybe c) -> (b -> Maybe c) -> Map k a -> Map k b -> Map k c merge = merge' {-# INLINABLE merge #-}
intersection = merge' (Just . const) (const Nothing) (const Nothing) {-# INLINABLE intersection #-}
union = merge' (Just . const) Just Just {-# INLINABLE union #-}
xor = merge' (const Nothing . const) Just Just {-# INLINABLE xor #-}
By inlining merge the case-of-case, case-of-known-constructor, and dead-code-elimination optimizations should be able to remove all unused code.
-- Johan
Some non-trivial recursive optimizations are required here to meet expected performance targets, like recognizing that f t = case t of Bin s k x l r -> Bin s k x (f l) (f r); Tip -> Tip can be replaced by id. I would be pleasantly surprised if the compiler was able to handle this.

Oh yeah, I didn't mean we should actually replace the implementations, I was just demonstrating that merge is truely a more general operation and provide evidence it was useful. The most recent use case for me dealt with a lattice (in the mathematical sense) class, where the result of a meet or join could be a different type than the arguments... or maybe it was a boolean algebra class... hrmm.. or was it an implemenation of approximations of probability distributions where the map contained representative samples... sigh.. I juggle too many projects at once.. :) John

Argh, sent this out before in reply to the wrong message (and not
reply-to-all). Sorry Milan.
On Tue, Jan 24, 2012 at 6:08 PM, John Meacham
I have needed this before
merge :: (a -> b -> Maybe c) -> (a -> Maybe c) -> (b -> Maybe c) -> Map k a -> Map k b -> Map k c mergeWithKey :: (k -> a -> b -> Maybe c) -> (k -> a -> Maybe c) -> (k-> b -> Maybe c) -> Map k a -> Map k b -> Map k c
where the function arguments are what to use when the key appears in both maps what to use when the key appears in just the left map what to use when the key appears in just the right map.
It seems like it would be the ultimate fully general fast map merging routine.
intersection = merge (Just . const) (const Nothing) (const Nothing) union = merge (Just . const) Just Just xor = merge ( const Nothing . const) Just Just
and so forth...
As others (including Milan) observed, this isn't asymptotically efficient if the maps don't overlap much. Better is the following swiss army knife of combining functions: combine :: Ord k => (k -> a -> b -> Maybe c) -- Combine common entries -> (Map k a -> Map k c) -- Convert left submap with no common entries -> (Map k b -> Map k c) -- Convert right submap with no common entries -> Map k a -> Map k b -> Map k c Now unionWithKey f = combine (\k a b -> Just (f k a b)) id id intersectionWithKey f = combine f (const empty) (const empty) with no change in asymptotic complexity, and a bunch of fusion laws with fmap hold. This also permits difference and symmetric difference to be easily defined: difference = combine (\_ _ _-> Nothing) id (const empty) symmDifference = combine (\_ _ _ -> Nothing) id id When you inline combine and (where necessary) mapWithKey, you get exactly the code you'd expect without fancy footwork to delete the identity tree traversals. -Jan

On 01/25/2012 11:26 PM, Jan-Willem Maessen wrote:
Argh, sent this out before in reply to the wrong message (and not reply-to-all). Sorry Milan.
On Tue, Jan 24, 2012 at 6:08 PM, John Meacham
mailto:john@repetae.net> wrote: I have needed this before
merge :: (a -> b -> Maybe c) -> (a -> Maybe c) -> (b -> Maybe c) -> Map k a -> Map k b -> Map k c mergeWithKey :: (k -> a -> b -> Maybe c) -> (k -> a -> Maybe c) -> (k-> b -> Maybe c) -> Map k a -> Map k b -> Map k c
where the function arguments are what to use when the key appears in both maps what to use when the key appears in just the left map what to use when the key appears in just the right map.
It seems like it would be the ultimate fully general fast map merging routine.
intersection = merge (Just . const) (const Nothing) (const Nothing) union = merge (Just . const) Just Just xor = merge ( const Nothing . const) Just Just
and so forth...
As others (including Milan) observed, this isn't asymptotically efficient if the maps don't overlap much. Better is the following swiss army knife of combining functions:
combine :: Ord k => (k -> a -> b -> Maybe c) -- Combine common entries -> (Map k a -> Map k c) -- Convert left submap with no common entries -> (Map k b -> Map k c) -- Convert right submap with no common entries -> Map k a -> Map k b -> Map k c
Now
unionWithKey f = combine (\k a b -> Just (f k a b)) id id intersectionWithKey f = combine f (const empty) (const empty)
with no change in asymptotic complexity, and a bunch of fusion laws with fmap hold. This also permits difference and symmetric difference to be easily defined:
difference = combine (\_ _ _-> Nothing) id (const empty) symmDifference = combine (\_ _ _ -> Nothing) id id
When you inline combine and (where necessary) mapWithKey, you get exactly the code you'd expect without fancy footwork to delete the identity tree traversals.
-Jan
I don't think this solves the complexity issue e.g. for symmetric difference of a large set of size n^2 with a small set of size n.

On Wed, Jan 25, 2012 at 7:01 PM, Christian Sattler < sattler.christian@gmail.com> wrote:
... As others (including Milan) observed, this isn't asymptotically efficient if the maps don't overlap much. Better is the following swiss army knife of combining functions:
combine :: Ord k => (k -> a -> b -> Maybe c) -- Combine common entries -> (Map k a -> Map k c) -- Convert left submap with no common entries -> (Map k b -> Map k c) -- Convert right submap with no common entries -> Map k a -> Map k b -> Map k c
Now
unionWithKey f = combine (\k a b -> Just (f k a b)) id id intersectionWithKey f = combine f (const empty) (const empty)
with no change in asymptotic complexity, and a bunch of fusion laws with fmap hold. This also permits difference and symmetric difference to be easily defined:
difference = combine (\_ _ _-> Nothing) id (const empty) symmDifference = combine (\_ _ _ -> Nothing) id id
When you inline combine and (where necessary) mapWithKey, you get exactly the code you'd expect without fancy footwork to delete the identity tree traversals.
-Jan
I don't think this solves the complexity issue e.g. for symmetric difference of a large set of size n^2 with a small set of size n.
Er, could you be more specific? What case is not handled that prevents us from matching the asymptotic complexity of a hand-written symmetric difference function? There should be O(n) tree splits and joins in either case. -Jan

On 01/26/2012 12:08 AM, Jan-Willem Maessen wrote:
On Wed, Jan 25, 2012 at 7:01 PM, Christian Sattler
mailto:sattler.christian@gmail.com> wrote: ... As others (including Milan) observed, this isn't asymptotically efficient if the maps don't overlap much. Better is the following swiss army knife of combining functions:
combine :: Ord k => (k -> a -> b -> Maybe c) -- Combine common entries -> (Map k a -> Map k c) -- Convert left submap with no common entries -> (Map k b -> Map k c) -- Convert right submap with no common entries -> Map k a -> Map k b -> Map k c
Now
unionWithKey f = combine (\k a b -> Just (f k a b)) id id intersectionWithKey f = combine f (const empty) (const empty)
with no change in asymptotic complexity, and a bunch of fusion laws with fmap hold. This also permits difference and symmetric difference to be easily defined:
difference = combine (\_ _ _-> Nothing) id (const empty) symmDifference = combine (\_ _ _ -> Nothing) id id
When you inline combine and (where necessary) mapWithKey, you get exactly the code you'd expect without fancy footwork to delete the identity tree traversals.
-Jan
I don't think this solves the complexity issue e.g. for symmetric difference of a large set of size n^2 with a small set of size n.
Er, could you be more specific? What case is not handled that prevents us from matching the asymptotic complexity of a hand-written symmetric difference function? There should be O(n) tree splits and joins in either case.
-Jan
Ah, yes, you are right. Mistake on my part. +1 for the proposal
participants (6)
-
Christian Sattler
-
Henning Thielemann
-
Jan-Willem Maessen
-
Johan Tibell
-
John Meacham
-
Milan Straka