Proposal #2717: Add nubWith, nubOrd

Ok, I've killed the overly ambitious proposal #2629, and replaced it with the more modest proposal #2717 http://hackage.haskell.org/trac/ghc/ticket/2717 which just adds Data.List.nubWith and Data.Set.nubOrd as previously discussed ad nauseam. I know at least one person thought that adding these was a bad idea, but some others thought it was a fine thing to do. Hence the discussion on this list. Bart Massey bart@cs.pdx.edu

Hi, nubOrd: Seems good. Useful function to have. Shame its not in Data.List, but I understand the reasons for that, and think this is a perfectly sensible choice. nubWith: Seems bad. A not particularly useful function (other than to write nubOrd), with a fairly confusing name. I have previously used nubWith to mean nubOn (nubOn f = nubBy ((==) `on` f)) with cacheing of the function f. As a name, "with" only means "with additional stuff" - it gives no hint what the additional stuff is. The concept of stop-lists is a little confusing, and then you tell the user they almost certainly want to pass [] every single time - in that case why do they get an option? Also stop-lists have type "b", so stop-unconstrainted-type-variable seems a more appropriate name :-) The type signature is fairly complex. I'm not even convinced that nubWith really is a nub function, and not just some generalised filter - filterState perhaps. In which case you'd want to generalise Maybe b to (Bool,b). Can you ever imagine anyone other than nubOrd using nubWith? Is it a genuine utility function people have been crying out for? It seems perfectly good to include as a local function in Data.Set to be used to implement nubOrd, but I don't see its value as a standalone export from a very commonly used library full of very useful stuff. So, in summary I think nubOrd is great, and nubWith isn't nub, and isn't great. Thanks Neil
-----Original Message----- From: libraries-bounces@haskell.org [mailto:libraries-bounces@haskell.org] On Behalf Of Bart Massey Sent: 21 October 2008 10:04 am To: libraries@haskell.org Subject: Proposal #2717: Add nubWith, nubOrd
Ok, I've killed the overly ambitious proposal #2629, and replaced it with the more modest proposal #2717 http://hackage.haskell.org/trac/ghc/ticket/2717 which just adds Data.List.nubWith and Data.Set.nubOrd as previously discussed ad nauseam.
I know at least one person thought that adding these was a bad idea, but some others thought it was a fine thing to do. Hence the discussion on this list.
Bart Massey bart@cs.pdx.edu
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
============================================================================== Please access the attached hyperlink for an important electronic communications disclaimer: http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html ==============================================================================

Mitchell, Neil
nubOrd: Seems good. Useful function to have. Shame its not in Data.List, but I understand the reasons for that, and think this is a perfectly sensible choice.
Thanks.
nubWith: Seems bad. A not particularly useful function (other than to write nubOrd), with a fairly confusing name. I have previously used nubWith to mean nubOn (nubOn f = nubBy ((==) `on` f)) with cacheing of the function f. As a name, "with" only means "with additional stuff" - it gives no hint what the additional stuff is.
Were we to keep it, do you have a better naming suggestion? I agree that nubWith is not optimal, but I don't have a great alternative. As you sort of suggest below, maybe nubFilter could work? [The problem starts with the belief of everyone I've ever talked to about it that "nub" itself should have been called "uniq". :-)]
The concept of stop-lists is a little confusing, and then you tell the user they almost certainly want to pass [] every single time - in that case why do they get an option?
One can imagine situations in which you want to accumulate a "stop list" over several runs of the function, but of course nubWith as designed won't let you do that, so that's fail. If they already have a list of things they want pre-stopped, they could pass that in. Imagine a variant of nondescendingSubsequence below in which the accumulator started with 0 instead of the first element, for example. Of course the technical reason the user needs to pass an empty stop list is that we killed the type class that goes with nubWith, so there's no way to know what the empty accumulator of their stop list type actually is.
Also stop-lists have type "b", so stop-unconstrainted-type-variable seems a more appropriate name.
The name "stop list" is a term of art in the algorithm and AI fields, which is why I chose it. I could stick some bib reference in the docs if you thought that would help. In general, the stop list is an accumulator of values that you want to stop after the first one.
The type signature is fairly complex.
I agree. I'd written this off initially for this reason, but other folks on the list seem to be fine with it.
I'm not even convinced that nubWith really is a nub function, and not just some generalised filter - filterState perhaps. In which case you'd want to generalise Maybe b to (Bool,b).
Or to Either b b? Which is preferred in this case?
Can you ever imagine anyone other than nubOrd using nubWith?
Yes. As discussed previously you can get a significant constant factor speedup with nubInt on IntSets if you need it. Also, as discussed previously, you can get a speedup for nubChar by using a mutable array as the stop list. One can also imagine using this in less traditional ways; for example nondescendingSubsequence [] = [] nondescendingSubsequence (e : es) = e : nubWith (\a b-> if (a >= b) then Just a else Nothing) e es
nondescendingSubsequence [1,2,3,2,3,3,4,1,1] [1,2,3,3,3,4]
Is it a genuine utility function people have been crying out for?
IMHO it is a "genuine utility function", whatever that means. It certainly isn't something "people have been crying out for", so if that's the criterion we should omit it. It just seems churlish to hide it, given that it's sitting in there being a perfectly useful building block for things people do cry out for. I'd guess it would be used about as often as "break", which is in there for much the same reason.
It seems perfectly good to include as a local function in Data.Set to be used to implement nubOrd, but I don't see its value as a standalone export from a very commonly used library full of very useful stuff.
We could move it to live in Data.Set with nubOrd if folks think that it doesn't belong in Data.List. Or we could move them both into their own module, although that seems silly to me given that I can't really think what else goes in there.
So, in summary I think nubOrd is great, and nubWith isn't nub, and isn't great.
Thanks hugely for the detailed commentary! Please understand that I'm in no way wedded to the idea of getting any of this in; I want to do what works best for everyone, and am grateful for your and everyone's help in figuring out what that is. Bart Massey bart@cs.pdx.edu

Hi
Were we to keep it, do you have a better naming suggestion?
[Warning: untested code] filterAccum :: (acc -> x -> (acc,Bool)) -> acc -> [x] -> (acc, [x]) filterAccum f a [] = (a, []) filterAccum f a (x:xs) = (an, [x|b]++rest) where (a2,b) = f a x (an,rest) = filterAccum f a2 xs This follows the type of mapAccumL, and is more general than your function. You could change the utility function to be (acc -> x -> (acc,Maybe y)) to get a variant that is more general than both mapAccum and filterAccum.
could work? [The problem starts with the belief of everyone I've ever talked to about it that "nub" itself should have been called "uniq". :-)]
Your function has nothing to do with uniqueness or nubness. It is filtering with a state.
Of course the technical reason the user needs to pass an empty stop list is that we killed the type class that goes with nubWith, so there's no way to know what the empty accumulator of their stop list type actually is.
This all seems like complexity that shouldn't be there. The base library should provide simple things (folds, maps, filters) and simple concepts with very slightly more involved implementations (sort, reverse, nub). Anything that isn't a simple concept should go in a separate library.
general, the stop list is an accumulator of values that you want to stop after the first one.
That's how you use it from nubOrd, not anything to do with the function.
I'm not even convinced that nubWith really is a nub function, and not just some generalised filter - filterState perhaps. In which case you'd want to generalise Maybe b to (Bool,b).
Or to Either b b? Which is preferred in this case?
Either b b is a horrible type, its semantically equivalent to (Bool,b) but with added confusion (in most cases, there are some exceptions). See my above filterState for a more general version.
Can you ever imagine anyone other than nubOrd using nubWith?
Yes. As discussed previously you can get a significant constant factor speedup with nubInt on IntSets if you need it. Also, as discussed previously, you can get a speedup for nubChar by using a mutable array as the stop list.
Sounds like great reasons for adding nubInt (or just nubOrd in the IntMap module). Perhaps there should also be a CharMap module which provides similar values for Char. I can't imagine your nubChar with a mutable array handles all Unicode points? These ideas are starting to be more radical, and sound like the thing a type class or type families would be suited for. Perhaps prototype these ideas in a "fastnub" library on Hackage?
One can also imagine using this in less traditional ways; for example
nondescendingSubsequence [] = [] nondescendingSubsequence (e : es) = e : nubWith (\a b-> if (a >= b) then Just a else Nothing) e es
nondescendingSubsequence [1,2,3,2,3,3,4,1,1] [1,2,3,3,3,4]
Is it a genuine utility function people have been crying out for?
IMHO it is a "genuine utility function", whatever that means. It certainly isn't something "people have been crying out for", so if that's the criterion we should omit it.
I think for the base library that "crying out for" is the minimum criteria. We also need it to have an obvious name, a well explored design space (or an obviously trivial design space), and the desire to support it for all eternity.
It just seems churlish to hide it, given that it's sitting in there being a perfectly useful building block for things people do cry out for. I'd guess it would be used about as often as "break", which is in there for much the same reason.
Hoogle uses break 33 times. It used to use it a lot more, but then I moved to using Parsec. I've never had a desire to use filterState or nubWith. I'm only one person, so this may not be typical.
It seems perfectly good to include as a local function in Data.Set to be used to implement nubOrd, but I don't see its value as a standalone export from a very commonly used library full of very useful stuff.
We could move it to live in Data.Set with nubOrd if folks think that it doesn't belong in Data.List. Or we could move them both into their own module, although that seems silly to me given that I can't really think what else goes in there.
It's too little for its own module - if you are going to do that just make it a separate package. I don't want nubWith moved to Data.Set, I want it hidden. Once its hidden, then it can go wherever (although because of package scoping rules etc inside Data.Set is the best place)
Please understand that I'm in no way wedded to the idea of getting any of this in; I want to do what works best for everyone, and am grateful for your and everyone's help in figuring out what that is.
I realise. I really am very enthusiastic about getting nubOrd in (I've even proposed it previously myself), and appreciate the effort you've put into this. Thanks Neil ============================================================================== Please access the attached hyperlink for an important electronic communications disclaimer: http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html ==============================================================================

Mitchell, Neil wrote:
Hi
Were we to keep it, do you have a better naming suggestion?
[Warning: untested code]
filterAccum :: (acc -> x -> (acc,Bool)) -> acc -> [x] -> (acc, [x]) filterAccum f a [] = (a, []) filterAccum f a (x:xs) = (an, [x|b]++rest) where (a2,b) = f a x (an,rest) = filterAccum f a2 xs
This follows the type of mapAccumL, and is more general than your function.
can we call it filterAccumL then? It took me a while to figure out from the code whether it was truly an "L" or "R" version, and in principle there could be filterAccumR (although I don't know why we'd want it)
You could change the utility function to be (acc -> x -> (acc,Maybe y)) to get a variant that is more general than both mapAccum and filterAccum.
in which case the result type would be (acc, [y]), and it would have a similar type and semantics to Data.Maybe.mapMaybe :: (a -> Maybe b) -> [a] -> [b] .. maybeAccumL? :-) -Isaac

Hi
Were we to keep it, do you have a better naming suggestion?
[Warning: untested code]
filterAccum :: (acc -> x -> (acc,Bool)) -> acc -> [x] -> (acc, [x]) filterAccum f a [] = (a, []) filterAccum f a (x:xs) = (an, [x|b]++rest) where (a2,b) = f a x (an,rest) = filterAccum f a2 xs
This follows the type of mapAccumL, and is more general than your function.
can we call it filterAccumL then? It took me a while to figure out from the code whether it was truly an "L" or "R" version, and in principle there could be filterAccumR (although I don't know why we'd want it)
Yes, that would be the sensible name. However, I don't think it deserves to be in the standard libraries :-) If anyone feels it should be, that sounds like it should be a separate proposal from nubOrd.
You could change the utility function to be (acc -> x -> (acc,Maybe y)) to get a variant that is more general than both mapAccum and filterAccum.
in which case the result type would be (acc, [y]), and it would have a similar type and semantics to Data.Maybe.mapMaybe :: (a -> Maybe b) -> [a] -> [b]
.. maybeAccumL? :-)
Yes, that does seem like the correct name for it. Again, I don't think it's a massively useful function - the Accum functions just aren't as useful anymore now Monads are better understood. Thanks Neil ============================================================================== Please access the attached hyperlink for an important electronic communications disclaimer: http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html ==============================================================================

Mitchell, Neil wrote:
Hi
Were we to keep it, do you have a better naming suggestion?
[Warning: untested code]
filterAccum :: (acc -> x -> (acc,Bool)) -> acc -> [x] -> (acc, [x]) filterAccum f a [] = (a, []) filterAccum f a (x:xs) = (an, [x|b]++rest) where (a2,b) = f a x (an,rest) = filterAccum f a2 xs
This follows the type of mapAccumL, and is more general than your function. You could change the utility function to be (acc -> x -> (acc,Maybe y)) to get a variant that is more general than both mapAccum and filterAccum.
In other words, mapAccumL = mapM filterAccumL = filterM when used with the state monad. Concerning the proposal, I agree with Neil, +1 for nubOrd and -1 for nubWith . Adding filterAccum sounds reasonable but should be a separate proposal. The patch introduces two documentation errors concerning the asymptotic complexity: m is *not* the number of duplicate elements but the number of *unique* elements, i.e. n minus the number of duplicates. Furthermore, the proposed documentation for nubOrd should mention what nubOrd actually does. I suggest changing it to: Like 'nub', the 'nubOrd' function removes duplicate elements from a list. But while 'nub' only requires that two elements can be tested for equality ('Eq a'), 'nubOrd' requires that the elements can be ordered ('Ord a'). This allows the 'nubOrd' implementation to be significantly faster; it runs in /O(n log m)/ time where /n/ is the length of the list and /m/ is the number of unique elements in that list. Regards, apfelmus

Bart Massey wrote:
Hmm, this reminds us that Data.Set really needs insertMember :: Ord a => a -> Set a -> (Bool, Set a) so that we don't need to traverse the tree twice each time. And the same for Data.IntSet, of course. Analogous to existing things like Data.Set.deleteFindMin Data.Map.insertLookupWithKey Hasn't someone made this obvious proposal before?
participants (6)
-
apfelmus
-
Bart Massey
-
David Roundy
-
Isaac Dupree
-
Mitchell, Neil
-
Yitzchak Gale