
Hi, http://hackage.haskell.org/trac/ghc/ticket/1218 I propose the addition of sortNub and sortNubBy. Semantically sortNub = sort . nub Practically, if you are doing both a sort and a nub, this can be implemented as: sortNub = map head . group . sort This is O(n log n) [time to sort], rather than O(n^2) [time to nub]. I have seen this around several times, often called snub. People have also used snub to mean other things, snub itself has a meaning, and sortNub is a more accurate name (following the concatMap tradition). I personally have defined this function at least 25 times, I suspect others have to. I can find 5 identical versions of it on google code [1], and I know google code is lying because it's missing ones it found earlier. Thanks Neil [1] http://www.google.com/codesearch?hl=en&lr=&q=lang%3Ahaskell+%22map+head+.+group+.+sort%22&btnG=Search

On Tue, Mar 13, 2007 at 03:16:44AM +0000, Neil Mitchell wrote:
Hi,
http://hackage.haskell.org/trac/ghc/ticket/1218
I propose the addition of sortNub and sortNubBy.
Semantically sortNub = sort . nub
Practically, if you are doing both a sort and a nub, this can be implemented as:
sortNub = map head . group . sort
This is O(n log n) [time to sort], rather than O(n^2) [time to nub].
I have seen this around several times, often called snub. People have also used snub to mean other things, snub itself has a meaning, and sortNub is a more accurate name (following the concatMap tradition).
I personally have defined this function at least 25 times, I suspect others have to. I can find 5 identical versions of it on google code [1], and I know google code is lying because it's missing ones it found earlier.
ordNub = flip evalState S.empty . filterM (State . tst) where tst x st = (not $ S.member x st, S.insert x st) ordNub ls = ordNub' ls S.empty where ordNub' (x:xs) set | S.member x set = ordNub' xs set | otherwise = x : ordNub' xs (S.insert x set) ordNub' [] set = [] O(n log n), doesn't change order. Perhaps also worthy? Stefan

On Mon, Mar 12, 2007 at 08:25:07PM -0700, Stefan O'Rear wrote:
On Tue, Mar 13, 2007 at 03:16:44AM +0000, Neil Mitchell wrote:
Hi,
http://hackage.haskell.org/trac/ghc/ticket/1218
I propose the addition of sortNub and sortNubBy.
Semantically sortNub = sort . nub
Practically, if you are doing both a sort and a nub, this can be implemented as:
sortNub = map head . group . sort
This is O(n log n) [time to sort], rather than O(n^2) [time to nub].
I have seen this around several times, often called snub. People have also used snub to mean other things, snub itself has a meaning, and sortNub is a more accurate name (following the concatMap tradition).
I personally have defined this function at least 25 times, I suspect others have to. I can find 5 identical versions of it on google code [1], and I know google code is lying because it's missing ones it found earlier.
ordNub = flip evalState S.empty . filterM (State . tst) where tst x st = (not $ S.member x st, S.insert x st)
ordNub ls = ordNub' ls S.empty where ordNub' (x:xs) set | S.member x set = ordNub' xs set | otherwise = x : ordNub' xs (S.insert x set) ordNub' [] set = []
O(n log n), doesn't change order. Perhaps also worthy?
not only does it preserve order, but it works on infinite lists and is lazy :) John -- John Meacham - ⑆repetae.net⑆john⑈

Neil Mitchell schrieb: [...]
sortNub = map head . group . sort
sortNub = Data.Set.toList . Data.Set.fromList
This is O(n log n) [time to sort], rather than O(n^2) [time to nub]. [...]
I personally have defined this function at least 25 times, I suspect others have to.
Maybe you should have switched from lists to sets anyway. Cheers Christian

Hi
sortNub = Data.Set.toList . Data.Set.fromList
This implementation seems to perform better on at least one particular usage case. I'll take a closer look and see if this is always the case. Are we going to have dependency problems by making Data.List depend on Data.Set? My guess is yes, but I'm not sure.
I personally have defined this function at least 25 times, I suspect others have to.
Maybe you should have switched from lists to sets anyway.
I use Set's where appropriate, I use List's where appropriate. Just because I'm using a list which I want nub'ing doesn't mean it is a set. Set's also don't allow things like pattern matching, and don't have the same range of functions over them, so lists are sometimes easier for certain tasks. Thanks Neil

Hello Neil, Tuesday, March 13, 2007, 12:38:14 PM, you wrote:
Are we going to have dependency problems by making Data.List depend on Data.Set? My guess is yes, but I'm not sure.
this is only an *implementation* dependency and it's ok as long as base includes both modules. if sometime in future situation will change, we can change implementation without boring users -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin schrieb:
Hello Neil,
Tuesday, March 13, 2007, 12:38:14 PM, you wrote:
Are we going to have dependency problems by making Data.List depend on Data.Set? My guess is yes, but I'm not sure.
this is only an *implementation* dependency and it's ok as long as base includes both modules. if sometime in future situation will change, we can change implementation without boring users
In fact, as an implementation of sortNub/sortNubBy (if that is what we want) I would rather suggest to generalize the merge function (i.e. as below) for the mergesort algorithm (and possibly export merge, too). Cheers Christian merge :: (a -> a -> Ordering) -> (a -> a -> [a] -> [a]) -> [a] -> [a] -> [a] merge cmp jn l1 l2 = case l1 of [] -> l2 x1 : r1 -> case l2 of [] -> l1 x2 : r2 -> let recmerge = merge cmp jn in case cmp x1 x2 of LT -> x1 : recmerge r1 l2 EQ -> jn x1 x2 $ recmerge r1 r2 GT -> x2 : recmerge l1 r2

Hi
this is only an *implementation* dependency and it's ok as long as base includes both modules. if sometime in future situation will change, we can change implementation without boring users
Yes, the main question here is should the functions be included in the Data.List module. If the answer is yes, how we implement them is almost an irrelevancy, provided they satisfy the invariants we specify. I more meant that I suspect that Data.Set imports Data.List transitively (although I do not know, its just a suspicion). If this is the case, then the implementation would either require chunks to be restructured, or recursively dependant modules (within the package) would exist. It is Haskell 98, but it is not fun :) Christian's mergesort idea is probably fast and doesn't require painful dependencies. Of course only benchmarks prove fast, so I'll benchmark whatever comes up at the end of the day. As for Stefan's ordNub, I think it is a great idea to add along with sortNub - this really does give people a nicer set of Nub functions. My only worry is that the dependency on State and Set will make the implementation have cyclic dependency fun. Thanks Neil

Hello Neil, Tuesday, March 13, 2007, 2:12:40 PM, you wrote:
I more meant that I suspect that Data.Set imports Data.List transitively (although I do not know, its just a suspicion). If this is the case, then the implementation would either require chunks to be restructured, or recursively dependant modules (within the package) would exist. It is Haskell 98, but it is not fun :)
afaik, hugs don't support recursive modules? -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hi
I more meant that I suspect that Data.Set imports Data.List transitively (although I do not know, its just a suspicion). If this is the case, then the implementation would either require chunks to be restructured, or recursively dependant modules (within the package) would exist. It is Haskell 98, but it is not fun :)
afaik, hugs don't support recursive modules?
It doesn't. GHC and nhc don't exactly make them natural though, to the point where I'd say they are supported "with hoop jumping". Not the kind of jumping tricks I think should be in the base libraries more than necessary. I think JHC supports recursive modules properly. Thanks Neil

On Tue, Mar 13, 2007 at 11:48:41AM +0000, Neil Mitchell wrote:
I think JHC supports recursive modules properly.
yes. it collects all modules and breaks them into strongly connected components and compiles each such component as a unit. It does make things quite nice during development. However, I have found that due to the prelude being pulled into everything by default I end up with a massive prelude-wad of recursive modules that can take a bit for jhc to chug through and compile. Though, I don't see this as being an issue for end users since the situation just doesn't work at all with other compilers. Recursive modules don't come up a whole lot, but you have to mentally page out a lot of the task you are working on in order to refactor your module layout when it comes up, it is nice just to be able to continue working and delay thinking about module layout after you complete the task at hand. A lot of the times, recursive modules do make sense though. The logical layout of code you want to present to the user does not always agree with the ideal implementation layout and it is nice to have the ability to decouple those. I believe at least one compiler went the other route and allowed multiple (possibly recursive) module definitions in a single file. John -- John Meacham - ⑆repetae.ne⑆john⑈

On Tue, Mar 13, 2007 at 11:12:40AM +0000, Neil Mitchell wrote:
As for Stefan's ordNub, I think it is a great idea to add along with sortNub - this really does give people a nicer set of Nub functions. My only worry is that the dependency on State and Set will make the implementation have cyclic dependency fun.
once we have ordNub, we can create lots of RULES like the following {-# RULES "nub/ordNub" nub = ordNub :: [String] -> [String] #-} {-# RULES "nub/ordNub" nub = ordNub :: [Int] -> [Int] #-} {-# RULES "nub/ordNub" nub = ordNub :: [Integer] -> [Integer] #-} actually, we probably want one that uses IntMap for ints. it is signifigantly faster. I don't suppose there is a way to match a RULE to a class constraint? I don't see how it would be implemented in any easy way since classes get desugared away pretty early... but it would be an interesting feature.. {-# RULES "nub/ordNub" forall (a::*) . Ord a => nub = ordNub :: [a] -> [a] #-} or something... John -- John Meacham - ⑆repetae.net⑆john⑈

On 13/03/2007, at 18:25, John Meacham wrote:
once we have ordNub, we can create lots of RULES like the following
{-# RULES "nub/ordNub" nub = ordNub :: [String] -> [String] #-} {-# RULES "nub/ordNub" nub = ordNub :: [Int] -> [Int] #-} {-# RULES "nub/ordNub" nub = ordNub :: [Integer] -> [Integer] #-}
actually, we probably want one that uses IntMap for ints. it is signifigantly faster.
very nice idea!
I don't suppose there is a way to match a RULE to a class constraint? I don't see how it would be implemented in any easy way since classes get desugared away pretty early... but it would be an interesting feature..
{-# RULES "nub/ordNub" forall (a::*) . Ord a => nub = ordNub :: [a] -> [a] #-} or something...
I think yes, the right syntax would be something like:
{-# RULES "nub/ordNub" forall (x::Ord a => a). nub x = ordNub x #-}
But hopefully someone else will confirm. Also, what happens if the general Ord rule is added. Is it possible to convince GHC to use the more specific rule for Int before the general one?

john:
On Tue, Mar 13, 2007 at 11:12:40AM +0000, Neil Mitchell wrote:
As for Stefan's ordNub, I think it is a great idea to add along with sortNub - this really does give people a nicer set of Nub functions. My only worry is that the dependency on State and Set will make the implementation have cyclic dependency fun.
once we have ordNub, we can create lots of RULES like the following
{-# RULES "nub/ordNub" nub = ordNub :: [String] -> [String] #-} {-# RULES "nub/ordNub" nub = ordNub :: [Int] -> [Int] #-} {-# RULES "nub/ordNub" nub = ordNub :: [Integer] -> [Integer] #-}
actually, we probably want one that uses IntMap for ints. it is signifigantly faster.
I don't suppose there is a way to match a RULE to a class constraint? I don't see how it would be implemented in any easy way since classes get desugared away pretty early... but it would be an interesting feature..
{-# RULES "nub/ordNub" forall (a::*) . Ord a => nub = ordNub :: [a] -> [a] #-} or something...
Hmm, this type-determining-represetation sounds like a job for associated types :-) -- Don

On Tue, Mar 13, 2007 at 03:16:44AM +0000, Neil Mitchell wrote:
I propose the addition of sortNub and sortNubBy.
I would rather see nubSorted :: Eq a => [a] -> [a] nubSorted (x1:x2:xs) | x1 == x2 = nubSorted (x1:xs) nubSorted (x:xs) = x : nubSorted xs nubSorted [] = [] You can then nubSorted . sort, but you can also use it on data that you know is already sorted (or at least data in which all the dupes you want to eliminate are in clumps). Thanks Ian

Hi
nubSorted :: Eq a => [a] -> [a] nubSorted (x1:x2:xs) | x1 == x2 = nubSorted (x1:xs) nubSorted (x:xs) = x : nubSorted xs nubSorted [] = []
I considered this, but nubSorted imposes a precondition, sortNub ensures a postcondition. As an interface goes sortNub is harder to get wrong. Plus sortNub is likely to be substantially more efficient than nubSorted . sort - to the point where nubSorted . sort is likely to be slower than a normal nub. Thanks Neil

On Tue, Mar 13, 2007 at 11:31:47PM +0000, Neil Mitchell wrote:
nubSorted :: Eq a => [a] -> [a] nubSorted (x1:x2:xs) | x1 == x2 = nubSorted (x1:xs) nubSorted (x:xs) = x : nubSorted xs nubSorted [] = []
I considered this, but nubSorted imposes a precondition, sortNub ensures a postcondition. As an interface goes sortNub is harder to get wrong.
That's all true, but I'd still prefer to have nubSorted than sortNub :-)
Plus sortNub is likely to be substantially more efficient than nubSorted . sort - to the point where nubSorted . sort is likely to be slower than a normal nub.
Huh? nub is quadratic, sort is (n log n) and nubSorted linear. Unless you're talking about constant factors with very small lists, in which case I'd like to see numbers. Thanks Ian

Hi
Huh? nub is quadratic, sort is (n log n) and nubSorted linear.
That was my first guess. Turns out for lots of cases its not quite an accurate reflection of time taken. Consider a list where a reasonable proportion of elements are the same, i.e. a [Char], where we can expect that there are probably < 100 distinct characters, if we are working with text. Given: out = nub in nub will be O(#in * #out). Sort will be O(#in log #in). Where #out is much smaller than log #in, nub will win. For a large list which is being generated "lazily" - i.e. from a file, lots of other ways etc, sort is tail strict, which also makes the memory use substantially worse. I'll get complete benchmarks in a range of situations sometime later. Thanks Neil

"Neil Mitchell"
nub will be O(#in * #out). Sort will be O(#in log #in). Where #out is much smaller than log #in, nub will win.
This can be remedied by having (quick)sort partitioning in three groups (less than, equal to, and greater than the pivot). I think I argued that this should be the default, since it reduces a category of bad-case behaviours. -k -- If I haven't seen further, it is by standing in the footprints of giants

On Wed, Mar 14, 2007 at 12:09:41AM +0000, Ian Lynagh wrote:
On Tue, Mar 13, 2007 at 11:31:47PM +0000, Neil Mitchell wrote:
nubSorted :: Eq a => [a] -> [a] nubSorted (x1:x2:xs) | x1 == x2 = nubSorted (x1:xs) nubSorted (x:xs) = x : nubSorted xs nubSorted [] = []
I considered this, but nubSorted imposes a precondition, sortNub ensures a postcondition. As an interface goes sortNub is harder to get wrong.
That's all true, but I'd still prefer to have nubSorted than sortNub :-)
I dunno. I can't think of a common non-pathological operation that would preserve sorting while adding duplicate elements. not that (map head . group) won't ever come up... but i don't see how it could come up all that often. John -- John Meacham - ⑆repetae.net⑆john⑈

On Mar 13, 2007, at 4:16 , Neil Mitchell wrote:
Hi,
http://hackage.haskell.org/trac/ghc/ticket/1218
I propose the addition of sortNub and sortNubBy.
Semantically sortNub = sort . nub
Practically, if you are doing both a sort and a nub, this can be implemented as:
sortNub = map head . group . sort
This is O(n log n) [time to sort], rather than O(n^2) [time to nub].
I have seen this around several times, often called snub. People have also used snub to mean other things, snub itself has a meaning, and sortNub is a more accurate name (following the concatMap tradition).
I personally have defined this function at least 25 times, I suspect others have to. I can find 5 identical versions of it on google code [1], and I know google code is lying because it's missing ones it found earlier.
Thanks
Neil
[1] http://www.google.com/codesearch?hl=en&lr=&q=lang%3Ahaskell+% 22map+head+.+group+.+sort%22&btnG=Search
Yes! I've written those several times myself. /Björn
participants (10)
-
Bjorn Bringert
-
Bulat Ziganshin
-
Christian Maeder
-
dons@cse.unsw.edu.au
-
Ian Lynagh
-
John Meacham
-
Ketil Malde
-
Neil Mitchell
-
Pepe Iborra
-
Stefan O'Rear