ordNub

Niklas Hambüchen

14 Jul 2013 14 Jul '13

11:20 a.m.

tldr: nub is abnormally slow, we shouldn't use it, but we do. As you might know, Data.List.nub is O(n²). (*) As you might not know, almost *all* practical Haskell projects use it, and that in places where an Ord instance is given, e.g. happy, Xmonad, ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 more (see https://github.com/nh2/haskell-ordnub). I've taken the Ord-based O(n * log n) implementation from yi using a Set: ordNub :: (Ord a) => [a] -> [a] ordNub l = go empty l where go _ [] = [] go s (x:xs) = if x `member` s then go s xs else x : go (insert x s) xs and put benchmarks on http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0... (compare `nub` vs `ordNub`). `ordNub` is not only in a different complexity class, but even seems to perform better than nub for very small numbers of actually different list elements (that's the numbers before the benchmark names). (The benchmark also shows some other potential problem: Using a state monad to keep the set instead of a function argument can be up to 20 times slower. Should that happen?) What do you think about ordNub? I've seen a proposal from 5 years ago about adding a *sort*Nub function started by Neil, but it just died. (*) The mentioned complexity is for the (very common) worst case, in which the number of different elements in the list grows with the list (alias you don't have an N element list with always only 5 different things inside).

Show replies by date

Clark Gaebel

14 Jul 14 Jul

11:31 a.m.

Similarly, I've always used: import qualified Data.HashSet as S nub :: Hashable a => [a] -> [a] nub = S.toList . S.fromList And i can't think of any type which i can't write a Hashable instance, so this is extremely practical. On Jul 14, 2013 7:24 AM, "Niklas Hambüchen" wrote:

...

tldr: nub is abnormally slow, we shouldn't use it, but we do.

As you might know, Data.List.nub is O(n²). (*)

As you might not know, almost *all* practical Haskell projects use it, and that in places where an Ord instance is given, e.g. happy, Xmonad, ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 more (see https://github.com/nh2/haskell-ordnub).

I've taken the Ord-based O(n * log n) implementation from yi using a Set:

ordNub :: (Ord a) => [a] -> [a] ordNub l = go empty l where go _ [] = [] go s (x:xs) = if x `member` s then go s xs else x : go (insert x s) xs

and put benchmarks on

http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0... (compare `nub` vs `ordNub`).

`ordNub` is not only in a different complexity class, but even seems to perform better than nub for very small numbers of actually different list elements (that's the numbers before the benchmark names).

(The benchmark also shows some other potential problem: Using a state monad to keep the set instead of a function argument can be up to 20 times slower. Should that happen?)

What do you think about ordNub?

I've seen a proposal from 5 years ago about adding a *sort*Nub function started by Neil, but it just died.

(*) The mentioned complexity is for the (very common) worst case, in which the number of different elements in the list grows with the list (alias you don't have an N element list with always only 5 different things inside).

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Niklas Hambüchen

11:39 a.m.

One of my main points is: Should we not add such a function (ord-based, same output as nub, stable, no sorting) to base? As the package counting shows, if we don't offer an alternative, people obviously use it, and not to our benefit. (Not to say it this way: We could make the Haskell world fast with smarter fusion, strictness analysis and LLVM backends. Or we could stop using quadratic algorithms.)

Clark Gaebel

11:54 a.m.

Oops sorry I guess my point wasn't clear. Why ord based when hashable is faster? Then there's no reason this has to be in base, it can just be a free function in Data.HashSet. If stability is a concern then there's a way to easily account for that using HashMap. - Clark On Jul 14, 2013 7:48 AM, "Niklas Hambüchen" wrote:

...

One of my main points is:

Should we not add such a function (ord-based, same output as nub, stable, no sorting) to base?

As the package counting shows, if we don't offer an alternative, people obviously use it, and not to our benefit.

(Not to say it this way: We could make the Haskell world fast with smarter fusion, strictness analysis and LLVM backends. Or we could stop using quadratic algorithms.)

Brandon Allbery

15 Jul 15 Jul

4:46 a.m.

On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel wrote:

...

Oops sorry I guess my point wasn't clear.

Why ord based when hashable is faster? Then there's no reason this has to be in base, it can just be a

Did the point about "stable" fly overhead? -- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

Clark Gaebel

12:43 p.m.

Apologies. I was being lazy. Here's a stable version: import qualified Data.HashSet as S hashNub :: (Ord a) => [a] -> [a] hashNub l = go S.empty l where go _ [] = [] go s (x:xs) = if x `S.member` s then go s xs else x : go (S.insert x s) xs Which, again, will probably be faster than the one using Ord, and I can't think of any cases where I'd want the one using Ord instead. I may just not be creative enough, though. - Clark On Mon, Jul 15, 2013 at 12:46 AM, Brandon Allbery wrote:

...

On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel wrote:

...
Oops sorry I guess my point wasn't clear.

Why ord based when hashable is faster? Then there's no reason this has to be in base, it can just be a

Did the point about "stable" fly overhead?

-- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

John Lato

16 Jul 16 Jul

1:46 a.m.

In my tests, using unordered-containers was slightly slower than using Ord, although as the number of repeated elements grows unordered-containers appears to have an advantage. I'm sure the relative costs of comparison vs hashing would affect this also. But both are dramatically better than the current nub. Has anyone looked at Bart's patches to see how difficult it would be to apply them (or re-write them)? On Mon, Jul 15, 2013 at 8:43 PM, Clark Gaebel wrote:

...

Apologies. I was being lazy. Here's a stable version:

import qualified Data.HashSet as S

hashNub :: (Ord a) => [a] -> [a] hashNub l = go S.empty l where go _ [] = [] go s (x:xs) = if x `S.member` s then go s xs else x : go (S.insert x s) xs

Which, again, will probably be faster than the one using Ord, and I can't think of any cases where I'd want the one using Ord instead. I may just not be creative enough, though.

- Clark

On Mon, Jul 15, 2013 at 12:46 AM, Brandon Allbery wrote:

...
On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel wrote:

...
Oops sorry I guess my point wasn't clear.

Why ord based when hashable is faster? Then there's no reason this has

to

...
be in base, it can just be a

Did the point about "stable" fly overhead?

-- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Ivan Lazar Miljenovic

2:31 a.m.

On 16 July 2013 11:46, John Lato wrote:

...

In my tests, using unordered-containers was slightly slower than using Ord, although as the number of repeated elements grows unordered-containers appears to have an advantage. I'm sure the relative costs of comparison vs hashing would affect this also. But both are dramatically better than the current nub.

Has anyone looked at Bart's patches to see how difficult it would be to apply them (or re-write them)?

If I understand correctly, this function is proposed to be added to Data.List which lives in base... but the proposals here are about using either Sets from containers or HashSet from unordered-containers; I thought base wasn't supposed to depend on any other package :/

...

On Mon, Jul 15, 2013 at 8:43 PM, Clark Gaebel wrote:

...
Apologies. I was being lazy. Here's a stable version:

import qualified Data.HashSet as S

hashNub :: (Ord a) => [a] -> [a] hashNub l = go S.empty l where go _ [] = [] go s (x:xs) = if x `S.member` s then go s xs else x : go (S.insert x s) xs

Which, again, will probably be faster than the one using Ord, and I can't think of any cases where I'd want the one using Ord instead. I may just not be creative enough, though.

- Clark

On Mon, Jul 15, 2013 at 12:46 AM, Brandon Allbery wrote:

...
On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel wrote:

...
Oops sorry I guess my point wasn't clear.

Why ord based when hashable is faster? Then there's no reason this has to be in base, it can just be a

Did the point about "stable" fly overhead?

-- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com

John Lato

2:40 a.m.

On Tue, Jul 16, 2013 at 10:31 AM, Ivan Lazar Miljenovic < ivan.miljenovic@gmail.com> wrote:

...

...
In my tests, using unordered-containers was slightly slower than using Ord, although as the number of repeated elements grows unordered-containers appears to have an advantage. I'm sure the relative costs of comparison vs hashing would affect this also. But both are dramatically better than

On 16 July 2013 11:46, John Lato wrote: the

...
current nub.

Has anyone looked at Bart's patches to see how difficult it would be to apply them (or re-write them)?

If I understand correctly, this function is proposed to be added to Data.List which lives in base... but the proposals here are about using either Sets from containers or HashSet from unordered-containers; I thought base wasn't supposed to depend on any other package :/

That was one of the points up for discussion: is it worth including a subset of Set functionality to enable a much better nub in base? Is it even worth having Data.List.nub if it has quadratic complexity? As an alternative, Bart's proposal was for both including ordNub in containers and an improved nub (with no dependencies outside base) in Data.List. Unfortunately the patches are quite old (darcs format), and I don't know how they'd apply to the current situation.

Clark Gaebel

3:21 a.m.

I'm procrastinating something else, so I wrote the patch to unordered-containers. Feel free to comment on the github link: https://github.com/tibbe/unordered-containers/pull/67 I'm still against having an Ord version, since my intuition tells me that hash-based data structures are faster than ordered ones. Someone else can write the patch, though! As a tangent, can anyone think of a data structure for which you can write an Ord instance but Hashable/Eq is impossible (or prove otherwise)? How about the converse? Regards, - Clark On Mon, Jul 15, 2013 at 10:40 PM, John Lato wrote:

...

On Tue, Jul 16, 2013 at 10:31 AM, Ivan Lazar Miljenovic wrote:

...
On 16 July 2013 11:46, John Lato wrote:

...
In my tests, using unordered-containers was slightly slower than using Ord, although as the number of repeated elements grows unordered-containers appears to have an advantage. I'm sure the relative costs of comparison vs hashing would affect this also. But both are dramatically better than the current nub.

Has anyone looked at Bart's patches to see how difficult it would be to apply them (or re-write them)?

If I understand correctly, this function is proposed to be added to Data.List which lives in base... but the proposals here are about using either Sets from containers or HashSet from unordered-containers; I thought base wasn't supposed to depend on any other package :/

That was one of the points up for discussion: is it worth including a subset of Set functionality to enable a much better nub in base? Is it even worth having Data.List.nub if it has quadratic complexity?

As an alternative, Bart's proposal was for both including ordNub in containers and an improved nub (with no dependencies outside base) in Data.List. Unfortunately the patches are quite old (darcs format), and I don't know how they'd apply to the current situation.

Richard A. O'Keefe

3:35 a.m.

On 16/07/2013, at 3:21 PM, Clark Gaebel wrote:

...

I'm still against having an Ord version, since my intuition tells me that hash-based data structures are faster than ordered ones.

There are at least four different things that "an Ord version" might mean: - first sort a list, then eliminate duplicates - sort a list eliminating duplicates stably as you go (think 'merge sort', using 'union' instead of 'merge') - build a balanced tree set as you go - having a list that is already sorted, use that to eliminated duplicates cheaply. These things have different costs. For example, if there are N elements of which U are unique, the first as O(N.log N) cost, the third has O(N.log U) cost, and the fourth has O(N) cost. What I want is more often ordNubBy than ordNub, though.

...

Someone else can write the patch, though!

As a tangent, can anyone think of a data structure for which you can write an Ord instance but Hashable/Eq is impossible (or prove otherwise)? How about the converse?

Since Ord has Eq as a superclass, and since 0 is a functionally correct hash value for anything, if you can implement Ord you can obviously implement Hashable/Eq. Whether it is *useful* to do so is another question. It turns out that it _is_ possible to define good quality hash functions on sets, but most code in the field to do so is pretty bad. (Just a modular sum or exclusive or.)

Clark Gaebel

3:58 a.m.

nubBy is a very good suggestion. Added! Regarding good hash functions: if your data structure is algebraic, you can derive generic and Hashable will give you a pretty good hash function:

...

data ADT a = C0 Int String | C1 [a] deriving Generic

instance Hashable a => Hashable (ADT a)

It's magic! - Clark On Mon, Jul 15, 2013 at 11:35 PM, Richard A. O'Keefe wrote:

...

On 16/07/2013, at 3:21 PM, Clark Gaebel wrote:

...
I'm still against having an Ord version, since my intuition tells me that hash-based data structures are faster than ordered ones.

There are at least four different things that "an Ord version" might mean:

- first sort a list, then eliminate duplicates - sort a list eliminating duplicates stably as you go (think 'merge sort', using 'union' instead of 'merge') - build a balanced tree set as you go - having a list that is already sorted, use that to eliminated duplicates cheaply.

These things have different costs. For example, if there are N elements of which U are unique, the first as O(N.log N) cost, the third has O(N.log U) cost, and the fourth has O(N) cost.

What I want is more often ordNubBy than ordNub, though.

...
Someone else can write the patch, though!

As a tangent, can anyone think of a data structure for which you can write an Ord instance but Hashable/Eq is impossible (or prove otherwise)? How about the converse?

Since Ord has Eq as a superclass, and since 0 is a functionally correct hash value for anything, if you can implement Ord you can obviously implement Hashable/Eq. Whether it is *useful* to do so is another question.

It turns out that it _is_ possible to define good quality hash functions on sets, but most code in the field to do so is pretty bad. (Just a modular sum or exclusive or.)

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

AntC

19 Aug 19 Aug

11:13 a.m.

...

Richard A. O'Keefe writes:

There are at least four different things that "an Ord version" might mean:

- first sort a list, then eliminate duplicates - sort a list eliminating duplicates stably as you go (think 'merge sort', using 'union' instead of 'merge') - build a balanced tree set as you go - having a list that is already sorted, use that to eliminated duplicates cheaply.

These things have different costs. For example, ...

What I want is more often ordNubBy than ordNub, though.

(ordNubBy you can get via a suitable Ord instance for the element type?) The bigger problem is that you might not have a suitable Ord instance. After all, sets are defined by equality/equivalence relation, not necessarily by Ord. There are many other things you might want to do with a set/collection than just remove duplicates. I notice that Data.Set.map = fromList . (map stuff) . toList That is, build two lists (to be GC'd), as well as the set result. So does the performane cost of from/to List outrun the benefit of Data.Set.union? Depends how much you're mapping vs inserting and checking membership.

Tom Ellis

1 Jan 1 Jan

1:24 p.m.

On Mon, Jul 15, 2013 at 11:21:39PM -0400, Clark Gaebel wrote:

...

As a tangent, can anyone think of a data structure for which you can write an Ord instance but Hashable/Eq is impossible (or prove otherwise)? How about the converse?

A conversation on #haskell brought back to mind this question, which I'm not sure was ever answered. Is there a type with an Eq instance for which on Ord instance could not also be provided (i.e. I'm interested in Clark's "converse")? One answer is that I could write data A = A deriving Eq and then not export the constructor. That would make it impossible to write an Ord instance. However, I'm more interested in the question "morally" or "in principle". If we think about algebraic datatypes built from sums and products of base types, which have both Eq and Ord, and the function type construct, which prevents both Eq and Ord, it seems that the two always go together, in principle. Am I missing anything more exotic? IORef's have Eq but not Ord. Is there a good reason for that? Would Ord break "referential transparency" somehow? I doubt it. How about STRef? It seems more likely they shouldn't have Ord. Any thoughts? Tom

Brandon Allbery

16 Jul 16 Jul

3:33 a.m.

On Mon, Jul 15, 2013 at 10:31 PM, Ivan Lazar Miljenovic < ivan.miljenovic@gmail.com> wrote:

...

If I understand correctly, this function is proposed to be added to Data.List which lives in base... but the proposals here are about using either Sets from containers or HashSet from unordered-containers; I thought base wasn't supposed to depend on any other package :/

As I understand it, currently we have a double proposal: simple ordNub in Data.List without external dependencies, and the other one in containers and/or unordered-containers as appropriate. -- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

Conrad Parker

4:08 a.m.

On 16 July 2013 10:31, Ivan Lazar Miljenovic wrote:

...

On 16 July 2013 11:46, John Lato wrote:

...
In my tests, using unordered-containers was slightly slower than using Ord, although as the number of repeated elements grows unordered-containers appears to have an advantage. I'm sure the relative costs of comparison vs hashing would affect this also. But both are dramatically better than the current nub.

Has anyone looked at Bart's patches to see how difficult it would be to apply them (or re-write them)?

If I understand correctly, this function is proposed to be added to Data.List which lives in base... but the proposals here are about using either Sets from containers or HashSet from unordered-containers; I thought base wasn't supposed to depend on any other package :/

This discussion (on -cafe@) is just about what course of action to take; adding such functions to containers or unordered-containers would not require a libraries@ proposal. Conrad.

...

...
On Mon, Jul 15, 2013 at 8:43 PM, Clark Gaebel wrote:

...
Apologies. I was being lazy. Here's a stable version:

import qualified Data.HashSet as S

hashNub :: (Ord a) => [a] -> [a] hashNub l = go S.empty l where go _ [] = [] go s (x:xs) = if x `S.member` s then go s xs else x : go (S.insert x s) xs

Which, again, will probably be faster than the one using Ord, and I can't think of any cases where I'd want the one using Ord instead. I may just not be creative enough, though.

- Clark

On Mon, Jul 15, 2013 at 12:46 AM, Brandon Allbery wrote:

...
On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel wrote:

...
Oops sorry I guess my point wasn't clear.

Why ord based when hashable is faster? Then there's no reason this has to be in base, it can just be a

Did the point about "stable" fly overhead?

-- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Francesco Mazzoli

14 Jul 14 Jul

11:45 a.m.

At Sun, 14 Jul 2013 07:31:05 -0400, Clark Gaebel wrote:

...

Similarly, I've always used:

import qualified Data.HashSet as S

nub :: Hashable a => [a] -> [a] nub = S.toList . S.fromList

And i can't think of any type which i can't write a Hashable instance, so this is extremely practical.

Well, the above is not stable while Niklas’ is. But I guess that’s not the point of your message :). I’ve always avoided “nub” too, and FWIW I’d like a constrained version too—maybe avoiding Data.Set so that it could live in Data.List. I think Ord would be much better than Hashable, since it is 1. in “base” 2. much more established and understood. Although if you find yourself using “nub” too much you’re probably doing something wrong... Francesco

Ketil Malde

16 Jul 16 Jul

12:46 p.m.

Francesco Mazzoli writes:

...

...
import qualified Data.HashSet as S

nub :: Hashable a => [a] -> [a] nub = S.toList . S.fromList

...

Well, the above is not stable while Niklas’ is. But I guess that’s not the point of your message :).

We could also implement Data.BloomFilter.nub, which removes equal elements probabilistically (with a small but non-zero chance of removing some unique elements) :-) -k -- If I haven't seen further, it is by standing in the footprints of giants

Joey Adams

15 Jul 15 Jul

1:54 a.m.

On Sun, Jul 14, 2013 at 7:31 AM, Clark Gaebel wrote:

...

Similarly, I've always used:

import qualified Data.HashSet as S

nub :: Hashable a => [a] -> [a] nub = S.toList . S.fromList

And i can't think of any type which i can't write a Hashable instance, so this is extremely practical.

This won't yield results lazily (e.g. nub (repeat 'x') = _|_ instead of 'x' : _|_), but Niklas' ordNub will. His ordNub can be translated directly to HashSet and still have the stability and laziness properties. A difficulty with putting ordNub in Data.List is that it depends on containers, which is outside of the base package. Some options: * Move the implementation of Set to base. * Implement a lean version of Set in base that only provides 'insert' and 'member'. * Define ordNub in Data.Set instead. Adding a Hashable-based nub to base would be even more problematic, since you'd need Hashable in base.

Conrad Parker

2:24 a.m.

On 15 July 2013 09:54, Joey Adams wrote:

...

On Sun, Jul 14, 2013 at 7:31 AM, Clark Gaebel wrote:

...
Similarly, I've always used:

import qualified Data.HashSet as S

nub :: Hashable a => [a] -> [a] nub = S.toList . S.fromList

And i can't think of any type which i can't write a Hashable instance, so this is extremely practical.

This won't yield results lazily (e.g. nub (repeat 'x') = _|_ instead of 'x' : _|_), but Niklas' ordNub will. His ordNub can be translated directly to HashSet and still have the stability and laziness properties.

A difficulty with putting ordNub in Data.List is that it depends on containers, which is outside of the base package. Some options:

* Move the implementation of Set to base.

* Implement a lean version of Set in base that only provides 'insert' and 'member'.

* Define ordNub in Data.Set instead.

Adding a Hashable-based nub to base would be even more problematic, since you'd need Hashable in base.

Right, I suggest the following community course of action: 1a) add ordNub to Data.Set 1b) add ordNub to Data.Hashable (1 day) 2) make a libraries@ proposal to include a stripped-down Data.Set-like balanced binary tree implementation to base. (2 weeks) 3) bikeshed about the name, eg.: * is "nub" really intuitive? how about "uniq", like in perl/ruby/underscore.js? * but "uniq" in unix only removes _adjacent_ duplicates, confusing! * how about "distinct"? "sole"? "unique"? "azygous"? (7 weeks) 4) Failing consensus on technical grounds (that the stripped-down Data.Set implementation is overkill for one library function), agree that anyone who really cares should just use the version from containers or hashable. Only newbs and textbook authors actually use base anyway, and it's impossible to change the language definition. Prelude will continue to fulfil its role of avoiding success at all costs, quadratic or otherwise. (Please, let's have both 1a and 1b :) Conrad.

Roman Cheplyaka

14 Jul 14 Jul

1:12 p.m.

Something like that should definitely be included in Data.List. Thanks for working on it. Roman * Niklas Hambüchen [2013-07-14 19:20:52+0800]

...

tldr: nub is abnormally slow, we shouldn't use it, but we do.

As you might know, Data.List.nub is O(n²). (*)

As you might not know, almost *all* practical Haskell projects use it, and that in places where an Ord instance is given, e.g. happy, Xmonad, ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 more (see https://github.com/nh2/haskell-ordnub).

I've taken the Ord-based O(n * log n) implementation from yi using a Set:

ordNub :: (Ord a) => [a] -> [a] ordNub l = go empty l where go _ [] = [] go s (x:xs) = if x `member` s then go s xs else x : go (insert x s) xs

and put benchmarks on http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0... (compare `nub` vs `ordNub`).

`ordNub` is not only in a different complexity class, but even seems to perform better than nub for very small numbers of actually different list elements (that's the numbers before the benchmark names).

(The benchmark also shows some other potential problem: Using a state monad to keep the set instead of a function argument can be up to 20 times slower. Should that happen?)

What do you think about ordNub?

I've seen a proposal from 5 years ago about adding a *sort*Nub function started by Neil, but it just died.

(*) The mentioned complexity is for the (very common) worst case, in which the number of different elements in the list grows with the list (alias you don't have an N element list with always only 5 different things inside).

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Niklas Hambüchen

12 Oct 12 Oct

6:47 p.m.

I would like to come back to the original question: How can ordNub be added to base? I guess we agree that Data.List is the right module for a function of type Ord a => [a] -> [a], but this introduces * a cyclic dependency between Data.List and Data.Set * a base dependency on containers. What is the right way to go with that? Should ordNub be introduced as part of Data.Set, as Conrad suggested? It does not really have anything to do with Set, apart from being implemented with it. On 14/07/13 14:12, Roman Cheplyaka wrote:

...

Something like that should definitely be included in Data.List. Thanks for working on it.

Roman

Anthony Cowley

7:43 p.m.

On Oct 12, 2013, at 2:47 PM, Niklas Hambüchen wrote:

...

I would like to come back to the original question:

How can ordNub be added to base?

I guess we agree that Data.List is the right module for a function of type Ord a => [a] -> [a], but this introduces

* a cyclic dependency between Data.List and Data.Set * a base dependency on containers.

What is the right way to go with that?

Should ordNub be introduced as part of Data.Set, as Conrad suggested?

It does not really have anything to do with Set, apart from being implemented with it.

I think nub's behavior is rather set-related, so I don't really understand the objection to putting it in Data.Set. Anthony

...

...
On 14/07/13 14:12, Roman Cheplyaka wrote: Something like that should definitely be included in Data.List. Thanks for working on it.

Roman

Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Niklas Hambüchen

9:12 p.m.

On 12/10/13 20:43, Anthony Cowley wrote:

...

I think nub's behavior is rather set-related, so I don't really understand the objection to putting it in Data.Set.

In sets, the order does not matter, while for nub it does. nub :: Eq a => [a] -> [a] ordNub :: Ord a => [a] -> [a] both do not mention Set, and the fact that Set is used inside my proposed ordNub implementation is a detail not visible to the caller. That's why it looks like a Data.List function to me.

AntC

13 Oct 13 Oct

8:42 p.m.

...

Niklas Hambüchen writes:

In sets, the order does not matter, while for nub it does.

Let's be careful here!. Niklas, when you say "order", do you mean: * the _ordering_ from the Ord instance? Or * the relative sequence of elements in the list?

...

... the fact that Set is used inside my proposed ordNub implementation is a detail not visible to the caller.

If you use the Set library, that fact may be very visible! Because Set re-sequences the whole list, as per its Ord instance. But List.nub preserves the list sequence (except for omitting duplicates). Furthermore, the Ord instance might compare two elements as EQ, even though their Eq instance says they're not equal. So a Set-based ordNub could end up returning: * not the same elements as List.nub * and/or not in the same list sequence I'd call that very much *visible* to the caller.

...

That's why it looks like a Data.List function to me.

[BTW I am still less than convinced that overall a Set-based ordNub is significantly more efficient. I suspect it depends on how big is your list.]

Niklas Hambüchen

10:25 p.m.

On 13/10/13 21:42, AntC wrote:

...

...
Niklas Hambüchen writes:

In sets, the order does not matter, while for nub it does.

Let's be careful here!. Niklas, when you say "order", do you mean: * the _ordering_ from the Ord instance? Or * the relative sequence of elements in the list?

...
... the fact that Set is used inside my proposed ordNub implementation is a detail not visible to the caller.

If you use the Set library, that fact may be very visible! Because Set re-sequences the whole list, as per its Ord instance.

But List.nub preserves the list sequence (except for omitting duplicates).

I mean *exactly* what you say here. ordNub behaves has the same behaviour as nub, while (Set.toList . Set.fromList) doesn't.

...

[BTW I am still less than convinced that overall a Set-based ordNub is significantly more efficient. I suspect it depends on how big is your list.]

What do you mean? ordNub is clearly in a different complexity class, and the benchmarks that I provided show not only this, but also that ordNub is *always* faster than nub, even for singleton lists.

AntC

14 Oct 14 Oct

2:20 a.m.

...

Niklas Hambüchen writes:

...
On 13/10/13 21:42, AntC wrote: ... If you use the Set library, that fact may be very visible! Because Set re-sequences the whole list, as per its Ord instance.

But List.nub preserves the list sequence (except for omitting duplicates).

I mean *exactly* what you say here.

ordNub behaves has the same behaviour as nub, while (Set.toList . Set.fromList) doesn't.

That's great, thank you.

...

...
[BTW I am still less than convinced that overall a Set-based ordNub is significantly more efficient. I suspect it depends on how big is your list.]

What do you mean?

ordNub is clearly in a different complexity class, ...

Yes, I'm not disputing that.

...

... and the benchmarks that I provided show not only this, but also that ordNub is *always* faster than nub, even for singleton lists.

Thanks Niklas, I hadn't spotted those benchmarks back in July. I'm surprised at that result for singletons (and for very small numbers of elements which are in fact each different). Especially because List's `nub` uses `filter` ==> fold, which should be tail-recursive. It seems to me that for small numbers, the Set-based approach still requires comparing each element to each other. Plus there's the overhead for building the Set and inserting each element into it -- where `insert` again walks the Set to find the insertion point. Then here's a further possible optimisation, instead of making separate calls to `member` and `insert`: * Make a single call to insert' :: (Ord a) => a -> Set a -> (Bool, Set a) * The Bool returns True if already a member. * Else returns an updated Set in the snd, with the element inserted.

Niklas Hambüchen

2:35 a.m.

On 14/10/13 03:20, AntC wrote:

...

Thanks Niklas, I hadn't spotted those benchmarks back in July.

No worries :)

...

I'm surprised at that result for singletons (and for very small numbers of elements which are in fact each different).

I think one of the main reasons for the performance difference is that a list node and a Set binary tree node have pretty much the same performance, with the difference that in http://hackage.haskell.org/package/containers-0.5.2.1/docs/src/Data-Set-Base... data Set a = Bin {-# UNPACK #-} !Size !a !(Set a) !(Set a) | Tip there are strictness and unpack annotations, while for data [a] = [] | a : [a] -- pseudo syntax there are not. Good for us in this case, I guess.

...

It seems to me that for small numbers, the Set-based approach still requires comparing each element to each other.

This I don't understand.

...

Then here's a further possible optimisation, instead of making separate calls to `member` and `insert`:

This I understand again. Where do you get insert' from? containers doesn't seem to have it. Do you suggest adding it? Niklas

AntC

5:03 a.m.

...

Niklas Hambüchen writes:

...
On 14/10/13 03:20, AntC wrote: ... Then here's a further possible optimisation, instead of making separate calls to `member` and `insert`:

This I understand again. Where do you get insert' from? containers doesn't seem to have it. Do you suggest adding it?

err, well I didn't have any specific library in mind. More there's a kind of 'folk idiom' for managing data structures, (this applies more for imperative code/update-in-situ than functional) that if you know the next thing you're going to do after failing to find an element is insert it, you might as well get on with the insert there and then. (It's a higher-level analogue of a machine instruction decrement-and- branch-if-zero.) I'm looking at all the remarks about managing libraries and dependencies. Would it make sense to build a stand-alone version of Set purely to support ordNub? Then it needs only methods `empty` and `insertIfAbsent`.

Yitzchak Gale

16 Oct 16 Oct

4:46 p.m.

On Sun, Oct 13, 2013 at 6:25 PM, Niklas Hambüchen wrote:

...

ordNub behaves has the same behaviour as nub, while (Set.toList . Set.fromList) doesn't.

Perhaps you mean that they produce exactly the same output when fully evaluated. But I doubt that their behavior is *exactly* the same in every aspect. Given the differences in strictness between sets and lists, can you prove that nub and nubOrd have *exactly* the same strictness properties in every possible case?

...

...ordNub is *always* faster than nub, even for singleton lists.

This is also a claim that I doubt. You say that you tested the case of singletons. How about this: 2 : replicate 100000 1 ++ [3] Well, you could optimize for that by having nubOrd squeeze runs of of equal elements in the input. But then how about something like this: [3,2,1] ++ take 100000 (cycle [3,2,1]) ++ [4] Or variations on that, cycling some other fairly small number of elements. Set containers must do extra comparisons, whereas the of cost running down the first few hundred elements of a linked list (as long as the compiler fuses the iteration down to a tight loop in the output code and you remain within the cache) is near zero. How could nubOrd be as fast as ord in these kinds of cases? I'm not saying that it wouldn't be worthwhile to add a standard well-optimized implementation of nubOrd somewhere. But nub is a very useful function. The only advantage of nubOrd is for very large lists where the complexity advantage overtakes the excellent constants of nub to provide better speed. In practice, the cases where that's worthwhile are unusual. I rarely bother with nubOrd over nub. -Yitz

Kim-Ee Yeoh

5:20 p.m.

On Wed, Oct 16, 2013 at 11:46 PM, Yitzchak Gale wrote:

...

I'm not saying that it wouldn't be worthwhile to add a standard well-optimized implementation of nubOrd somewhere. But nub is a very useful function.

+1 In many cases, an elementary operational semantics beats something more complicated, even if faster. Please don't kill nub. -- Kim-Ee

Niklas Hambüchen

5:32 p.m.

On 16/10/13 17:46, Yitzchak Gale wrote:

...

On Sun, Oct 13, 2013 at 6:25 PM, Niklas Hambüchen wrote:

...
ordNub behaves has the same behaviour as nub, while (Set.toList . Set.fromList) doesn't.

Perhaps you mean that they produce exactly the same output when fully evaluated. But I doubt that their behavior is *exactly* the same in every aspect. Given the differences in strictness between sets and lists, can you prove that nub and nubOrd have *exactly* the same strictness properties in every possible case?

Yes. The elements are are evaluated in the same order as in `nub`. We can see that by comparing the implementations. If you have objects that need not be fully evaluated to use (==) on them, and need not fully evaluated *in a different way* when using `compare` on them, then of course the strictness properties are different. That is why one function has "Eq a =>" and the other one "Ord a =>" at the front.

...

...
...ordNub is *always* faster than nub, even for singleton lists.

This is also a claim that I doubt. You say that you tested the case of singletons. How about this:

2 : replicate 100000 1 ++ [3]

Well, you could optimize for that by having nubOrd squeeze runs of of equal elements in the input. But then how about something like this:

[3,2,1] ++ take 100000 (cycle [3,2,1]) ++ [4]

Or variations on that, cycling some other fairly small number of elements. Set containers must do extra comparisons, whereas the of cost running down the first few hundred elements of a linked list (as long as the compiler fuses the iteration down to a tight loop in the output code and you remain within the cache) is near zero. How could nubOrd be as fast as ord in these kinds of cases?

I make the following, really cool proposal: * You add your test cases that you just mentioned to the benchmark list in "ordnub.hs" * run `cabal build` * run `dist/build/ordnub/ordnub -o report.html`. Then you will see that ordNub is, indeed, faster.

...

(as long as the compiler fuses the iteration down to a tight loop in the output code and you remain within the cache)

Even though a list is conceptually simpler than a set, a list node and a set node are not that different (as I explained in the last email). Both have a content and a pointer to the next node. The only difference is that the Set node is optimised with strictness and unboxing, and the list one is not.

...

The only advantage of nubOrd is for very large lists where the complexity advantage overtakes the excellent constants of nub to provide better speed.

Sorry, I think this is absolutely invented. What "excellent constants" do you mean? How large are "very large lists"? I find lists of length 1000 a very common case. An `n log2 n` algorithm is already 100 times faster on those than an n² one. I do not dispute that there exist n² algorithms that are faster than others with better complexity for small inputs. nub is not one of them.

Yitzchak Gale

17 Oct 17 Oct

4:56 p.m.

Niklas Hambüchen wrote:

...

I make the following, really cool proposal:> * You add your test cases that you just mentioned to the benchmark list in "ordnub.hs"...

No thanks. Those were not meant to be test cases to add to a benchmark suite - they were meant to get you thinking a little more deeply about execution models of Haskell in general, and the GHC execution model in particular, given your claim that nubOrd is better than ord is *every* case, in *every* sense of "better". But that is a side issue. Everyone agrees with you that it is useful to add a good implementation of nubOrd to the standard libraries in some convenient and logical place. Let's get back to focusing on that. Thanks, Yitz

Niklas Hambüchen

6:52 p.m.

On 17/10/13 17:56, Yitzchak Gale wrote:

...

Let's get back to focusing on that.

Agreed. My main question is still: Can the Data.List <-> Data.Set issue be resolved? Is it OK to introduce a cyclic dependency between the two, making use of a .boot.hs file?

Andreas Abel

18 Oct 18 Oct

1:03 p.m.

On 17.10.2013 20:52, Niklas Hambüchen wrote:

...

Can the Data.List <-> Data.Set issue be resolved?

Is it OK to introduce a cyclic dependency between the two, making use of a .boot.hs file?

As long as the users of Data.List and Data.Set do not have to fiddle with {-# SOURCE #-} imports, I do not see an issue, so please go ahead. Cheers, Andreas P.S.: The Agda source has tons of .hs-boot files, I hate them, but what can you do? -- Andreas Abel <>< Du bist der geliebte Mensch. Theoretical Computer Science, University of Munich Oettingenstr. 67, D-80538 Munich, GERMANY andreas.abel@ifi.lmu.de http://www2.tcs.ifi.lmu.de/~abel/

Stijn van Drongelen

1:20 p.m.

On Fri, Oct 18, 2013 at 3:03 PM, Andreas Abel wrote:

...

P.S.: The Agda source has tons of .hs-boot files, I hate them, but what can you do?

Implement support for mutually recursive modules? ;)

Dan Doel

1:49 p.m.

On Thu, Oct 17, 2013 at 2:52 PM, Niklas Hambüchen wrote:

...

Can the Data.List <-> Data.Set issue be resolved?

Is it OK to introduce a cyclic dependency between the two, making use of a .boot.hs file?

Data.List and Data.Set aren't even members of the same package. So hs-boot files won't help. -- Dan

Niklas Hambüchen

2:19 p.m.

...

Data.List and Data.Set aren't even members of the same package. So hs-boot files won't help.

Damn, you are right. I even mentioned this myself earlier. It would need a set in base. I wonder if it was ok to use the Data.Set implementation internally in base, without exposing it.

Yitzchak Gale

6:47 p.m.

Niklas Hambüchen wrote:

...

It would need a set in base. I wonder if it was ok to use the Data.Set implementation internally in base, without exposing it.

Ah yes, this is what stopped previous attempts at getting ordNub in. The GHC team is very keen on getting more things *out* of base. Unless something has changed, getting them to include new things in base, even if hidden, would be extremely difficult. Adding additional dependencies and maintenance burdens would add extra work for the overburdened GHC team and slow the GHC release cycle. How about adding a module named something like Data.List.Nub to containers and putting ordNub in there? Or just go back to the idea of putting it in Data.Set. -Yitz

David Thomas

7:13 p.m.

With such a big speedup and not a lot of downside, I'd really rather it wind up central enough to be used in preference to nub wherever possible (including in base!). Perhaps they can duplicate the minimum functionality necessary from Data.Set, internally behind the scenes? Alternatively, maybe we can move nub out of base? On Fri, Oct 18, 2013 at 11:47 AM, Yitzchak Gale wrote:

...

Niklas Hambüchen wrote:

...
It would need a set in base. I wonder if it was ok to use the Data.Set implementation internally in base, without exposing it.

Ah yes, this is what stopped previous attempts at getting ordNub in.

The GHC team is very keen on getting more things *out* of base. Unless something has changed, getting them to include new things in base, even if hidden, would be extremely difficult. Adding additional dependencies and maintenance burdens would add extra work for the overburdened GHC team and slow the GHC release cycle.

How about adding a module named something like Data.List.Nub to containers and putting ordNub in there? Or just go back to the idea of putting it in Data.Set.

-Yitz _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Yitzchak Gale

8:11 p.m.

David Thomas wrote:

...

With such a big speedup and not a lot of downside

That is far from proven, and frankly, almost certainly untrue. But again, it isn't the point here. There *are* case where you do need nubOrd, and those cases are common enough for it to make sense to get it into the libraries.

...

I'd really rather it wind up central enough to be used in preference to nub wherever possible

You are welcome to your own preference. I, and many other experienced Haskellers I know, prefer nub in many cases. Let's just make both easily available please.

...

(including in base!).

I wouldn't mind having ordNub in base. But let's at least get it in somewhere convenient and easily usable. If you focus the energy on trying to get it into base, this proposal will likely die the same death as all previous attempts.

...

Perhaps they can duplicate the minimum functionality necessary from Data.Set, internally behind the scenes?

Why should the GHC team agree to the maintenance burden of that? They have enough of a challenge keeping up with the work they already have. But you know what, give it a try. Just don't get stuck on that point if you meet with resistance.

...

Alternatively, maybe we can move nub out of base?

No. nub is part of the Haskell standard, so it's required to be there. Even if you disagree with my view that it should be there on its own merits. -Yitz

Dan Doel

8:16 p.m.

On Fri, Oct 18, 2013 at 3:13 PM, David Thomas wrote:

...

With such a big speedup and not a lot of downside, I'd really rather it wind up central enough to be used in preference to nub wherever possible (including in base!). Perhaps they can duplicate the minimum functionality necessary from Data.Set, internally behind the scenes? Alternatively, maybe we can move nub out of base?

The string 'nub' does not appear anywhere (that I can find) in base outside of Data.List. All occurrences of 'nub' in Data.List are in the definition and export of nub and nubBy, except one use of nubBy in unionBy, which cannot switch to ordNub, and would have to have its own alternate function. Of all the libraries packaged with GHC, the only one I see that uses nub outside of tests is Cabal, which already depends on containers. And in fact, one of the libraries I grepped through _was_ containers, because it's already about as close as you can get to being in base without actually being there, and is certainly as close as many other things will be once we split up base into more packages. I don't think it's really accurate to suggest that containers is less core than base. It's not just, 'some library.' -- Dan

Roman Cheplyaka

12 Oct 12 Oct

9:18 p.m.

* Anthony Cowley [2013-10-12 15:43:57-0400]

...

On Oct 12, 2013, at 2:47 PM, Niklas Hambüchen wrote:

...
I would like to come back to the original question:

How can ordNub be added to base?

I guess we agree that Data.List is the right module for a function of type Ord a => [a] -> [a], but this introduces

* a cyclic dependency between Data.List and Data.Set * a base dependency on containers.

What is the right way to go with that?

Should ordNub be introduced as part of Data.Set, as Conrad suggested?

It does not really have anything to do with Set, apart from being implemented with it.

I think nub's behavior is rather set-related, so I don't really understand the objection to putting it in Data.Set.

It's not Set (in the data structure sense) related. It's list-related, because it clearly acts on lists. Therefore, it belongs to Data.List. Besides, we already have the precedent of the slow nub being in Data.List. Roman

Thomas DuBuisson

15 Jul 15 Jul

3:16 a.m.

Just so people are aware - five years ago the notion of nubOrd and nubWith was discussed and a consensus reached on including nubOrd. I think Bart got too busy, didn't submit a final patch, and no one with commit access actually commited any code. http://haskell.1045720.n5.nabble.com/GHC-2717-Add-nubWith-nubOrd-td3159919.h... I fully support an efficient nub implementation making its way into base - it's far past time. Using Set seems sensible. Cheers, Thomas On Sun, Jul 14, 2013 at 4:20 AM, Niklas Hambüchen wrote:

...

tldr: nub is abnormally slow, we shouldn't use it, but we do.

As you might know, Data.List.nub is O(n²). (*)

As you might not know, almost *all* practical Haskell projects use it, and that in places where an Ord instance is given, e.g. happy, Xmonad, ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 more (see https://github.com/nh2/haskell-ordnub).

I've taken the Ord-based O(n * log n) implementation from yi using a Set:

ordNub :: (Ord a) => [a] -> [a] ordNub l = go empty l where go _ [] = [] go s (x:xs) = if x `member` s then go s xs else x : go (insert x s) xs

and put benchmarks on http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0... (compare `nub` vs `ordNub`).

`ordNub` is not only in a different complexity class, but even seems to perform better than nub for very small numbers of actually different list elements (that's the numbers before the benchmark names).

(The benchmark also shows some other potential problem: Using a state monad to keep the set instead of a function argument can be up to 20 times slower. Should that happen?)

What do you think about ordNub?

I've seen a proposal from 5 years ago about adding a *sort*Nub function started by Neil, but it just died.

(*) The mentioned complexity is for the (very common) worst case, in which the number of different elements in the list grows with the list (alias you don't have an N element list with always only 5 different things inside).

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Jason Dagit

5:26 a.m.

On Sun, Jul 14, 2013 at 4:20 AM, Niklas Hambüchen wrote:

...

tldr: nub is abnormally slow, we shouldn't use it, but we do.

As you might know, Data.List.nub is O(n²). (*)

As you might not know, almost *all* practical Haskell projects use it, and that in places where an Ord instance is given, e.g. happy, Xmonad, ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 more (see https://github.com/nh2/haskell-ordnub).

I've taken the Ord-based O(n * log n) implementation from yi using a Set:

ordNub :: (Ord a) => [a] -> [a] ordNub l = go empty l where go _ [] = [] go s (x:xs) = if x `member` s then go s xs else x : go (insert x s) xs

and put benchmarks on http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0... (compare `nub` vs `ordNub`).

Richard Bird has a book, "Pearls of Functional Algorithm Design" that is meant to teach a form of deriving algorithms from the properties we ask of them. In this book, he gives a possible derivation of ordNub, simply called nub in the book, following the methodology he is teaching. He notes in the text that this derivation feels more complicated than it ought. Here is his version: http://lpaste.net/87625 I just sent you a pull request to add that one and S.toList . S.fromList that was suggested in this thread. I don't think those two implementations are faster than the others but it's nice to have them for completeness. Jason

Niklas Hambüchen

6:43 a.m.

Hey Jason, would you mind giving a short idea of what the point of Bird's implementation is / from what properties it is derived? Also, running the QuickCheck tests you added, it doesn't give the same output (order) as nub. On 15/07/13 13:26, Jason Dagit wrote:

...

Richard Bird has a book, "Pearls of Functional Algorithm Design" that is meant to teach a form of deriving algorithms from the properties we ask of them. In this book, he gives a possible derivation of ordNub, simply called nub in the book, following the methodology he is teaching. He notes in the text that this derivation feels more complicated than it ought.

Here is his version: http://lpaste.net/87625

I just sent you a pull request to add that one and S.toList . S.fromList that was suggested in this thread. I don't think those two implementations are faster than the others but it's nice to have them for completeness.

Jason

Andreas Abel

16 Jul 16 Jul

10:24 p.m.

On 14.07.2013 13:20, Niklas Hambüchen wrote:

...

I've taken the Ord-based O(n * log n) implementation from yi using a Set:

ordNub :: (Ord a) => [a] -> [a] ordNub l = go empty l where go _ [] = [] go s (x:xs) = if x `member` s then go s xs else x : go (insert x s) xs

(The benchmark also shows some other potential problem: Using a state monad to keep the set instead of a function argument can be up to 20 times slower. Should that happen?)

I cannot say whether this should happen, but your code about can be straight-forwardly refactored using a *Reader* monad. import Control.Monad.Reader import Data.Functor ((<$>)) import qualified Data.Set as Set -- ifM still not in Control.Monad ifM mc md me = do { c <- mc; if c then md else me } ordNub :: (Ord a) => [a] -> [a] ordNub l = runReader (go l) Set.empty where go [] = return [] go (x:xs) = ifM ((x `Set.member`) <$> ask) (go xs) ((x :) <$> local (Set.insert x) (go xs)) test = ordNub [1,2,4,1,3,5,2,3,4,5,6,1] Of course, this does not lend itself to an application of filterM. In fact, your implementation is already in the (Set a ->) reader monad, in normalized form. It looks already optimal to me. Cheers, Andreas -- Andreas Abel <>< Du bist der geliebte Mensch. Theoretical Computer Science, University of Munich Oettingenstr. 67, D-80538 Munich, GERMANY andreas.abel@ifi.lmu.de http://www2.tcs.ifi.lmu.de/~abel/

Niklas Hambüchen

26 Aug 26 Aug

11:59 a.m.

On 14/07/13 20:20, Niklas Hambüchen wrote:

...

As you might not know, almost *all* practical Haskell projects use it, and that in places where an Ord instance is given, e.g. happy, Xmonad, ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 more (see https://github.com/nh2/haskell-ordnub).

GHC uses nub. Also let me stress again that the n² case happens even if there are no duplicates.

David Feuer

1 Jan 1 Jan

1:42 p.m.

I think ordNub in Data.Set, hashNub in Data.HashSet or somewhere, and bitsNub in Data.Bits (for FiniteBits things). On Jul 14, 2013 7:21 AM, "Niklas Hambüchen" wrote:

...

tldr: nub is abnormally slow, we shouldn't use it, but we do.

As you might know, Data.List.nub is O(n²). (*)

As you might not know, almost *all* practical Haskell projects use it, and that in places where an Ord instance is given, e.g. happy, Xmonad, ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 more (see https://github.com/nh2/haskell-ordnub).

I've taken the Ord-based O(n * log n) implementation from yi using a Set:

ordNub :: (Ord a) => [a] -> [a] ordNub l = go empty l where go _ [] = [] go s (x:xs) = if x `member` s then go s xs else x : go (insert x s) xs

and put benchmarks on

http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0... (compare `nub` vs `ordNub`).

`ordNub` is not only in a different complexity class, but even seems to perform better than nub for very small numbers of actually different list elements (that's the numbers before the benchmark names).

(The benchmark also shows some other potential problem: Using a state monad to keep the set instead of a function argument can be up to 20 times slower. Should that happen?)

What do you think about ordNub?

I've seen a proposal from 5 years ago about adding a *sort*Nub function started by Neil, but it just died.

(*) The mentioned complexity is for the (very common) worst case, in which the number of different elements in the list grows with the list (alias you don't have an N element list with always only 5 different things inside).

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

David Feuer

1:43 p.m.

Er.. that last might be to hard. But Data.IntSet can support its own nub version. On Jan 1, 2015 8:42 AM, "David Feuer" wrote:

...

I think ordNub in Data.Set, hashNub in Data.HashSet or somewhere, and bitsNub in Data.Bits (for FiniteBits things). On Jul 14, 2013 7:21 AM, "Niklas Hambüchen" wrote:

...
tldr: nub is abnormally slow, we shouldn't use it, but we do.

As you might know, Data.List.nub is O(n²). (*)

As you might not know, almost *all* practical Haskell projects use it, and that in places where an Ord instance is given, e.g. happy, Xmonad, ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 more (see https://github.com/nh2/haskell-ordnub).

I've taken the Ord-based O(n * log n) implementation from yi using a Set:

ordNub :: (Ord a) => [a] -> [a] ordNub l = go empty l where go _ [] = [] go s (x:xs) = if x `member` s then go s xs else x : go (insert x s) xs

and put benchmarks on

http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0... (compare `nub` vs `ordNub`).

`ordNub` is not only in a different complexity class, but even seems to perform better than nub for very small numbers of actually different list elements (that's the numbers before the benchmark names).

(The benchmark also shows some other potential problem: Using a state monad to keep the set instead of a function argument can be up to 20 times slower. Should that happen?)

What do you think about ordNub?

I've seen a proposal from 5 years ago about adding a *sort*Nub function started by Neil, but it just died.

(*) The mentioned complexity is for the (very common) worst case, in which the number of different elements in the list grows with the list (alias you don't have an N element list with always only 5 different things inside).

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

3842

Age (days ago)

4378

Last active (days ago)

List overview

Download

49 comments

23 participants

participants (23)

Andreas Abel
AntC
Anthony Cowley
Brandon Allbery
Clark Gaebel
Conrad Parker
Dan Doel
David Feuer
David Thomas
Francesco Mazzoli
Ivan Lazar Miljenovic
Jason Dagit
Joey Adams
John Lato
Ketil Malde
Kim-Ee Yeoh
Niklas Hambüchen
Richard A. O'Keefe
Roman Cheplyaka
Stijn van Drongelen
Thomas DuBuisson
Tom Ellis
Yitzchak Gale

ordNub

Roman Cheplyaka

Anthony Cowley

Stijn van Drongelen

Roman Cheplyaka

tags

participants (23)