Proposal #2629: Data.List: Replace nub; add nubOrd, nubInt, nubWith

http://hackage.haskell.org/trac/ghc/ticket/2629 Everyone always complains about nub, but nobody ever does anything about it except (map head . group . sort), which isn't lazy and isn't always faster. :-) I've implemented a new function nubWith that takes a "stop list" as an argument and filters its target list against the stop list. I've then re-implemented nub and implemented nubOrd and nubInt in terms of nubWith: the stop list is a typeclass, so these implementations are trivial and new implementations are easily added. nubBy is left alone, since there's nothing obvious to be done about it. All of the nubs are still fully lazy. Basic QuickCheck tests are provided, and pass. Performance benchmarking code is provided. The performance of my nub implementation is quite comparable to that of the standard one. My nubOrd and nubInt implementations are dramatically faster, since they use a Set and IntSet respectively for the stop list. In particular, they are performant on long lists with long nubs, unlike the basic nub. My implementation is available via git at git://svcs.cs.pdx.edu/git/nub.git or can be browsed at http://svcs.cs.pdx.edu/gitweb?p=nub.git;a=tree and has a maybe-outdated tarball at http://svcs.cs.pdx.edu/haskell/nub.tar.gz The Nub.hs file itself is attached to the proposal. If the proposal is accepted, I will prepare a patch against current GHC library top-of-tree, but for now it seems easier for everyone to just look at the bits in their current natural habitat.

On Sun, Sep 28, 2008 at 12:49 AM, Bart Massey
http://hackage.haskell.org/trac/ghc/ticket/2629
Everyone always complains about nub, but nobody ever does anything about it except (map head . group . sort), which isn't lazy and isn't always faster. :-)
I've implemented a new function nubWith that takes a "stop list" as an argument and filters its target list against the stop list. I've then re-implemented nub and implemented nubOrd and nubInt in terms of nubWith: the stop list is a typeclass, so these implementations are trivial and new implementations are easily added. nubBy is left alone, since there's nothing obvious to be done about it. All of the nubs are still fully lazy.
Basic QuickCheck tests are provided, and pass.
Performance benchmarking code is provided. The performance of my nub implementation is quite comparable to that of the standard one. My nubOrd and nubInt implementations are dramatically faster, since they use a Set and IntSet respectively for the stop list. In particular, they are performant on long lists with long nubs, unlike the basic nub.
My implementation is available via git at git://svcs.cs.pdx.edu/git/nub.git or can be browsed at http://svcs.cs.pdx.edu/gitweb?p=nub.git;a=tree and has a maybe-outdated tarball at http://svcs.cs.pdx.edu/haskell/nub.tar.gz The Nub.hs file itself is attached to the proposal. If the proposal is accepted, I will prepare a patch against current GHC library top-of-tree, but for now it seems easier for everyone to just look at the bits in their current natural habitat.
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
Hi all, This seems like a good idea but it's kind of strange to have three different exposed versions of nub. Would it be possible to hide them, hide the StopList typeclass and use {-# RULES #-} pragmas to use the faster versions when possible? Alex

I also think that it is better to use RULES and to hide at least
nubInt. For nubOrd I am not sure whether it should be exported or
hidden because it requires different type class.
Isn't the usage of multi-parameter type class with functional
dependencies and flexible instances an overkill for such a simple job?
I would write nubWith in this way:
nubWith :: (e -> s -> Maybe s) -> s -> [e] -> [e]
nubWith _ _ [] = []
nubWith f s (e : es)
case f e s of
Just s -> e : nubWith f s es
Nothing -> nubWith f s es
It is portable and perhaps more flexible. With your type class you
assume that it is efficient and possible to have separated memberS and
insertS which might not be the case. If you have only one function
then you leave the decision to the implementor.
Regards,
Krasimir
On Sun, Sep 28, 2008 at 8:05 PM, Alexander Dunlap
On Sun, Sep 28, 2008 at 12:49 AM, Bart Massey
wrote: http://hackage.haskell.org/trac/ghc/ticket/2629
Everyone always complains about nub, but nobody ever does anything about it except (map head . group . sort), which isn't lazy and isn't always faster. :-)
I've implemented a new function nubWith that takes a "stop list" as an argument and filters its target list against the stop list. I've then re-implemented nub and implemented nubOrd and nubInt in terms of nubWith: the stop list is a typeclass, so these implementations are trivial and new implementations are easily added. nubBy is left alone, since there's nothing obvious to be done about it. All of the nubs are still fully lazy.
Basic QuickCheck tests are provided, and pass.
Performance benchmarking code is provided. The performance of my nub implementation is quite comparable to that of the standard one. My nubOrd and nubInt implementations are dramatically faster, since they use a Set and IntSet respectively for the stop list. In particular, they are performant on long lists with long nubs, unlike the basic nub.
My implementation is available via git at git://svcs.cs.pdx.edu/git/nub.git or can be browsed at http://svcs.cs.pdx.edu/gitweb?p=nub.git;a=tree and has a maybe-outdated tarball at http://svcs.cs.pdx.edu/haskell/nub.tar.gz The Nub.hs file itself is attached to the proposal. If the proposal is accepted, I will prepare a patch against current GHC library top-of-tree, but for now it seems easier for everyone to just look at the bits in their current natural habitat.
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
Hi all,
This seems like a good idea but it's kind of strange to have three different exposed versions of nub. Would it be possible to hide them, hide the StopList typeclass and use {-# RULES #-} pragmas to use the faster versions when possible?
Alex _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

At Sun, 28 Sep 2008 11:05:42 -0700, Alexander Dunlap wrote:
This seems like a good idea but it's kind of strange to have three different exposed versions of nub. Would it be possible to hide them, hide the StopList typeclass and use {-# RULES #-} pragmas to use the faster versions when possible?
The RULES pragma is GHC-specific. Would it make sense to export the faster versions so that people targeting alternative compilers could use them? Additionally, I think RULES only take effect if you are compiling with optimizations, but not if you just load the files into GHCi. I have written programs which run fine with nubOrd but are unacceptable with nubEq. So, I would certainly want the option to force the use of nubOrd at all times, rather than rely on the compiler to sometimes make the required optimization for me. j.

Jeremy Shaw wrote:
Additionally, I think RULES only take effect if you are compiling with optimizations, but not if you just load the files into GHCi. I have written programs which run fine with nubOrd but are unacceptable with nubEq. So, I would certainly want the option to force the use of nubOrd at all times, rather than rely on the compiler to sometimes make the required optimization for me.
indeed! Could we provide a function named "nubEq" too, so that you can be explicit about only requiring (Eq) when needed, if you want to get rid of all the "deprecated" (word-ambiguous and aesthetically too enticingly short) uses of "nub". -Isaac

Alexander Dunlap wrote:
This seems like a good idea but it's kind of strange to have three different exposed versions of nub. Would it be possible to hide them, hide the StopList typeclass and use {-# RULES #-} pragmas to use the faster versions when possible?
I don't think that using RULES pragmas is a good solution to the problem. Instead, there's an old-fashioned way to use the name nub for all cases: make it a member of the Eq typeclass! class Eq a where (==) :: a -> a -> Bool (/=) :: a -> a -> Bool nub :: [a] -> [a] nub = -- default definition Now, nub can be specialized at will. Changing the default definition of nub to use Set if we have an Ord class is tricky, though. Basically, we would need this proposal: http://www.haskell.org/haskellwiki/Class_system_extension_proposal#Allowing_... Does it have a formal definition yet and guarantees that it works? In any case, consider nub to be another example for the usefulness of this proposal. Regards, apfelmus

On Mon, Sep 29, 2008 at 9:40 AM, apfelmus
Alexander Dunlap wrote:
This seems like a good idea but it's kind of strange to have three different exposed versions of nub. Would it be possible to hide them, hide the StopList typeclass and use {-# RULES #-} pragmas to use the faster versions when possible?
I don't think that using RULES pragmas is a good solution to the problem.
Why not? I thought that was the major purpose of RULES - to implement transformations that don't affect semantics. It seems silly to clutter up classes with extra functions just for efficiency or to have to change programs every time the types change. The fact that RULES don't work in GHCi is admittedly a downside; is there any plan to change that? Alex

On 2008 Sep 29, at 18:55, Alexander Dunlap wrote:
On Mon, Sep 29, 2008 at 9:40 AM, apfelmus
wrote: Alexander Dunlap wrote:
This seems like a good idea but it's kind of strange to have three different exposed versions of nub. Would it be possible to hide them, hide the StopList typeclass and use {-# RULES #-} pragmas to use the faster versions when possible?
I don't think that using RULES pragmas is a good solution to the problem.
Why not? I thought that was the major purpose of RULES - to implement transformations that don't affect semantics. It seems silly to clutter
So what of yhc, nhc98, jhc? ghc is not the be-all and end-all of Haskell. -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

On Mon, Sep 29, 2008 at 4:06 PM, Brandon S. Allbery KF8NH
On 2008 Sep 29, at 18:55, Alexander Dunlap wrote:
On Mon, Sep 29, 2008 at 9:40 AM, apfelmus
wrote: Alexander Dunlap wrote:
This seems like a good idea but it's kind of strange to have three different exposed versions of nub. Would it be possible to hide them, hide the StopList typeclass and use {-# RULES #-} pragmas to use the faster versions when possible?
I don't think that using RULES pragmas is a good solution to the problem.
Why not? I thought that was the major purpose of RULES - to implement transformations that don't affect semantics. It seems silly to clutter
So what of yhc, nhc98, jhc? ghc is not the be-all and end-all of Haskell.
Well, yes, but presumably different compilers could optimize in different ways. (Perhaps they could use some class behind the scenes, or their own optimizing mechanism?) My point is that we shouldn't have multiple exposed functions for _exactly_ the same semantic operation. Alex

On 2008 Sep 29, at 19:40, Alexander Dunlap wrote:
On Mon, Sep 29, 2008 at 4:06 PM, Brandon S. Allbery KF8NH
wrote: On 2008 Sep 29, at 18:55, Alexander Dunlap wrote:
On Mon, Sep 29, 2008 at 9:40 AM, apfelmus
wrote: Alexander Dunlap wrote:
This seems like a good idea but it's kind of strange to have three different exposed versions of nub. Would it be possible to hide them, hide the StopList typeclass and use {-# RULES #-} pragmas to use the faster versions when possible?
I don't think that using RULES pragmas is a good solution to the problem.
Why not? I thought that was the major purpose of RULES - to implement transformations that don't affect semantics. It seems silly to clutter
So what of yhc, nhc98, jhc? ghc is not the be-all and end-all of Haskell.
Well, yes, but presumably different compilers could optimize in different ways. (Perhaps they could use some class behind the scenes, or their own optimizing mechanism?) My point is that we shouldn't have multiple exposed functions for _exactly_ the same semantic operation.
Tell it to the Prelude. (^), (^^), (**) anyone? -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

On Mon, 29 Sep 2008, Brandon S. Allbery KF8NH wrote:
On 2008 Sep 29, at 19:40, Alexander Dunlap wrote:
Well, yes, but presumably different compilers could optimize in different ways. (Perhaps they could use some class behind the scenes, or their own optimizing mechanism?) My point is that we shouldn't have multiple exposed functions for _exactly_ the same semantic operation.
Tell it to the Prelude. (^), (^^), (**) anyone?
I don't see the relation between the different kind of powers and the different implementations of 'nub'. As far as I understand 'nubInt' is just an optimization of the generic 'nub'. In contrast to that, the three power operators denote really different operations: http://www.haskell.org/haskellwiki/Power_function I think it is reasonable to drop in an optimized 'nub' for certain types by an optimizer rule, just like it is legitimate to drop in a special GSL function for (GSLMatrix.map sin), or a determinant algorithm for Double matrices which uses division, whereas the generic determinant computation does not need divisions.

Alexander Dunlap wrote:
apfelmus wrote:
Alexander Dunlap wrote:
This seems like a good idea but it's kind of strange to have three different exposed versions of nub. Would it be possible to hide them, hide the StopList typeclass and use {-# RULES #-} pragmas to use the faster versions when possible?
I don't think that using RULES pragmas is a good solution to the problem.
Why not? I thought that was the major purpose of RULES - to implement transformations that don't affect semantics. It seems silly to clutter up classes with extra functions just for efficiency or to have to change programs every time the types change. The fact that RULES don't work in GHCi is admittedly a downside; is there any plan to change that?
Why not type classes? :) Their major purpose is to overload a value at different types, like nub :: Eq a => [a] -> [a] nub :: Ord a => [a] -> [a] nub :: [Int] -> [Int] and that's exactly what we're trying to do. Adding nub with a default definition is a two-line change to the definition of the Eq class and it's completely backwards compatible, no existing program needs to be changed. But as said, Haskell98 is currently unable to specialize nub for a general Ord context, we would need some extension about superclasses for that. But this extension is desirable for Functor, Monad and Applicative anyway. The argument that the semantics of the different nubs are all the same does have merits, but we're already putting functions into the class for efficiency reasons, like (/=) and (==) (>),(>=) and compare (/) and recip tan and sin,cos (>>) and (>>=) etc. Regards, apfelmus

Bart Massey
http://hackage.haskell.org/trac/ghc/ticket/2629
Everyone always complains about nub, but nobody ever does anything about it except (map head . group . sort), which isn't lazy and isn't always faster.
Just thought I'd post a note to thank you all for the informative and interesting discussion of my proposal so far. Thanks much to all who are helping to figure out what we should do! It looks to me like there's consensus on at least one point so far. If someone is really not agreeing with this, please post. (1) The StopList class is too "heavyweight" to naturally live in Data.List and should not be exported. Given this, the obvious choices are (a) to drop nubWith from the interface altogether, or (b) to replace it with a version that takes a stoplist function and value as an argument, as suggested by Krasimir Angelov earlier: nubWith :: (a -> b -> Maybe b) -> b -> [a] -> [a] I've implemented and benchmarked (b), and it seems to perform fine. You can get it from git://svcs.cs.pdx.edu/git/nub.git in the branch nubWith-function if you want to look for yourself. (Another possibility which hasn't been discussed is (c1) to include a new module with StopList and some related functions in it, and make Data.List depend on it, or (c2) to throw StopLists into some other existing base module such as Data.Function. Why Data.Function? Because the other thing I want them for is a later proposal to add transitiveClosure and reflexiveTransitiveClosure of functions to the libraries, and this is the natural place to put those.) Areas that still need discussion include: (2) It's probably reasonable to drop nubInt from the interface. It has roughly the same complexity as nubOrd, and thus should just be an optimization rule, since we want to avoid nubX for an arbitrarily long list X of typeclasses. (3) Since nubOrd has a dramatically different asymptotic time complexity than nub, enough so that many programs that need the former effectively won't work with the latter, we should expose nubOrd separately and not rely on optimizer magic to pick it. (At least I *hope* there's some consensus here, since I strongly agree with those who feel this way :-). Time and space usage are, in practical terms, that part of the semantics that distinguish a programming language from a mathematical notation. I'm a programmer.) Note, though, that one can easily imagine a nubAscii that is linear-time rather than the n log n time of nubOrd / nubInt, using a small bit vector to track the Chars. Certainly nubBool has this property. Hmm. What's our criterion for when the performance difference between two functions constitutes a practical semantic difference? I'm claiming it's asymptotic complexity class, in which case (a) we should probably figure out how to expose some kind of nubFinite. Or we could just take the position that in the vast universe of possible functions, our library cannot provide them all. In which case (1b) nubWith is back in. I guess I tend toward (1b). (4) Should (a) the StopList class and friends also be banned from the *implementation*? Or is it (b) OK to use them internally in the source code of Data.List? I'm actually more inclined toward (a), ironically. It makes for simpler and more portable code, and since we're not going to give the user the StopLists, we can avoid being so clever. (5) Isaac Dupree proposes nubEq as a synonym for nub, for use in cleaning up code. Is this a good idea? In particular, would we deprecate nub at this point? I'll roll a new proposal draft soon based on current and near future feedback. Again, thanks to all. Bart Massey bart <at> cs.pdx.edu [Is there really an address miner anywhere in the world that's still really fooled by the above?]

On Tuesday 30 September 2008 1:46:23 am Bart Massey wrote:
Just thought I'd post a note to thank you all for the informative and interesting discussion of my proposal so far. Thanks much to all who are helping to figure out what we should do!
One additional issue: Data.List is in the base package. Data.Set is in containers. Data.Set also imports Data.List, so using sets in the latter creates a circular dependency. The latter isn't as big a deal, but circular package dependences sounds like a problem to me (as I imagine containers depends on base). Something else to discuss. :) -- Dan P.S.: On point 4, I'd probably lean away from StopList as well. Requiring fundeps seems undesirable for a rather core, H98-esque module.

Dan Doel
Data.List is in the base package. Data.Set is in containers.
Data.Set also imports Data.List, so using sets in the latter creates a circular dependency.
Oh. Ow. My bad for not doing a proper patch in the first place, which would have exposed the problem before it got this far. My apologies. Thanks much for the catch. You said that the circular module dependency is "not as big a deal"? I'm not getting why...it seems pretty serious to me. Is there some refactoring that makes it go away without doing huge violence to the existing structure? I mean, we could split the Data.List module up into pieces to break the dependency, but that doesn't seem like it's going to happen... I'd completely forgotten about the recent package split of containers and base. No idea what to do there. I suppose at worst we could have a Data.Nubs module, but that just seems silly, as all it's going to export at this point is nubOrd and (StopList-less) nubWith. Besides, what package do we put it in? Do we now have container, base, and nubs?? :-) :-) (In general, it seems like this sort of thing is going to become a problem as the libraries grow. I note, for example, that Data.Function is in base, which makes putting any reasonable implementation of transitiveClosure in there pretty hard for this same reason.) Any suggestions from anyone about how to proceed? Or are we just done at this point? So close... Bart Massey bart <at> cs.pdx.edu

On Tuesday 30 September 2008 12:24:29 pm Bart Massey wrote:
You said that the circular module dependency is "not as big a deal"? I'm not getting why...it seems pretty serious to me. Is there some refactoring that makes it go away without doing huge violence to the existing structure? I mean, we could split the Data.List module up into pieces to break the dependency, but that doesn't seem like it's going to happen...
Well, in GHC, at least, there are ways to handle mutually recursive modules. I've never encountered the need myself, so I don't know the specifics, but it involves making an hs-boot file or something of the sort. Of course, that only works if they're in the same package, I suspect, so maybe it is a big problem. :) I don't know how any other compilers handle such things.
I'd completely forgotten about the recent package split of containers and base. No idea what to do there. I suppose at worst we could have a Data.Nubs module, but that just seems silly, as all it's going to export at this point is nubOrd and (StopList-less) nubWith. Besides, what package do we put it in? Do we now have container, base, and nubs?? :-) :-)
(In general, it seems like this sort of thing is going to become a problem as the libraries grow. I note, for example, that Data.Function is in base, which makes putting any reasonable implementation of transitiveClosure in there pretty hard for this same reason.)
Any suggestions from anyone about how to proceed? Or are we just done at this point? So close...
Lists are a container, so I guess it'd be nice if Data.List could be moved to containers, but I think the prelude imports and uses stuff from it, and you're not going to move that. nubOrd could be stuck in Data.Set, but that's kind of an odd place for it. Other than that, I'm stumped. -- Dan

On Tue, 30 Sep 2008, Dan Doel wrote:
Lists are a container, so I guess it'd be nice if Data.List could be moved to containers, but I think the prelude imports and uses stuff from it, and you're not going to move that.
nubOrd could be stuck in Data.Set, but that's kind of an odd place for it.
Data.List could import list functions from Prelude and re-export them.

Bart Massey
Any suggestions from anyone about how to proceed? Or are we just done at this point? So close...
OK, I've got a new version of nubOrd together. This one is just nubOrd, no nubWith, no nubInt, nothing fancy. There's two variants, nubOrd' and nubOrd''. The latter gets a slight performance win by omitting some work that would preserve the order of output with respect to nub. The basic strategy of nubOrd' is to pull off a prefix l of the input list as long as the current stoplist s, nubSort l, merge l with s to get a new stoplist s', use s in a merge step to filter l, then sort the remainder of l back to the original order. Finally, the resulting l' and s' are prepended to a recursive call on the rest of the list. nubOrd' has roughly the same time complexity as the set-based nubOrd did. This is because each step of nubOrd requires log m time to search the stoplist, and each collection of steps in nubOrd' requires amortized log m time for the sorts. Let's do some side by side comparison: nub nubSort nubOrd nubOrd' nubOrd'' laziness full no full spine spine worst-case O(mn) O(n log n) O(n log m) O(n log m) O(n log m) complexity (m symbols, length n) relative 1x 50x 1.1x 5x 5x runtime with m = 1 absolute 84s 0.14s 0.16s 0.35s 0.21s runtime with m = n = 10^5 dependencies no no Data.Set no no outside (containers) Data.List outputs in yes no yes yes no order of first mention code small tiny smallish wtf wtf-lite complexity This table suggests to me that nubOrd' is viable, assuming spine-laziness is sufficient. It solves the time problems of nub and nubSort, solves the dependency problem of nubOrd, and preserves the order of its outputs. I don't like the 5x slowdown for small m, or the non-laziness on elements, but I guess I am willing to take them to get the other stuff. Maybe someone or the compiler can do some optimization later. The code for nubOrd', in all its hideous glory, is at the end of this post. Comments, corrections and improvements welcome. Bart Massey bart <at> cs.pdx.edu merge :: Ord e => [e] -> [e] -> [e] merge l1 [] = l1 merge [] l2 = l2 merge l1@(e1 : e1s) l2@(e2 : e2s) | e1 < e2 = e1 : merge e1s l2 | e1 > e2 = e2 : merge l1 e2s | otherwise = merge l1 e2s nubOrd' :: Ord e => [e] -> [e] nubOrd' [] = [] nubOrd' (e : es) = e : go [e] es where go _ [] = [] go s l = l1' ++ go s' l2 where (l1, l2) = splitAt (length s) l curl = filterDups $ sort $ zip l1 ([1..] :: [Integer]) s' = merge s (map fst curl) l1' = map fst $ sortBy flipcmp $ stopMerge s curl (a, b) `flipcmp` (c, d) = (b, a) `compare` (d, c) filterDups [] = [] filterDups (s1@(e1, _) : (e2, _) : ss) | e1 == e2 = filterDups (s1 : ss) filterDups (s1 : ss) = s1 : filterDups ss stopMerge _ [] = [] stopMerge [] m2 = m2 stopMerge m1@(e1 : e1s) m2@(s2@(e2, _) : s2s) | e1 < e2 = stopMerge e1s m2 | e1 > e2 = s2 : stopMerge m1 s2s | otherwise = stopMerge m1 s2s

I need to know what the community wants me to do to close out my proposal to add nubOrd to the standard libraries. After ruling out a lot of marginal choices, I guess I see three leading alternatives, all of which have negatives. I'd love to have some feedback on these so I can start thinking about other things. 1) Stick nubOrd' from my previous message into Data.List and call it a day. Advantages: Does no violence to the current library structure. Provides a nubOrd that has O(n log m) asymptotic performance, and performs reasonably in practice with large m. Disadvantages: In the worst case, 5x performance is left on the floor. Not particularly lazy: will work with infinite lists, but not with lists terminating in bottom; will only randomly work with lists containing bottom elements. Quite ugly implementation. No nubWith. My score: 2/5 2) Stick (non-StopList) nubWith in Data.List. Stick nubOrd in Data.Set, implemented using nubWith. Advantages: Provides a highly efficient, fully lazy nubOrd. Provides nubWith. Reasonable implementation. Disadvantages: Sticking nubOrd in Data.Set is weird. My score: 4/5 3) Leave well enough alone. Advantages: Full-on inherency. Leaves no weird mess for future librarians. Disadvantages: No nubOrd means that folks keep writing nubSort and/or using nub in situations where it might fall over from poor performance. Everyone having their own nub* implementations is a maintenance problem. No nubWith means that it's more work to implement one's own nub*, making this problem worse. My score: 1/5 Comments gratefully received. Based on them, I'll supersede Proposal #2629 with the appropriate replacement Proposal. My patch for (2) works fine with GHC 6.8.3. I'm working on compiling for current top-of-tree right now; should have it shortly. Thanks. Bart Massey bart <at> cs.pdx.edu

Bart Massey wrote:
2) Stick (non-StopList) nubWith in Data.List. Stick nubOrd in Data.Set, implemented using nubWith.
Advantages: Provides a highly efficient, fully lazy nubOrd. Provides nubWith. Reasonable implementation.
Disadvantages: Sticking nubOrd in Data.Set is weird.
My score: 4/5
indeed a weird location, but only an import-line needs to change if someone invents a super-duper nubOrd that's even faster but depends on more libraries. Would it be better to stick it in a new module e.g. Data.Set.Nub? -Isaac

Isaac Dupree
Bart Massey wrote:
2) Stick (non-StopList) nubWith in Data.List.
Disadvantages: Sticking nubOrd in Data.Set is weird.
indeed a weird location. Would it be better to stick it in a new module e.g. Data.Set.Nub?
Another possibility I considered is to stick nubOrd in Data.List.Nub. As far as I know, since containers depends on base and Data.Set depends on Data.List, there's nothing preventing us from putting Data.List.Nub in containers even though Data.List is in base. Does someone know otherwise?
only an import-line needs to change if someone invents a super-duper nubOrd that's even faster but depends on more libraries.
I believe that, given sufficient effort, I could prove that the nubOrd as given is optimal in asymptotic complexity. The constant factors seem really good and entirely dependent on those of the Set implementation. So I'm not so worried about a faster nubOrd. :-) Leaving nubOrd in Data.Set does make a weird kind of sense; we put the various nub* in whatever module their underlying stop list representation is in. Thus nubIx might go in Data.Array. Also, it's less of a change to the existing structure. If we made a Data.List.Nub, presumably as more nub* implementations were added, they would all pile in there. One problem with this plan is that if someone decides to write a nub* using a stop list from a module in a different package that depends on containers, we're going to end up with the same circular dependency problem we have now. However, getting back to your original suggestion, I think it's also a little weird to potentially have a bunch of modules with a special submodule containing one function, so I guess I'd probably rather just stick it in Data.Set and be done with it. What do others think?

On Sat, Oct 4, 2008 at 3:26 AM, Bart Massey
I need to know what the community wants me to do to close out my proposal to add nubOrd to the standard libraries. After ruling out a lot of marginal choices, I guess I see three leading alternatives, all of which have negatives. I'd love to have some feedback on these so I can start thinking about other things.
I realize I'm coming into this discussion late, but has anyone
surveyed existing Haskell code to see how often nub is used? How many
of those cases require the specific functionality of nub, as opposed
to something like "Set.toList . Set.fromList"? I may be misjudging
things, but I can't imagine it's often enough to justify three or more
implementations in the standard library.
If there are really a lot of cases where people need a collection that
(1) has no duplicates, and (2) preserves an arbitrary order of
elements, maybe we'd be better off designing a data structure
specifically for that.
--
Dave Menendez

On 2008 Oct 6, at 0:07, David Menendez wrote:
On Sat, Oct 4, 2008 at 3:26 AM, Bart Massey
wrote: I need to know what the community wants me to do to close out my proposal to add nubOrd to the standard libraries. After ruling out a lot of marginal choices, I guess I see three leading alternatives, all of which have negatives. I'd love to have some feedback on these so I can start thinking about other things.
I realize I'm coming into this discussion late, but has anyone surveyed existing Haskell code to see how often nub is used? How many
It's not so much that nub is used often. it's that it's often *reimplemented*.
If there are really a lot of cases where people need a collection that (1) has no duplicates, and (2) preserves an arbitrary order of elements, maybe we'd be better off designing a data structure specifically for that.
Feel free to contribute. Make sure it supports the whole spectrum of list operations, including list comprehensions. -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

On Mon, Oct 6, 2008 at 12:13 AM, Brandon S. Allbery KF8NH
On 2008 Oct 6, at 0:07, David Menendez wrote:
On Sat, Oct 4, 2008 at 3:26 AM, Bart Massey
wrote: I need to know what the community wants me to do to close out my proposal to add nubOrd to the standard libraries. After ruling out a lot of marginal choices, I guess I see three leading alternatives, all of which have negatives. I'd love to have some feedback on these so I can start thinking about other things.
I realize I'm coming into this discussion late, but has anyone surveyed existing Haskell code to see how often nub is used? How many
It's not so much that nub is used often. it's that it's often *reimplemented*.
By people who *need* a faster nub, or by people who figured they could make nub faster by changing the type? How often does someone need the exact behavior of nub, instead of something faster like "Set.toList . Set.fromList"? All this effort to optimize a seemingly obscure function feels like overkill. We might be better off just adding a note to the documentation of nub suggesting that people use Data.Set instead. (We can't get rid of it, because it's in the H98 Report.) In any case, my only objection is to complicating the standard libraries. If someone wants to post a faster nub to Hackage, that's fine.
If there are really a lot of cases where people need a collection that (1) has no duplicates, and (2) preserves an arbitrary order of elements, maybe we'd be better off designing a data structure specifically for that.
Feel free to contribute. Make sure it supports the whole spectrum of list operations, including list comprehensions.
Demonstrate the need, and I'd be happy to contribute.
--
Dave Menendez

On Sun, Oct 5, 2008 at 11:07 PM, David Menendez
On Sat, Oct 4, 2008 at 3:26 AM, Bart Massey
wrote: I need to know what the community wants me to do to close out my proposal to add nubOrd to the standard libraries. After ruling out a lot of marginal choices, I guess I see three leading alternatives, all of which have negatives. I'd love to have some feedback on these so I can start thinking about other things.
I realize I'm coming into this discussion late, but has anyone surveyed existing Haskell code to see how often nub is used? How many of those cases require the specific functionality of nub, as opposed to something like "Set.toList . Set.fromList"? I may be misjudging things, but I can't imagine it's often enough to justify three or more implementations in the standard library.
If there are really a lot of cases where people need a collection that (1) has no duplicates, and (2) preserves an arbitrary order of elements, maybe we'd be better off designing a data structure specifically for that.
-- Dave Menendez
nub . sort (and the reverse) are very frequent. The attached sort.txt was produced by find bin/ -name "*.hs" -exec sh -c "grep sort {} | grep nub && echo {}" \; > sort.txt It's very crude and obviously produces lots of false positives & negatives, but even so, there's at least >50 uses of nub . sort etc. -- gwern

Bart Massey wrote:
2) Stick (non-StopList) nubWith in Data.List. Stick nubOrd in Data.Set, implemented using nubWith.
Advantages: Provides a highly efficient, fully lazy nubOrd. Provides nubWith. Reasonable implementation.
Disadvantages: Sticking nubOrd in Data.Set is weird.
My score: 4/5
(2) seems reasonable to me, as long as we improve the documentation for nub to point out the quadratic complexity and direct users who want a faster version to Data.Set.nubOrd. Cheers, Simon

The StopList class [...]
BTW, that is a horrible name for what is simply a *set* (look at the methods you wrote: empty, insert, member) The design problem here is that you don't want to use Data.Set because currently it is an implementation, not an interface. J.W.

Johannes Waldmann
The StopList class [...]
BTW, that is a horrible name for what is simply a *set* (look at the methods you wrote: empty, insert, member).
I originally called the typeclass Set, but it isn't, so I changed it. It omits some Set functionality. One couldn't write an implementation of union or intersect with it, because there's no way to enumerate its elements. In any case, it's moot, because we're ditching StopLists. :-)
The design problem here is that you don't want to use Data.Set because currently it is an implementation, not an interface.
If by that you mean that it would be nice if Set was a typeclass, and there was a [] implementation as well as the current OrdSet implementation, I tend to agree. C'est la vie.

On Tue, Sep 30, 2008 at 1:46 AM, Bart Massey wrote:
Note, though, that one can easily imagine a nubAscii that is linear-time rather than the n log n time of nubOrd / nubInt, using a small bit vector to track the Chars. Certainly nubBool has this property. Hmm. What's our criterion for when the performance difference between two functions constitutes a practical semantic difference? I'm claiming it's asymptotic complexity class, in which case (a) we should probably figure out how to expose some kind of nubFinite. Or we could just take the position that in the vast universe of possible functions, our library cannot provide them all. In which case (1b) nubWith is back in. I guess I tend toward (1b).
Actually, the asymptotic complexity (measured in terms of list length) of nubOrd is identical to the asymptotic complexity of nubAscii, nubBool or nubFinite. They differ by a constant factor of the log(# possible data values). David

On Tue, Sep 30, 2008 at 11:58:07AM -0400, David Roundy wrote:
On Tue, Sep 30, 2008 at 1:46 AM, Bart Massey wrote:
Note, though, that one can easily imagine a nubAscii that is linear-time rather than the n log n time of nubOrd / nubInt, using a small bit vector to track the Chars. Certainly nubBool has this property. Hmm. What's our criterion for when the performance difference between two functions constitutes a practical semantic difference? I'm claiming it's asymptotic complexity class, in which case (a) we should probably figure out how to expose some kind of nubFinite. Or we could just take the position that in the vast universe of possible functions, our library cannot provide them all. In which case (1b) nubWith is back in. I guess I tend toward (1b).
Actually, the asymptotic complexity (measured in terms of list length) of nubOrd is identical to the asymptotic complexity of nubAscii, nubBool or nubFinite. They differ by a constant factor of the log(# possible data values).
Incidentally, for data types with finite value ranges, nub itself isn't bad, as it'd just O(n * # of values present). So, for instance, nub on a list of Bool is likely to be faster than nubOrd on a list of bools, and it's certainly O(n)... I just ran a timing, and nub takes about as long as length does. David

David Roundy wrote:
Actually, the asymptotic complexity (measured in terms of list length) of nubOrd is identical to the asymptotic complexity of nubAscii, nubBool or nubFinite. They differ by a constant factor of the log(# possible data values).
except for infinite possible data values, such as Integer or String -- then they don't differ by a constant factor of infinity! (And the constant factor on, say, Int64 and higher could be a lot larger than some implementations might be able to make it, I'm not sure) -Isaac

I think that the easiest way to think about the asymptotic complexity of the nub* functions is in terms of the length n of the input list and the number of unique elements m in that list, as I did in a previous posting. My original point in this thread, which I stated badly, is this: Since array access is constant time, one can imagine an implementation nubIx that has O(n) worst-case time on any Ix datatype, regardless of m. This contrasts with the O(mn) complexity of nub and the O(n log m) complexity of nubOrd on values of type Ix. It is true that for any fixed m any nub* is O(n), but IMHO this fact is a bit misleading; we probably want to think about the asymptotic complexity in m as well as n to get a clear picture of what's going on. Bart Massey bart <at> cs.pdx.edu
participants (15)
-
Alexander Dunlap
-
apfelmus
-
Bart Massey
-
Brandon S. Allbery KF8NH
-
Dan Doel
-
David Menendez
-
David Roundy
-
David Roundy
-
Gwern Branwen
-
Henning Thielemann
-
Isaac Dupree
-
Jeremy Shaw
-
Johannes Waldmann
-
Krasimir Angelov
-
Simon Marlow