
Bart Massey
http://hackage.haskell.org/trac/ghc/ticket/2629
Everyone always complains about nub, but nobody ever does anything about it except (map head . group . sort), which isn't lazy and isn't always faster.
Just thought I'd post a note to thank you all for the informative and interesting discussion of my proposal so far. Thanks much to all who are helping to figure out what we should do! It looks to me like there's consensus on at least one point so far. If someone is really not agreeing with this, please post. (1) The StopList class is too "heavyweight" to naturally live in Data.List and should not be exported. Given this, the obvious choices are (a) to drop nubWith from the interface altogether, or (b) to replace it with a version that takes a stoplist function and value as an argument, as suggested by Krasimir Angelov earlier: nubWith :: (a -> b -> Maybe b) -> b -> [a] -> [a] I've implemented and benchmarked (b), and it seems to perform fine. You can get it from git://svcs.cs.pdx.edu/git/nub.git in the branch nubWith-function if you want to look for yourself. (Another possibility which hasn't been discussed is (c1) to include a new module with StopList and some related functions in it, and make Data.List depend on it, or (c2) to throw StopLists into some other existing base module such as Data.Function. Why Data.Function? Because the other thing I want them for is a later proposal to add transitiveClosure and reflexiveTransitiveClosure of functions to the libraries, and this is the natural place to put those.) Areas that still need discussion include: (2) It's probably reasonable to drop nubInt from the interface. It has roughly the same complexity as nubOrd, and thus should just be an optimization rule, since we want to avoid nubX for an arbitrarily long list X of typeclasses. (3) Since nubOrd has a dramatically different asymptotic time complexity than nub, enough so that many programs that need the former effectively won't work with the latter, we should expose nubOrd separately and not rely on optimizer magic to pick it. (At least I *hope* there's some consensus here, since I strongly agree with those who feel this way :-). Time and space usage are, in practical terms, that part of the semantics that distinguish a programming language from a mathematical notation. I'm a programmer.) Note, though, that one can easily imagine a nubAscii that is linear-time rather than the n log n time of nubOrd / nubInt, using a small bit vector to track the Chars. Certainly nubBool has this property. Hmm. What's our criterion for when the performance difference between two functions constitutes a practical semantic difference? I'm claiming it's asymptotic complexity class, in which case (a) we should probably figure out how to expose some kind of nubFinite. Or we could just take the position that in the vast universe of possible functions, our library cannot provide them all. In which case (1b) nubWith is back in. I guess I tend toward (1b). (4) Should (a) the StopList class and friends also be banned from the *implementation*? Or is it (b) OK to use them internally in the source code of Data.List? I'm actually more inclined toward (a), ironically. It makes for simpler and more portable code, and since we're not going to give the user the StopLists, we can avoid being so clever. (5) Isaac Dupree proposes nubEq as a synonym for nub, for use in cleaning up code. Is this a good idea? In particular, would we deprecate nub at this point? I'll roll a new proposal draft soon based on current and near future feedback. Again, thanks to all. Bart Massey bart <at> cs.pdx.edu [Is there really an address miner anywhere in the world that's still really fooled by the above?]