Re: Proposal #2629: Data.List: Replace nub; add nubOrd, nubInt, nubWith

30 Sep 2008

      Bart Massey  writes:
...
http://hackage.haskell.org/trac/ghc/ticket/2629
Everyone always complains about nub, but nobody ever does
anything about it except (map head . group . sort), which
isn't lazy and isn't always faster.
Just thought I'd post a note to thank you all for the informative and
interesting discussion of my proposal so far.  Thanks much to all who are
helping to figure out what we should do!

It looks to me like there's consensus on at least one point so far.  If someone
is really not agreeing with this, please post.

(1) The StopList class is too "heavyweight" to naturally
    live in Data.List and should not be exported.  Given
    this, the obvious choices are (a) to drop nubWith from
    the interface altogether, or (b) to replace it with a
    version that takes a stoplist function and value as an
    argument, as suggested by Krasimir Angelov earlier:

	nubWith :: (a -> b -> Maybe b) -> b -> [a] -> [a]

    I've implemented and benchmarked (b), and it seems to
    perform fine.  You can get it from
      git://svcs.cs.pdx.edu/git/nub.git
    in the branch nubWith-function if you want to look for
    yourself.

    (Another possibility which hasn't been discussed is (c1)
    to include a new module with StopList and some related
    functions in it, and make Data.List depend on it, or
    (c2) to throw StopLists into some other existing base
    module such as Data.Function.  Why Data.Function?
    Because the other thing I want them for is a later
    proposal to add transitiveClosure and
    reflexiveTransitiveClosure of functions to the
    libraries, and this is the natural place to put those.)

Areas that still need discussion include:

(2) It's probably reasonable to drop nubInt from the
    interface.  It has roughly the same complexity as
    nubOrd, and thus should just be an optimization rule,
    since we want to avoid nubX for an arbitrarily long list
    X of typeclasses.  

(3) Since nubOrd has a dramatically different asymptotic
    time complexity than nub, enough so that many programs
    that need the former effectively won't work with the
    latter, we should expose nubOrd separately and not rely
    on optimizer magic to pick it.  (At least I *hope*
    there's some consensus here, since I strongly agree with
    those who feel this way :-).  Time and space usage are,
    in practical terms, that part of the semantics that
    distinguish a programming language from a mathematical
    notation.  I'm a programmer.)

    Note, though, that one can easily imagine a nubAscii
    that is linear-time rather than the n log n time of
    nubOrd / nubInt, using a small bit vector to track the
    Chars.  Certainly nubBool has this property.  Hmm.
    What's our criterion for when the performance difference
    between two functions constitutes a practical semantic
    difference?  I'm claiming it's asymptotic complexity
    class, in which case (a) we should probably figure out how
    to expose some kind of nubFinite.  Or we could just take
    the position that in the vast universe of possible
    functions, our library cannot provide them all.  In
    which case (1b) nubWith is back in.  I guess I tend
    toward (1b).

(4) Should (a) the StopList class and friends also be banned
    from the *implementation*?  Or is it (b) OK to use them
    internally in the source code of Data.List?  I'm
    actually more inclined toward (a), ironically.  It makes
    for simpler and more portable code, and since we're not
    going to give the user the StopLists, we can avoid being
    so clever.

(5) Isaac Dupree proposes nubEq as a synonym for nub, for
    use in cleaning up code.  Is this a good idea?  In
    particular, would we deprecate nub at this point?

I'll roll a new proposal draft soon based on current and near future feedback. 
Again, thanks to all.

    Bart Massey
    bart <at> cs.pdx.edu
    [Is there really an address miner anywhere in the world
    that's still really fooled by the above?]