nubOrd (Re: Performance horrors)

6 Sep 2008

      IMHO the right way to think about nub is that it filters the
given list with a "stop list".  This code captures that
intuition, and allows new kinds of nubs to be quickly
constructed.  Don't know if anyone wants it, but I thought
I'd show it off, anyhow.

    nubWrt :: (StopList e s) => s -> [e] -> [e]
    nubWrt s [] = []
    nubWrt s (e : es) | memberS e s = nubWrt s es
    nubWrt s (e : es) = e : nubWrt (insertS e s) es

    nub :: (Eq e) => [e] -> [e]
    nub = nubWrt []

    nubOrd :: (Ord e) => [e] -> [e]
    nubOrd = nubWrt S.empty

    nubInt :: [Int] -> [Int]
    nubInt = nubWrt IS.empty

I've omitted the StopList class and instances, which are of
course crucial, but are kind of obvious.  Get the whole code at
  http://svcs.cs.pdx.edu/gitweb?p=nub.git;f=Nub.hs;hb=master
to see the details.

A day of painful fiddling with how to do benchmark timing
(there are three packages in Hackage and I couldn't figure
out how to use any of them) tells me that this version of
nub is about the same speed as the nub in Prelude
(unsurprisingly). This nubOrd is capable of running a list
of 10^7 elements, all different, in about 23 seconds on
my box; nubInt runs that list in about 5 seconds.  Plain old
nub runs a list of 10^5 elements, all different, in
about 70 seconds.  For small lists, or lists where many of
the elements are the same, everything's good: all the
functions are fast and about the same speed.  They all run
lists of 10^8 identical elements in around 5s, although it
turns out that the IntSet member test is close to twice as
fast as list elem.

I haven't QuickCheck tested anything, but my little tests
with ghci make me think the code is probably correct.

Hope this helps.

    Bart Massey
    bart@cs.pdx.edu