
On 16/10/13 17:46, Yitzchak Gale wrote:
On Sun, Oct 13, 2013 at 6:25 PM, Niklas Hambüchen
wrote: ordNub has the same behaviour as nub, while (Set.toList . Set.fromList) doesn't.
Perhaps you mean that they produce exactly the same output when fully evaluated. But I doubt that their behavior is *exactly* the same in every aspect. Given the differences in strictness between sets and lists, can you prove that nub and nubOrd have *exactly* the same strictness properties in every possible case?
Yes. The elements are evaluated in the same order as in `nub`. We can see that by comparing the implementations. If you have values that need not be fully evaluated to use (==) on them, and need not be fully evaluated *in a different way* when using `compare` on them, then of course the strictness properties are different. That is why one function has "Eq a =>" and the other one "Ord a =>" at the front.
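Side by side, the comparison looks roughly like this; the ordNub is a sketch along the lines of the version posted earlier in this thread, and nub' restates the standard Data.List definition (renamed only to avoid the Prelude clash). In both, an element is forced no earlier than by the membership test, and the input is walked left to right:

import qualified Data.Set as Set

-- sketch of the proposed ordNub: the "seen" accumulator is a Set,
-- membership is tested with compare (via Set.member)
ordNub :: Ord a => [a] -> [a]
ordNub = go Set.empty
  where
    go _ []     = []
    go s (x:xs)
      | x `Set.member` s = go s xs
      | otherwise        = x : go (Set.insert x s) xs

-- essentially Data.List.nub: the accumulator is a list,
-- membership is tested with (==) (via elem)
nub' :: Eq a => [a] -> [a]
nub' = go []
  where
    go _ []        = []
    go seen (x:xs)
      | x `elem` seen = go seen xs
      | otherwise     = x : go (x : seen) xs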
...ordNub is *always* faster than nub, even for singleton lists.
This is also a claim that I doubt. You say that you tested the case of singletons. How about this:
2 : replicate 100000 1 ++ [3]
Well, you could optimize for that by having nubOrd squeeze runs of equal elements in the input. But then how about something like this:
[3,2,1] ++ take 100000 (cycle [3,2,1]) ++ [4]
Or variations on that, cycling some other fairly small number of elements. Set containers must do extra comparisons, whereas the cost of running down the first few hundred elements of a linked list (as long as the compiler fuses the iteration down to a tight loop in the output code and you remain within the cache) is near zero. How could nubOrd be as fast as nub in these kinds of cases?
I make the following, really cool proposal:

* You add the test cases you just mentioned to the benchmark list in "ordnub.hs"
* run `cabal build`
* run `dist/build/ordnub/ordnub -o report.html`

Then you will see that ordNub is, indeed, faster.
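For reference, the `-o report.html` flag suggests the benchmark is a criterion program, so the added cases could look roughly like this (the input names are mine, and the exact contents of ordnub.hs may well differ):

import Criterion.Main
import Data.List (nub)
import qualified Data.Set as Set

-- ordNub as sketched above
ordNub :: Ord a => [a] -> [a]
ordNub = go Set.empty
  where
    go _ []     = []
    go s (x:xs)
      | x `Set.member` s = go s xs
      | otherwise        = x : go (Set.insert x s) xs

-- the two adversarial inputs suggested above (names are illustrative)
manyDups, smallCycle :: [Int]
manyDups   = 2 : replicate 100000 1 ++ [3]
smallCycle = [3,2,1] ++ take 100000 (cycle [3,2,1]) ++ [4]

main :: IO ()
main = defaultMain
  [ bgroup "manyDups"
      [ bench "nub"    (nf nub    manyDups)
      , bench "ordNub" (nf ordNub manyDups)
      ]
  , bgroup "smallCycle"
      [ bench "nub"    (nf nub    smallCycle)
      , bench "ordNub" (nf ordNub smallCycle)
      ]
  ]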
(as long as the compiler fuses the iteration down to a tight loop in the output code and you remain within the cache)
Even though a list is conceptually simpler than a set, a list node and a set node are not that different (as I explained in the last email). Both hold an element and pointers to further nodes. The only difference is that the Set node is optimised with strictness and unboxing, and the list node is not.
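Roughly (the Tree type below just mirrors the Bin constructor that containers uses for Data.Set internally; the field layout may vary between versions, and the names are illustrative):

-- an illustrative cons cell, mirroring the built-in list type:
-- every field is lazy (a pointer to a possibly unevaluated value)
data List a = Nil
            | Cons a (List a)

-- an illustrative node mirroring Data.Set's internal Bin constructor:
-- the size is a strict, unboxed Int; the element and subtrees are strict
data Tree a = Tip
            | Bin {-# UNPACK #-} !Int !a !(Tree a) !(Tree a)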
The only advantage of nubOrd is for very large lists where the complexity advantage overtakes the excellent constants of nub to provide better speed.
Sorry, I think this is absolutely invented. What "excellent constants" do you mean? How large are "very large lists"? I find lists of length 1000 a very common case. An `n log2 n` algorithm is already 100 times faster on those than an n² one. I do not dispute that there exist n² algorithms that are faster than others with better complexity for small inputs. nub is not one of them.
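(The factor of 100 is just the ratio of the two cost models at n = 1000, constants ignored:)

-- ratio of a quadratic cost to a linearithmic one, constants ignored
costRatio :: Double -> Double
costRatio n = (n * n) / (n * logBase 2 n)

-- costRatio 1000 is roughly 100, since logBase 2 1000 is roughly 10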