Re: [Haskell-cafe] Re: (flawed?) benchmark : sort

10 Mar 2008

      Krzysztof Skrze;tnicki wrote:
...
Ok, my turn now. Let's think about algorithm that takes equivalence 
relation EQ, ordering relation ORD on abstraction classes generated by 
this equivalence ( -> equivalence classes ) and divides given input 
elements XS into appropriate classes and then prints them out according 
to given ordering ORD. If we pose the restriction (let's call it (*)), 
that input XS should have at most one element from every abstraction 
class, we get sorting in a way that you desire. Hovewer, if we don't 
pose that restriction the algorithm still makes perfect sense and is 
given by standard library sortBy.
I see no reason for restriction (*).
I don't understand the above paragraph. Let's consider a simple example:

(sort [a,b]) in the case we have: (compare a b = EQ)

Which of the following 4 possible results are correct/incorrect?
1- [a,a]
2- [a,b]
3- [b,a]
4- [b,b]

I would say they are all correct and equivalent for any "sane" Ord
instance, though from the point of view of space efficiency 1 or 4
are preferable to 2 or 3.
...
For efficiency reasons there could be additional class StrictEq. If the 
type is in that class then we can assume (*) and use more 
space-efficient algorithm.
Now, the problem with
treeSort = concatMap (reverse . snd) . Map.toAscList
             . Map.fromListWith (++) . map (\x -> (x,[x]))
is that i can't tell any compact way of implementing treeSortBy in nice 
manner. This is because Data.Map does not allow us to provide our own 
comparison test instead of compare.
As a practical matter, for benchmarking you should also count the actual
number of comparisons needed, not just execution times for simple types
(Ints presumably).

Also, I think you'll find that the AVL lib gives better performance than
Data.Map, particularly for sorted inputs. Unfortunately you can't use
this implementation in the standard libs without making the AVL lib a
standard lib (the same happens to be true of Data.Map too, thought this
is widely perceived as being standard because of ghc library bundling
:-)

But actually I would say that if either (both) of these is faster than
the the standard sort then this is some kind of performance bug with
the current ghc release. They weren't faster last time I tested (with
Ints).

I also happen to think that sort should be made an Ord class method,
so that trie based sorts are possible (which should be faster for
complex data types). We should only use sort = sortBy compare as
the default.

Regards
--
Adrian Hey