
Ketil Z. Malde
I'll hereby argue for using a quicksort implementation akin to
sortBy' _  []     = []
sortBy' pc (x:xs) = let (l,e,g) = part3 (`pc` x) xs
                    in  sortBy' pc l ++ (x:e) ++ sortBy' pc g
  where
    part3 comp ys = p3 [] [] [] comp ys
    p3 ls es gs _    []     = (ls,es,gs)
    p3 ls es gs comp (y:ys) = case comp y of
                                LT -> p3 (y:ls) es     gs     comp ys
                                EQ -> p3 ls     (y:es) gs     comp ys
                                GT -> p3 ls     es     (y:gs) comp ys
(Hopefully this is fairly bug-free.) At least for my data (lots of values, limited range), it appears to speed things up tremendously. I haven't measured more general cases in any detail, though. And one obvious drawback may be that it's not stable, which I think can be alleviated by a few well placed 'reverse's.
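As a rough, untested sketch of the "well placed reverses" idea (the name sortByStable is just illustrative): part3 builds its accumulators back to front, so reversing them before recursing/emitting restores the original relative order of equal elements.

sortByStable :: (a -> a -> Ordering) -> [a] -> [a]
sortByStable _  []     = []
sortByStable pc (x:xs) =
    let (l,e,g) = part3 (`pc` x) xs
    in  sortByStable pc (reverse l) ++ (x : reverse e)
                                    ++ sortByStable pc (reverse g)
  where
    -- same three-way partition as above; the accumulators come out reversed
    part3 comp ys = p3 [] [] [] comp ys
    p3 ls es gs _    []     = (ls,es,gs)
    p3 ls es gs comp (y:ys) = case comp y of
                                LT -> p3 (y:ls) es gs comp ys
                                EQ -> p3 ls (y:es) gs comp ys
                                GT -> p3 ls es (y:gs) comp ys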
Comments welcome!
But sortBy' (compare) [1 .. n] costs too much, even for n = 11000. It costs (on worst data) many times more than mergeSort.

-----------------
Serge Mechveliani
mechvel@botik.ru

"Serge D. Mechveliani"
But sortBy' (compare) [1 .. n]
costs too much, even for n = 11000. It costs (on worst data) many times more than mergeSort.
Yes, but why do you want to sort sorted data? I think the multiple-value cost, i.e. that sortBy (compare) (take n $ repeat 1) is equally expensive, is a bigger problem. Okay, probably because it caught me unawares, but also because the mantra is "quicksort is pessimal on sorted data". But if mergesort (or heapsort, for that matter) can be made to behave nicely, I think that's a good alternative. I haven't run numbers, but I was under the impression that mergesort was quite a bit slower than quicksort in the expected case. I, for one, am sorting expected, not worst-case, data :-)

<gripe> What's this obsession with worst-case behaviour anyway? While I have linear-time algorithms I could use, I'm using one that's linear expected, quadratic worst-case -- but with better cache behaviour. And why not? There are O(2^n) possible inputs; who cares about the almost none that are pessimal? And that's cache as in the six-orders-of-magnitude access time difference between RAM and disk, not the relatively low cost of L2 cache misses.

> One solution I've seen suggested is to use quicksort to a depth of c log n (for some c), and fall back to mergesort thereafter.

Or to pick a random pivot, rather than the first element. (A rough sketch of the depth-bounded idea appears after this message.)

BTW, I'm fully in favor of keeping an insertion (or other) sort around that behaves nicely for sorted/almost sorted data, as a separate function available for the discriminating programmer.

Okay, that was kind of rambling; to sum up:

1. The default sorting should, in order of approximate preference:
   1. scale well (probably means O(n log n))
   2. scale beyond available RAM
   3. be fast (i.e. have low constant overhead)
   ?. be stable (always a nice property)
2. Other sorts should be provided for special cases that the programmer might know about, and where a different algorithm could be a win, possibly:
   - short sequences (bubble?)
   - sequences of sequences (radix?)
   - almost sorted/reverse sorted sequences (bubble/insertion?)
   - limited range (bucket?)

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
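A rough sketch of the depth-bounded idea quoted above (quicksort to roughly 2 log n levels, then fall back to mergesort), sometimes called introspective sort. The names introSortBy and mergeSortBy, the depth constant, and the first-element pivot are just illustrative choices; none of this is tuned or benchmarked.

import Data.List (partition)

introSortBy :: (a -> a -> Ordering) -> [a] -> [a]
introSortBy cmp xs = go depthLimit xs
  where
    depthLimit :: Int
    depthLimit = 2 * ceiling (logBase 2 (fromIntegral (max 2 (length xs)) :: Double))

    go _ []  = []
    go _ [y] = [y]
    go 0 ys  = mergeSortBy cmp ys        -- depth bound hit: fall back
    go d (y:ys) =                        -- y is the (first-element) pivot
      let (smaller, rest) = partition (\z -> cmp z y == LT) ys
      in  go (d-1) smaller ++ [y] ++ go (d-1) rest

-- A plain (unoptimised) top-down mergesort used as the fallback.
mergeSortBy :: (a -> a -> Ordering) -> [a] -> [a]
mergeSortBy _   []  = []
mergeSortBy _   [x] = [x]
mergeSortBy cmp xs  = merge (mergeSortBy cmp as) (mergeSortBy cmp bs)
  where
    (as, bs) = splitAt (length xs `div` 2) xs
    merge us []     = us
    merge [] vs     = vs
    merge (u:us) (v:vs)
      | cmp v u == LT = v : merge (u:us) vs
      | otherwise     = u : merge us vs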

There seems to be a renewed interest in sorting and in adaptive sorting. A while ago (pre Haskell 98) I compiled a rather extensive library of sorting routines, which you may want to look at:

http://www.informatik.uni-bonn.de/~ralf/software.html#sort

This includes different versions of mergesort, introspective sort (quicksort with a log n depth bound), heapsort, etc.

Cheers, Ralf

On Fri, Jun 28, 2002 at 09:44:11AM +0200, Ketil Z. Malde wrote:
"Serge D. Mechveliani"
writes: But sortBy' (compare) [1 .. n]
costs too much, even for n = 11000. It costs (on worst data) many times more than mergeSort.
Yes, but why do you want to sort sorted data?
[..]
quickSort will lose much for many data which are `almost' sorted. To detect quickly which data are bad for quickSort, you would probably need mergeSort ...

-----------------
Serge Mechveliani
mechvel@botik.ru

G'day all.

On Fri, Jun 28, 2002 at 09:44:11AM +0200, Ketil Z. Malde wrote:
I, for one, am sorting expected, not worst-case, data :-)
<gripe> What's this obsession with worst-case behaviour anyway?
The best algorithm to use is the one which exploits known facts about the data. The converse is: the worst algorithm to use is the one whose worst-case behaviour happens on the sort of data you want to feed it.

So if, for example, you know that your data will cause expected-case behaviour on average, more power to you. If you don't know whether your data will cause worst-case behaviour, shame on you for not being obsessed enough.

Cheers,
Andrew Bromage

Andrew J Bromage
I, for one, am sorting expected, not worst-case, data :-) What's this obsession with worst-case behaviour anyway?
The best algorithm to use is the one which exploits known facts about the data. The converse is: The worst algorithm to use is the one whose worst case behaviour happens on the sort of data you want to feed it.
Okay. (As an aside, and for Num type classes, has anybody tried calculating the average and using that as a pivot? I mean, I know we really want the median, but the average is at least available?)
So if, for example, you know that your data will cause expected-case behaviour on average, more power to you.
Isn't that what "expected-case" means? :-)

I was going to run through the statistics to work out the expected running time for a quicksort (i.e. sorting random data); I wrestled a bit with it, and while I'm fairly sure it's pretty close to n log n, the proof is at the moment left as an exercise for the reader...

No matter, let's look at string searching instead. In theory, Boyer-Moore or Knuth-Morris-Pratt are faster than a naive method, i.e. just using something like

    find pat = or . map (pat `isPrefixOf`) . tails

This is worst-case O(n*m), since you may have to traverse the whole pattern each time before deciding it's not a prefix of a tail, and moving on. The competition is O(n+m), which sounds a lot better. But for random data, the chance of a match on the first character is one in the alphabet size, the chance of two matches is one in alpsz², and so on. Most of the time we only traverse one or two places before finding a mismatch, and since the algorithm is so much simpler than the alternatives, it's generally faster. (I implemented KMP and measured, and I think you need *highly* repetitive data for it to be worth the trouble; a self-contained version of the naive search is sketched below.)

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
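For reference, a self-contained spelling of the naive search above: occurs is the one-liner with `or . map` written as `any`, and firstMatch plus the example values are just illustrative additions.

import Data.List (isPrefixOf, tails, findIndex)

-- Does the pattern occur anywhere in the text? Same as the one-liner above.
occurs :: Eq a => [a] -> [a] -> Bool
occurs pat = any (pat `isPrefixOf`) . tails

-- Index of the first occurrence, if any.
firstMatch :: Eq a => [a] -> [a] -> Maybe Int
firstMatch pat = findIndex (pat `isPrefixOf`) . tails

-- e.g. occurs "ell" "hello"    == True
--      firstMatch "lo" "hello" == Just 3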

You guys know about the expected running time of randomized quicksort, right?
There is also the heapsort algorithm. Both are fine for *any* data that you
can't do with a bucket/bin sort.
Keep hacking,
--
Eray Ozkural (exa)

On Sun, 20 Oct 2002 15:08:30 +0300
Eray Ozkural
You guys know about the expected running time of randomized quicksort, right? There is also the heapsort algorithm. Both are fine for *any* data that you can't do with a bucket/bin sort.
I've often wondered why the heapsort isn't used for the standard sort function.

I heard that one reason Hugs used insertion sort was for good lazy behaviour: you don't have to pay for the whole N^2 sort at once; each element that is actually required pays just N.

With heapsort you'd get this behaviour too. You pay N for the first element (to build the heap) and then log N for each subsequent element. Quicksort, on the other hand, is monolithic, as is mergesort (right?). (A sketch of the incremental heap behaviour follows below.)

Duncan
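A rough sketch of that incremental behaviour over lists, using a pairing heap rather than an array-based heap (purely illustrative; this is not what Hugs or GHC actually ship):

-- A pairing heap: Empty, or a root element with a list of sub-heaps.
data Heap a = Empty | Node a [Heap a]

-- Meld two heaps: the smaller root wins, the other heap becomes a child.
meld :: Ord a => Heap a -> Heap a -> Heap a
meld Empty h = h
meld h Empty = h
meld h1@(Node x hs1) h2@(Node y hs2)
  | x <= y    = Node x (h2:hs1)
  | otherwise = Node y (h1:hs2)

-- Meld a list of heaps pairwise (the pairing-heap delete-min helper).
melds :: Ord a => [Heap a] -> Heap a
melds []         = Empty
melds [h]        = h
melds (h1:h2:hs) = meld (meld h1 h2) (melds hs)

fromList :: Ord a => [a] -> Heap a
fromList = melds . map (\x -> Node x [])

-- The result is produced lazily, one delete-min per element.
toSortedList :: Ord a => Heap a -> [a]
toSortedList Empty       = []
toSortedList (Node x hs) = x : toSortedList (melds hs)

heapSort :: Ord a => [a] -> [a]
heapSort = toSortedList . fromList

-- take 10 (heapSort xs) pays roughly O(n) to build the heap and then one
-- amortised O(log n) delete-min per element produced, rather than the
-- cost of the whole sort.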

On Sun, 20 Oct 2002, Duncan Coutts wrote:
On Sun, 20 Oct 2002 15:08:30 +0300, Eray Ozkural wrote:
You guys know about the expected running time of randomized quicksort, right? There is also the heapsort algorithm. Both are fine for *any* data that you can't do with a bucket/bin sort.
I've often wondered why the heapsort isn't used for the standard sort function.
I heard that one reason Hugs used insertion sort was for good lazy behaviour: you don't have to pay for the whole N^2 sort at once; each element that is actually required pays just N.
With heapsort you'd get this behaviour too. You pay N for the first element (to build the heap) and then log N for each subsequent element.
Quicksort, on the other hand, is monolithic, as is mergesort (right?).
I don't know whether, as implemented, it is effectively monolithic, but surely it needn't be: if you ask for just the first item, then at each level you only do the partition corresponding to the elements smaller than the pivot, i.e. assuming `perfect partitioning' you do n + n/2 + n/4 + ... (approx 2n) work to find the smallest element.

What I'm not sure about is that this is done by creating a tree structure (either explicitly or implicitly), and I haven't thought about whether flattening that to a list is done in a way which minimises the cost of producing the whole list at the expense of making it less lazy.

___cheers,_dave_________________________________________________________
www.cs.bris.ac.uk/~tweed/  | `It's no good going home to practise
email:tweed@cs.bris.ac.uk  | a Special Outdoor Song which Has To Be
work tel:(0117) 954-5250   | Sung In The Snow' -- Winnie the Pooh

G'day all.
On Sun, 20 Oct 2002 15:08:30 +0300
Eray Ozkural
You guys know about the expected running time of randomized quicksort, right?
My guess is that a well-written merge sort would probably beat it over large lists because choosing a random element from a linked list takes O(N) time. I'd like to be proven wrong, though, so why don't you implement it and report back what you find?
There is also the heapsort algorithm. Both are fine for *any* data that you can't do with a bucket/bin sort.
On Sun, Oct 20, 2002 at 01:29:33PM +0100, Duncan Coutts wrote:
I've often wondered why the heapsort isn't used for the standard sort function.
Over linked lists you wouldn't use a heap, but rather you'd use a tree-based priority queue. At this point, my guess is that the cost of intermediate storage might dominate. Once again, don't take my word for it. Try it.
I heard that one reason Hugs used insertion sort was for good lazy behaviour: you don't have to pay for the whole N^2 sort at once; each element that is actually required pays just N.
With heapsort you'd get this behaviour too. You pay N for the first element (to build the heap) and then log N for each subsequent element.
Quicksort, on the other hand, is monolithic, as is mergesort (right?).
Actually, quick sort can be written to take only O(log N) time to pull out an element after an initial O(N) cost in setting things up. This definition will do the trick, assuming that choose_pivot_element picks an element of the list and costs no more than O(K) time over a list of length K (elements equal to the pivot are emitted directly, so each recursive call sees a strictly smaller list):

qsort :: (Ord a) => [a] -> [a]
qsort xs = qsort' xs []

qsort' []  = id
qsort' [x] = (x:)
qsort' xs  = let pivot = choose_pivot_element xs
             in  qsort' (filter (< pivot) xs)
               . (filter (== pivot) xs ++)
               . qsort' (filter (> pivot) xs)

What it boils down to is that both heap sort and quick sort were designed to sort arrays, so we shouldn't be surprised if they perform poorly on linked lists.

Cheers,
Andrew Bromage
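A purely illustrative way to make the sketch above compile is to take the first element as the pivot; any choice within the stated O(K) budget (a random element, a median-of-three) would do as well.

-- Simplest possible pivot choice; assumed only for illustration.
choose_pivot_element :: [a] -> a
choose_pivot_element = head

-- With lazy evaluation, asking for only a prefix of the result, e.g.
--   take 3 (qsort someBigList)
-- forces the filters along the "smaller" spine only (roughly
-- n + n/2 + n/4 + ... work) rather than the whole sort.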

Andrew J Bromage
Once again, don't take my word for it. Try it.
Ralf Hinze has a library of *lots* of sorting algorithms on his web page. I didn't have time to do a bit of hacking to make them run properly, but if somebody wants to pick up the gauntlet, that looks like a good starting point.

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
participants (8)
-
Andrew J Bromage
-
Andrew J Bromage
-
D. Tweed
-
Duncan Coutts
-
Eray Ozkural
-
ketil@ii.uib.no
-
Ralf Hinze
-
Serge D. Mechveliani