
By the way, the comment about the stack was an aside for me. The enigma of the performance change is what I would like to understand. To be honest, it bothers me a little because I simply did not expect it. Also, to be frank, someone should check me: I found this out by accident late in the evening, which means you should not trust me too much. I do not believe I am doing anything silly here; I just do not know. Bertram appears to be on to something, but the experiment below is what I observed.

Only the random list for Data.List.sort overflows the default stack, which makes sense given Bertram's argument, but not so much to me given the sorted, reversed, and same-element lists below. The program sorting runs Data.List.sort on a list, or sortX, which is just Data.List.sort with the two aforementioned lines swapped. The random list is made with Random, the sorted list is [1..1000000], the reversed list is [1000000,999999..1], and the list of zeros is just take 1000000 $ repeat 0.

% ./sorting --random sort 1000000
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize' to increase it.
% ./sorting --sorted sort 1000000
Length of sorted sorted list = 1000000
% ./sorting --reversed sort 1000000
Length of sorted reversed sorted list = 1000000
% ./sorting --same sort 1000000
Length of sorted list of zeros = 1000000
% ./sorting --random sortX 1000000
Length of sorted random list = 1000000
% ./sorting --sorted sortX 1000000
Length of sorted sorted list = 1000000
% ./sorting --reversed sortX 1000000
Length of sorted reversed sorted list = 1000000
% ./sorting --same sortX 1000000
Length of sorted list of zeros = 1000000

Here are the GC statistics for the random case, but on only 100000 Int to avoid the stack overflow. I do not know what to make of this.
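In case anyone wants to replay this without the original program, here is a sketch of how the four test inputs might be built. The original used System.Random for the random list; the small linear congruential generator below is only a stand-in so the sketch depends on nothing outside base, and all of these names are mine, not from the actual sorting program.

```haskell
-- Stand-in linear congruential generator (NOT what the original
-- program used; it used System.Random).
lcg :: Int -> Int
lcg s = (1103515245 * s + 12345) `mod` 2147483648

-- n pseudo-random Ints from a seed.
randomList :: Int -> Int -> [Int]
randomList seed n = take n (tail (iterate lcg seed))

-- The three deterministic inputs from the experiment above.
sortedList, reversedList, zerosList :: Int -> [Int]
sortedList   n = [1 .. n]
reversedList n = [n, n - 1 .. 1]
zerosList    n = take n (repeat 0)

main :: IO ()
main = mapM_ (print . take 3)
  [ randomList 42 1000000
  , sortedList 1000000
  , reversedList 1000000
  , zerosList 1000000
  ]
```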
% ./sorting --random sort 100000 +RTS -sstderr
Length of sorted random list = 100000
 191,878,444 bytes allocated in the heap
 111,808,176 bytes copied during GC (scavenged)
  20,928,592 bytes copied during GC (not scavenged)
  24,830,520 bytes maximum residency (8 sample(s))

         336 collections in generation 0 (  0.61s)
           8 collections in generation 1 (  0.24s)

          58 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.67s  (  0.66s elapsed)
  GC    time    0.85s  (  0.91s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.52s  (  1.58s elapsed)

  %GC time      55.8%  (57.9% elapsed)

  Alloc rate    285,514,704 bytes per MUT second

  Productivity  44.2% of total user, 42.5% of total elapsed

% ./sorting --random sortX 100000 +RTS -sstderr
Length of sorted random list = 100000
 175,301,948 bytes allocated in the heap
 100,767,660 bytes copied during GC (scavenged)
  20,974,892 bytes copied during GC (not scavenged)
  13,687,992 bytes maximum residency (13 sample(s))

         337 collections in generation 0 (  0.41s)
          13 collections in generation 1 (  0.28s)

          38 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.51s  (  0.55s elapsed)
  GC    time    0.69s  (  0.69s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.21s  (  1.24s elapsed)

  %GC time      57.3%  (55.8% elapsed)

  Alloc rate    339,713,364 bytes per MUT second

  Productivity  42.4% of total user, 41.3% of total elapsed

And for the record, the times on my box:

# Needs to be at least 50M for the -K option
% time ./sorting --random sort 1000000 +RTS -K50M
Length of sorted random list = 1000000
real    0m38.16s
user    0m37.34s
sys     0m0.61s

% time ./sorting --random sortX 1000000
Length of sorted random list = 1000000
real    0m19.36s
user    0m18.94s
sys     0m0.37s

Cheers,
- Marcus

Bertram Felgenhauer wrote:
Marcus D. Gabriel wrote:
By a rather indirect route, I discovered that I obtain almost a factor-of-two improvement in the performance of Data.List.sort if I make one small change to the implementation of the function merge, which supports mergesort and hence sortBy and sort. Admittedly, the improvement was only noticeable to me when sorting, for example, one million random Int. The current code in libraries/base/Data/List.hs for merge is
merge :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
merge cmp xs [] = xs
merge cmp [] ys = ys
[snip]
and all that I did was
merge :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
merge cmp [] ys = ys
merge cmp xs [] = xs
[snip]
that is, I swapped the order of the first two lines. By the way, the second version is much less likely to overflow the stack than the first version.
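For anyone who wants to replay the comparison outside the library, here is a self-contained model of the bottom-up mergesort, parameterised over the merge function so the two clause orders can be tried side by side. mergeLib has the library's clause order and mergeSwapped the modified one; the mergesortBy wrapper and these names are mine, a sketch modelled on the Data.List implementation rather than a copy of it.

```haskell
-- Library clause order: the second list is matched (forced) first.
mergeLib :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
mergeLib _   xs [] = xs
mergeLib _   [] ys = ys
mergeLib cmp (x:xs) (y:ys) =
  case x `cmp` y of
    GT -> y : mergeLib cmp (x:xs) ys
    _  -> x : mergeLib cmp xs (y:ys)

-- Swapped clause order: the first list is matched (forced) first.
mergeSwapped :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
mergeSwapped _   [] ys = ys
mergeSwapped _   xs [] = xs
mergeSwapped cmp (x:xs) (y:ys) =
  case x `cmp` y of
    GT -> y : mergeSwapped cmp (x:xs) ys
    _  -> x : mergeSwapped cmp xs (y:ys)

-- Bottom-up mergesort: wrap every element in a singleton run,
-- then repeatedly merge adjacent pairs of runs until one remains.
mergesortBy :: ((a -> a -> Ordering) -> [a] -> [a] -> [a])
            -> (a -> a -> Ordering) -> [a] -> [a]
mergesortBy merge cmp = go . map (: [])
  where
    go []   = []
    go [xs] = xs
    go xss  = go (pairs xss)
    pairs (xs:ys:xss) = merge cmp xs ys : pairs xss
    pairs xss         = xss

main :: IO ()
main = do
  print (mergesortBy mergeLib     compare [3, 1, 2 :: Int])  -- [1,2,3]
  print (mergesortBy mergeSwapped compare [3, 1, 2 :: Int])  -- [1,2,3]
```

Running both variants over a million random Int should let anyone reproduce the stack and timing difference reported in the experiment above.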
Can someone confirm this? If you confirm it, can someone explain to me why one obtains this performance improvement? I currently just do not grasp the point.
I'm not sure why there is a performance difference, but there is one major difference in the behaviour of these two implementations:
The library version evaluates the list from right to left (i.e. it compares the last two elements of the list first), because the third argument of 'merge' is forced before the second one.
Swapping those two lines of 'merge' results in a version which compares the first two elements of the list first.
This means that the library version produces a stack overflow on lists generated in an iterate-like fashion (say, take 1000000 [0..]). The modified version produces a stack overflow on the reverse of that list, but I believe such lists are much rarer in practice.
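The forcing order behind this can be seen in isolation with a toy two-clause function (names mine, not library code). Pattern clauses are tried top to bottom, and a [] pattern forces that argument to weak head normal form before the next clause can be tried:

```haskell
-- likeLib mirrors the library merge's clause order (second list
-- inspected first); likeSwapped mirrors the modified order.

likeLib :: [a] -> [a] -> String
likeLib _  [] = "second empty"     -- the [] pattern forces the SECOND list
likeLib [] _  = "first empty"
likeLib _  _  = "both non-empty"

likeSwapped :: [a] -> [a] -> String
likeSwapped [] _  = "first empty"  -- the [] pattern forces the FIRST list
likeSwapped _  [] = "second empty"
likeSwapped _  _  = "both non-empty"

main :: IO ()
main = do
  -- Fine: only the second argument is ever forced.
  putStrLn (likeLib (undefined :: [Int]) [])       -- second empty
  -- Fine: only the first argument is ever forced.
  putStrLn (likeSwapped [] (undefined :: [Int]))   -- first empty
  -- By contrast, likeLib [] undefined would diverge: its first
  -- clause must evaluate the second argument before the second
  -- clause is even considered.
```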
Did you look at the GC stats for the two sort implementations?
Bertram