
By the way, the comment about the stack was an aside for me. The enigma of the performance change is what I would like to understand. To be honest, it bothers me a little because I simply did not expect it. Also, to be frank, someone should check me: I found this out by accident late in the evening, which means you should not trust me too much. I do not believe I am doing anything silly here; I just do not know. Bertram appears to be on to something, but the experiment below is what I observed.

Only the random list for Data.List.sort overflows the default stack, which makes sense given Bertram's argument, but not so much to me given the sorted, reversed, and same-element lists below. The program sorting runs Data.List.sort on a list, or sortX, which is just Data.List.sort with the two aforementioned lines swapped. The random list is made with Random, the sorted list is [1..1000000], the reversed list is [1000000,999999..1], and the list of zeros is just take 1000000 $ repeat 0.

% ./sorting --random sort 1000000
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize' to increase it.
% ./sorting --sorted sort 1000000
Length of sorted sorted list = 1000000
% ./sorting --reversed sort 1000000
Length of sorted reversed sorted list = 1000000
% ./sorting --same sort 1000000
Length of sorted list of zeros = 1000000
% ./sorting --random sortX 1000000
Length of sorted random list = 1000000
% ./sorting --sorted sortX 1000000
Length of sorted sorted list = 1000000
% ./sorting --reversed sortX 1000000
Length of sorted reversed sorted list = 1000000
% ./sorting --same sortX 1000000
Length of sorted list of zeros = 1000000

Here are the GC statistics for the random case, but on only 100000 Int to avoid the stack overflow. I do not know what to make of this.
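In case anyone wants to replay this without the original program, here is a sketch of how the four test inputs might be built. The original used System.Random for the random list; the small linear congruential generator below is only a stand-in so the sketch depends on nothing outside base, and all of these names are mine, not from the actual sorting program.

```haskell
-- Stand-in linear congruential generator (NOT what the original
-- program used; it used System.Random).
lcg :: Int -> Int
lcg s = (1103515245 * s + 12345) `mod` 2147483648

-- n pseudo-random Ints from a seed.
randomList :: Int -> Int -> [Int]
randomList seed n = take n (tail (iterate lcg seed))

-- The three deterministic inputs from the experiment above.
sortedList, reversedList, zerosList :: Int -> [Int]
sortedList   n = [1 .. n]
reversedList n = [n, n - 1 .. 1]
zerosList    n = take n (repeat 0)

main :: IO ()
main = mapM_ (print . take 3)
  [ randomList 42 1000000
  , sortedList 1000000
  , reversedList 1000000
  , zerosList 1000000
  ]
```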
% ./sorting --random sort 100000 +RTS -sstderr
Length of sorted random list = 100000
 191,878,444 bytes allocated in the heap
 111,808,176 bytes copied during GC (scavenged)
  20,928,592 bytes copied during GC (not scavenged)
  24,830,520 bytes maximum residency (8 sample(s))

         336 collections in generation 0 (  0.61s)
           8 collections in generation 1 (  0.24s)

          58 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.67s  (  0.66s elapsed)
  GC    time    0.85s  (  0.91s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.52s  (  1.58s elapsed)

  %GC time      55.8%  (57.9% elapsed)

  Alloc rate    285,514,704 bytes per MUT second

  Productivity  44.2% of total user, 42.5% of total elapsed

% ./sorting --random sortX 100000 +RTS -sstderr
Length of sorted random list = 100000
 175,301,948 bytes allocated in the heap
 100,767,660 bytes copied during GC (scavenged)
  20,974,892 bytes copied during GC (not scavenged)
  13,687,992 bytes maximum residency (13 sample(s))

         337 collections in generation 0 (  0.41s)
          13 collections in generation 1 (  0.28s)

          38 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.51s  (  0.55s elapsed)
  GC    time    0.69s  (  0.69s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.21s  (  1.24s elapsed)

  %GC time      57.3%  (55.8% elapsed)

  Alloc rate    339,713,364 bytes per MUT second

  Productivity  42.4% of total user, 41.3% of total elapsed

And for the record, the times on my box:

# Needs to be at least 50M for the -K option
% time ./sorting --random sort 1000000 +RTS -K50M
Length of sorted random list = 1000000
real    0m38.16s
user    0m37.34s
sys     0m0.61s

% time ./sorting --random sortX 1000000
Length of sorted random list = 1000000
real    0m19.36s
user    0m18.94s
sys     0m0.37s

Cheers,
- Marcus

Bertram Felgenhauer wrote:
Marcus D. Gabriel wrote:
By a rather indirect route, I discovered that I obtain almost a factor-of-two improvement in the performance of Data.List.sort if I make one small change to the implementation of the function merge, which supports mergesort and hence sortBy and sort. Admittedly, the improvement was only noticeable to me when sorting, for example, one million random Int. The current code in libraries/base/Data/List.hs for merge is
merge :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
merge cmp xs [] = xs
merge cmp [] ys = ys
[snip]
and all that I did was
merge :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
merge cmp [] ys = ys
merge cmp xs [] = xs
[snip]
that is, I swapped the order of the first two lines. By the way, the second version is much less likely to overflow the stack than the first version.
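For anyone who wants to replay the comparison outside the library, here is a self-contained model of the bottom-up mergesort, parameterised over the merge function so the two clause orders can be tried side by side. mergeLib has the library's clause order and mergeSwapped the modified one; the mergesortBy wrapper and these names are mine, a sketch modelled on the Data.List implementation rather than a copy of it.

```haskell
-- Library clause order: the second list is matched (forced) first.
mergeLib :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
mergeLib _   xs [] = xs
mergeLib _   [] ys = ys
mergeLib cmp (x:xs) (y:ys) =
  case x `cmp` y of
    GT -> y : mergeLib cmp (x:xs) ys
    _  -> x : mergeLib cmp xs (y:ys)

-- Swapped clause order: the first list is matched (forced) first.
mergeSwapped :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
mergeSwapped _   [] ys = ys
mergeSwapped _   xs [] = xs
mergeSwapped cmp (x:xs) (y:ys) =
  case x `cmp` y of
    GT -> y : mergeSwapped cmp (x:xs) ys
    _  -> x : mergeSwapped cmp xs (y:ys)

-- Bottom-up mergesort: wrap every element in a singleton run,
-- then repeatedly merge adjacent pairs of runs until one remains.
mergesortBy :: ((a -> a -> Ordering) -> [a] -> [a] -> [a])
            -> (a -> a -> Ordering) -> [a] -> [a]
mergesortBy merge cmp = go . map (: [])
  where
    go []   = []
    go [xs] = xs
    go xss  = go (pairs xss)
    pairs (xs:ys:xss) = merge cmp xs ys : pairs xss
    pairs xss         = xss

main :: IO ()
main = do
  print (mergesortBy mergeLib     compare [3, 1, 2 :: Int])  -- [1,2,3]
  print (mergesortBy mergeSwapped compare [3, 1, 2 :: Int])  -- [1,2,3]
```

Running both variants over a million random Int should let anyone reproduce the stack and timing difference reported in the experiment above.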
Can someone confirm this? If you confirm it, can someone explain to me why one obtains this performance improvement? I currently just do not grasp the point.
I'm not sure why there is a performance difference, but there is one major difference in the behaviour of these two implementations:
The library version evaluates the list from right to left (i.e. it compares the last two elements of the list first), because the third argument of 'merge' is forced before the second one.
Swapping those two lines of 'merge' results in a version which compares the first two elements of the list first.
This means that the library version produces a stack overflow on lists generated in an iterate-like fashion (say, take 1000000 [0..]). The modified version produces a stack overflow on the reverse of that list, but I believe such lists are much rarer in practice.
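The forcing order behind this can be seen in isolation with a toy two-clause function (names mine, not library code). Pattern clauses are tried top to bottom, and a [] pattern forces that argument to weak head normal form before the next clause can be tried:

```haskell
-- likeLib mirrors the library merge's clause order (second list
-- inspected first); likeSwapped mirrors the modified order.

likeLib :: [a] -> [a] -> String
likeLib _  [] = "second empty"     -- the [] pattern forces the SECOND list
likeLib [] _  = "first empty"
likeLib _  _  = "both non-empty"

likeSwapped :: [a] -> [a] -> String
likeSwapped [] _  = "first empty"  -- the [] pattern forces the FIRST list
likeSwapped _  [] = "second empty"
likeSwapped _  _  = "both non-empty"

main :: IO ()
main = do
  -- Fine: only the second argument is ever forced.
  putStrLn (likeLib (undefined :: [Int]) [])       -- second empty
  -- Fine: only the first argument is ever forced.
  putStrLn (likeSwapped [] (undefined :: [Int]))   -- first empty
  -- By contrast, likeLib [] undefined would diverge: its first
  -- clause must evaluate the second argument before the second
  -- clause is even considered.
```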
Did you look at the GC stats for the two sort implementations?
Bertram