Weird profiling behaviour
Hi, I have what I think is a really strange problem. I have a fair sized problem, which involves sorting a data set, first on labels (which are Strings) and then on scores (which are Ints). The strange thing is that string sorting is *vastly* faster than int scoring! Now, I've tried modifying the sorting routinges, moving them into the same module, and so on, to no avail. Is there any likely explanation? The pipeline is basically sortBy int_cmp . compound_equal . sortBy string_cmp I hesitate to post the code, since it's part of a rather large program, and in the hope that somebody will pop up and say that oh, yes, it's a well know feature of 'sortBy'. Or that it's an artifact of profiling. Or something like that. Any, and I mean any, suggestions really, really welcome. -kzm -- If I haven't seen further, it is by standing in the footprints of giants
Ketil Z. Malde wrote:
I have what I think is a really strange problem. I have a fair sized problem, which involves sorting a data set, first on labels (which are Strings) and then on scores (which are Ints).
The strange thing is that string sorting is *vastly* faster than int scoring! Now, I've tried modifying the sorting routinges, moving them into the same module, and so on, to no avail. Is there any likely explanation? The pipeline is basically
sortBy int_cmp . compound_equal . sortBy string_cmp
I hesitate to post the code, since it's part of a rather large program, and in the hope that somebody will pop up and say that oh, yes, it's a well know feature of 'sortBy'. Or that it's an artifact of profiling. Or something like that.
Any, and I mean any, suggestions really, really welcome.
-kzm
Could it be that the string-comparison sort simply has less sorting to do than the int-comparison sort? The default definition of sortBy uses insertion sort, so if the string-sort input happens to be already sorted it takes linear time and if the int-sort input happens to be in reverse order it takes quadratic time. Colin R
Colin Runciman
Could it be that the string-comparison sort simply has less sorting to do than the int-comparison sort?
Not quite improbable, hang on while I print the profiling (with comparison in its own function): Yes, that seems to be the case, for 90K values to sort, I get 7M string comparisons and 321M integer comparisons. I'm running a new test now, with a larger number of values to sort, we'll see how it goes. Looks promising, thanks!
The default definition of sortBy uses insertion sort
I have vague recollection of the wisdom of this choice being questioned on these lists or others, and even vaguer recollection of it actually being a good choice. Comments, anybody? -kzm -- If I haven't seen further, it is by standing in the footprints of giants
ketil@ii.uib.no (Ketil Z. Malde) writes:
for 90K values to sort, I get 7M string comparisons and 321M integer
..and with different parameters giving 127K values, ie. a factor of 1.4, I get 12M and 614M comparisons, *very* close to the expected O(n²) behavior of insertion sort.
The default definition of sortBy uses insertion sort
I have vague recollection of the wisdom of this choice being questioned
And now I think I'm about question it as well... -kzm (writing his own O(n log n) sortBy as we speak) -- If I haven't seen further, it is by standing in the footprints of giants
On Wednesday 26 June 2002 04:19 am, Colin Runciman wrote:
Could it be that the string-comparison sort simply has less sorting to do than the int-comparison sort? The default definition of sortBy uses insertion sort, so if the string-sort input happens to be already sorted it takes linear time and if the int-sort input happens to be in reverse order it takes quadratic time.
A question I would like to ask is which compiler and libraries are you using? In the December 2001 version of Hugs, sortBy is implemented with the default definition of insertion sort. But if you are using ghc, you are probably using a quicksort algorithm. (Recently the ghc libraries were switched to use mergesort instead, but I'm not sure whether this made it into any of the released versions.) Remember that quicksort behaves quadratically in the worst case, which can happen with pre-sorted input. Maybe this can explain why the second round of sorting is so slow? - Brian Huffman
participants (3)
-
Brian Huffman -
Colin Runciman -
ketil@ii.uib.no