
Andrew Coppin wrote:
> The results of this little benchmark utterly defy comprehension. Allow me to enumerate:
>
> Weird thing #1: The first time you sort the data, it takes a few seconds. The other 7 times, it takes a split second - roughly 100x faster. Wuh?
The test data was a CAF. I changed it to a regular function, and this behaviour vanished. Now *all* runs of a given test take approximately the same amount of time (to within a few ms, anyway). That explains it: a CAF is evaluated once and the result retained, so the first sort paid for actually generating the lazy test data, and the other 7 runs just re-sorted the already-evaluated list. A function rebuilds the data on every call, so every run pays the full cost.
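
For concreteness, here's a minimal sketch of the two set-ups (the generator and the names are hypothetical, not my actual benchmark code):

    -- As a CAF: 'testData' is a top-level thunk. The first sort forces
    -- it, GHC retains the evaluated list, and later sorts reuse it.
    testData :: [Int]
    testData = take 1000000 (iterate step 42)
      where step x = (x * 1103515245 + 12345) `mod` 2147483648

    -- As a regular function: the list depends on the argument, so each
    -- call rebuilds it from scratch and every run pays the same cost.
    mkTestData :: Int -> [Int]
    mkTestData seed = take 1000000 (iterate step seed)
      where step x = (x * 1103515245 + 12345) `mod` 2147483648

With the CAF version, timing "sort testData" eight times really measures data generation once plus eight sorts.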
> Weird thing #2: The parallel version runs *faster* than the sequential one in all cases - even with SMP disabled! (We're only talking a few percent faster, but still.)
Now the parallel version is only faster with 1 worker thread. With more (even just 2) it becomes *slower* than the sequential version. Interestingly, it *does* now seem to be using more than 50% CPU, so it actually seems to be doing more work, just less efficiently. My first guess would be that it's creating data in a different order and thus stressing the GC more. Or maybe it's just that the algorithm sparks millions of really *tiny* items, and the per-spark overhead swamps the useful work. (It's a very simple implementation, so I was somewhat expecting this.) I'll try tuning further - the obvious knob is a granularity cutoff, sketched below...
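
Something like this is what I have in mind - spark only the top few levels of the recursion and fall back to sequential code below that. It's just a sketch of the technique (not my actual sort), assuming only 'par' and 'pseq' from Control.Parallel:

    import Control.Parallel (par, pseq)

    -- Mergesort that sparks the two halves in parallel, but only down
    -- to a given depth. With depth 4 you get at most 2^4 = 16 decent-
    -- sized sparks instead of one tiny spark per recursive call.
    parSort :: Ord a => Int -> [a] -> [a]
    parSort _ []  = []
    parSort _ [x] = [x]
    parSort d xs
      | d <= 0    = seqSort xs
      | otherwise = forceList right `par`
                    (forceList left `pseq` merge left right)
      where
        (as, bs) = splitAt (length xs `div` 2) xs
        left     = parSort (d - 1) as
        right    = parSort (d - 1) bs

    -- 'par' only evaluates its argument to WHNF, so force the whole
    -- list inside the spark; otherwise the sparked work is one cons.
    forceList :: [a] -> ()
    forceList = foldr seq ()

    -- Plain sequential mergesort for the bottom of the recursion.
    seqSort :: Ord a => [a] -> [a]
    seqSort []  = []
    seqSort [x] = [x]
    seqSort xs  = merge (seqSort as) (seqSort bs)
      where (as, bs) = splitAt (length xs `div` 2) xs

    merge :: Ord a => [a] -> [a] -> [a]
    merge [] ys = ys
    merge xs [] = xs
    merge xxs@(x:xs) yys@(y:ys)
      | x <= y    = x : merge xs yys
      | otherwise = y : merge xxs ys

Whether that actually wins on 2 cores depends on how the spark overhead compares to the merge cost, but it should at least get rid of the millions-of-tiny-sparks failure mode.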
> Weird thing #3: Adding the "-threaded" compiler option makes *everything* run a few percent faster. Even with only 1 OS thread.
Still true.
> Weird thing #4: Adding "-N2" makes *everything* slow down a few percent. In particular, Task Manager shows only one CPU core in use.
Adding -N2 does still slow everything down, but not by very much. (Except the truly parallel stuff - that shows quite a big slowdown.) Task Manager does now show about 60% CPU usage instead of 50%. (I have exactly 2 physical cores.)
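
One way to check where the extra time is going (file name hypothetical; this is just the standard RTS statistics output, nothing benchmark-specific):

    $ ghc -O2 -threaded --make ParSort.hs
    $ ParSort +RTS -N2 -sstderr

The -sstderr summary splits elapsed time into mutator time and GC time; if the GC share jumps noticeably between -N1 and -N2, that would support the stressing-the-GC theory.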