
Hello Andrew,

Saturday, April 19, 2008, 6:56:10 PM, you wrote:
> OK, so just for fun, I decided to try implementing a parallel merge sort.
Quite a coincidence: I'm currently writing a parallel compression algorithm, very much like parallel bzip2, but using GHC, of course.
> Weird thing #1: The first time you sort the data, it takes a few seconds. The other 7 times, it takes a split second - roughly 100x faster. Wuh?
This looks like a disk caching effect. If the data are read from disk on the first run and from the disk cache on subsequent runs, it only means that your algorithm runs faster than the disk can deliver its input.
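One way to separate disk time from sort time is to force the input fully into memory before starting the clock. A minimal sketch, assuming the `deepseq` and `time` packages (the helper name `timeSort` is my own invention):

```haskell
import Control.DeepSeq (deepseq)   -- from the deepseq package
import Data.Time.Clock (getCurrentTime, diffUTCTime)

-- Time a list-processing function, forcing the input completely
-- beforehand so that disk reads and lazy I/O are excluded from
-- the measurement.
timeSort :: ([Int] -> [Int]) -> [Int] -> IO Double
timeSort sortFn xs = do
  xs `deepseq` return ()           -- fully evaluate the input first
  t0 <- getCurrentTime
  let ys = sortFn xs
  ys `deepseq` return ()           -- fully evaluate the result
  t1 <- getCurrentTime
  return (realToFrac (diffUTCTime t1 t0))
```

With the input forced up front, repeated runs should then report comparable times rather than a 100x difference.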
> Weird thing #2: The parallel version runs *faster* than the sequential one in all cases - even with SMP disabled! (We're only talking a few percent faster, but still.)
> Weird thing #3: Adding the "-threaded" compiler option makes *everything* run a few percent faster. Even with only 1 OS thread.
There are plenty of possible reasons. First, -threaded makes I/O overlap with computation. Second, the parallel version may exhibit better CPU cache behavior, such as processing all the data while it is still in cache before sending it back to memory.
> Weird thing #4: Adding "-N2" makes *everything* slow down a few percent. In particular, Task Manager shows only one CPU core in use.
That's easy: your algorithm isn't really parallel, and with -N2 you have just allowed GHC to move it from one core to another. What you are seeing is the overhead of moving data around. :)
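For contrast, here is a minimal sketch of a merge sort that actually creates parallel sparks, using `par` and `pseq` from Control.Parallel (the names `parSort` and `merge` are my own; compile with -threaded and run with +RTS -N2 to see both cores used):

```haskell
import Control.Parallel (par, pseq)

-- Sort the two halves as separate sparks, then merge.
parSort :: Ord a => [a] -> [a]
parSort []  = []
parSort [x] = [x]
parSort xs  = force left `par` (force right `pseq` merge left right)
  where
    (as, bs) = splitAt (length xs `div` 2) xs
    left     = parSort as
    right    = parSort bs
    -- Force the spine of the list so the spark does real work
    -- instead of just evaluating the outermost constructor.
    force ys = length ys `pseq` ys

-- Standard sequential merge of two sorted lists.
merge :: Ord a => [a] -> [a] -> [a]
merge [] ys = ys
merge xs [] = xs
merge (x:xs) (y:ys)
  | x <= y    = x : merge xs (y:ys)
  | otherwise = y : merge (x:xs) ys
```

Note that for small inputs the sparking overhead dominates, so a realistic version would fall back to a sequential sort below some threshold.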
> Can anybody explain any of this behaviour? I have no idea what I'm benchmarking, but it certainly doesn't appear to be the performance of a parallel merge sort!
There are many subtle effects that make optimization much more interesting than applying simple schemas ;) That's why I like it so much :))

--
Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com