
OK, so just for fun, I decided to try implementing a parallel merge sort using the seq and par combinators. My plan was to generate some pseudo-random data and time how long it takes to sort it. To try to account for lazy evaluation, what the program actually does is this:

1. Write the input data to disk without any sorting. (This ought to force it to be fully evaluated.)
2. Sort and save the data to disk 8 times. (So I can average the runtimes.)

This is done with two data sets - one with 1 million items, and another with 2 million. Each data set is run through both the purely sequential algorithm and a simple parallel one. (Split the list in half, merge-sort each half in parallel, and then merge them.)

The results of this little benchmark utterly defy comprehension. Allow me to enumerate:

Weird thing #1: The first time you sort the data, it takes a few seconds. The other 7 times, it takes a split second - roughly 100x faster. Wuh?

Weird thing #2: The parallel version runs *faster* than the sequential one in all cases - even with SMP disabled! (We're only talking a few percent faster, but still.)

Weird thing #3: Adding the "-threaded" compiler option makes *everything* run a few percent faster. Even with only 1 OS thread.

Weird thing #4: Adding "-N2" makes *everything* slow down a few percent. In particular, Task Manager shows only one CPU core in use. Adding more than 2 OS threads makes everything slow down even further - but that's hardly surprising.

Can anybody explain any of this behaviour? I have no idea what I'm benchmarking, but it certainly doesn't appear to be the performance of a parallel merge sort!

On Sat, Apr 19, 2008 at 10:56 AM, Andrew Coppin
Can anybody explain any of this behaviour? I have no idea what I'm benchmarking, but it certainly doesn't appear to be the performance of a parallel merge sort!
It would be much easier to draw sound conclusions if you would post your code. -- Denis

Denis Bueno wrote:
It would be much easier to draw sound conclusions if you would post your code.
Erm... good point. See attachments.

module Sort where

import Control.Parallel
import Control.Parallel.Strategies

split0 [] = []
split0 (x:xs) = x : split1 xs

split1 [] = []
split1 (x:xs) = split0 xs

merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys)
  | x < y     = x : merge xs (y:ys)
  | otherwise = y : merge (x:xs) ys

msort [] = []
msort [x] = [x]
msort xs =
  let
    xs0 = msort (split0 xs)
    xs1 = msort (split1 xs)
  in merge xs0 xs1

msortP [] = []
msortP [x] = [x]
msortP xs =
  let
    xs0 = msort (split0 xs)
    xs1 = msort (split1 xs)
  in seqList rwhnf xs0 `par` seqList rwhnf xs1 `seq` merge xs0 xs1

list = [5,4,6,3,7,2,8,1,9,0]

module Time where

import System.CPUTime

time :: IO () -> IO Integer
time fn = do
  t0 <- getCPUTime
  fn
  t1 <- getCPUTime
  return (t1 - t0)

ps_ms = 1000000000 :: Integer
ps_s = ps_ms * 1000 :: Integer

module Main where

import Data.Word
import System.IO
import GHC.Conc (numCapabilities)

import Sort
import Time

type Test = (String, [Word32])

random = iterate (\x -> 1664525 * x + 1013904223) 7 :: [Word32]

test1m = ("1M", take 1000000 random)
test2m = ("2M", take 2000000 random)

type Algo = (String, [Word32] -> [Word32])

algo_seq_msort = ("MergeSortSeq", msort)
algo_par_msort = ("MergeSortPar", msortP)

dump :: [Word32] -> String
dump = unlines . map show

run_tests :: Algo -> Test -> IO ()
run_tests (name, fn) (title, xs) = do
  echo "\n"
  let f1 = name ++ "--" ++ title ++ "--In.txt"
  echo $ "Writing '" ++ f1 ++ "'..."
  hFlush stdout
  nullT <- time (writeFile f1 (dump xs))
  echo $ " Took " ++ show (nullT `div` ps_ms) ++ " ms.\n"
  mapM_
    (\n -> do
       let f2 = name ++ "--" ++ title ++ "--Out" ++ show n ++ ".txt"
       echo $ "Writing '" ++ f2 ++ "'..."
       hFlush stdout
       sortT <- time (writeFile f2 (dump (fn xs)))
       echo $ " Took " ++ show (sortT `div` ps_ms) ++ " ms.\n")
    [1..8]

echo msg = do
  hPutStr stdout msg
  hPutStr stderr msg

main = do
  echo $ "CPU threads = " ++ show numCapabilities ++ ".\n"
  mapM_
    (\test ->
       mapM_ (\algo -> run_tests algo test)
             [algo_seq_msort, algo_par_msort])
    [test1m, test2m]
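(For reference, a typical build and run for these modules would look something like the following - the -O2 flag is an assumption, since the thread never shows the exact compile line, while -threaded and -N2 are the options under discussion. -sstderr prints GC and timing statistics on exit, which becomes relevant later in the thread:)

ghc --make -O2 Main.hs
ghc --make -O2 -threaded Main.hs
Main.exe +RTS -N2 -sstderr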

Hello Andrew, Saturday, April 19, 2008, 6:56:10 PM, you wrote:
OK, so just for fun, I decided to try implementing a parallel merge sort
A coincidence - I'm now writing a parallel compression algorithm, very much like parallel bzip2, but using GHC, of course.
Weird thing #1: The first time you sort the data, it takes a few seconds. The other 7 times, it takes a split second - roughly 100x faster. Wuh?
This looks like disk caching effects. If data is read from disk on the first run and from the disk cache on the next runs, this only means that your algorithm works faster than reading its data from disk.
Weird thing #2: The parallel version runs *faster* than the sequential one in all cases - even with SMP disabled! (We're only talking a few percent faster, but still.)
Weird thing #3: Adding the "-threaded" compiler option makes *everything* run a few percent faster. Even with only 1 OS thread.
There are plenty of reasons. First, -threaded makes I/O overlap with calculations. Second, the parallel version may exhibit better CPU cache behavior - such as processing all data in cache before sending it back to memory.
Weird thing #4: Adding "-N2" makes *everything* slow down a few percent. In particular, Task Manager shows only one CPU core in use.
It's easy - your algorithm isn't really parallel, and you just forced GHC to move it from one core to another. It's the overhead of moving data around :)
Can anybody explain any of this behaviour? I have no idea what I'm benchmarking, but it certainly doesn't appear to be the performance of a parallel merge sort!
There are many subtle effects that make optimization much more interesting than following simple schemas ;) It's why I like it so much :))

--
Best regards,
Bulat  mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Weird thing #1: The first time you sort the data, it takes a few seconds. The other 7 times, it takes a split second - roughly 100x faster. Wuh?
This looks like disk caching effects. If data is read from disk on the first run and from the disk cache on the next runs, this only means that your algorithm works faster than reading its data from disk.
Negative. No data is ever *read* from disk, only *written* to disk. (And each test writes to a different file.) The data to be sorted is generated using a trivial LCG PRNG.
There are plenty of reasons. First, -threaded makes I/O overlap with calculations.
Not with -N1.
Second, the parallel version may exhibit better CPU cache behavior - such as processing all data in cache before sending it back to memory.
Again, with -N1, it is *still* only using 1 CPU core.
Weird thing #4: Adding "-N2" makes *everything* slow down a few percent. In particular, Task Manager shows only one CPU core in use.
It's easy - your algorithm isn't really parallel.
Fails to explain why the parallel version is faster than the sequential one (even with no parallelism), or why the sequential algorithm should slow down with more threads. (Surely the extra threads just sit idle?)
There are many subtle effects that make optimization much more interesting than following simple schemas ;) It's why I like it so much :))
Well, based on the results I've seen so far, it seems that parallelism is a complete waste of time because it doesn't gain you anything. And that doesn't make a lot of sense...

Hello Andrew, Saturday, April 19, 2008, 7:50:30 PM, you wrote:
This looks like disk caching effects. If data is read from disk on the first run and from the disk cache on the next runs, this only means that your algorithm works faster than reading its data from disk.
Negative. No data is ever *read* from disk, only *written* to disk. (And each test writes to a different file.)
The data to be sorted is generated using a trivial LCG PRNG.
If you don't generate new data for each sorting run, this means that data generation is much slower than sorting. Don't forget about GHC's laziness :)
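(To illustrate the point: one way to take generation cost out of the timings entirely is to force every element of the input list before starting the clock. A minimal sketch, reusing the same seqList and rwhnf that the attached Sort module already imports - for Word32 elements, WHNF is the whole value, so this fully evaluates the list:)

import Data.Word (Word32)
import Control.Parallel.Strategies (seqList, rwhnf)

-- Evaluate every element up front, so PRNG work cannot leak into
-- the first timed sort.
forceInput :: [Word32] -> IO ()
forceInput xs = seqList rwhnf xs `seq` return ()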
There are plenty of reasons. First, -threaded makes I/O overlap with calculations.
Not with -N1.
Are you sure? :) AFAIK, the -threaded RTS uses a dedicated I/O thread regardless of the -N setting (which controls only the number of threads running *user* code).
Second, the parallel version may exhibit better CPU cache behavior - such as processing all data in cache before sending it back to memory.
Again, with -N1, it is *still* only using 1 CPU core.
The parallel version is different from the sequential one, and it processes data in a different order. For example, imagine a tar+gzip algorithm that runs sequentially and writes intermediate results to disk. The same algorithm, made multithreaded, will compress data on the fly and won't store intermediate data on the HDD, despite using only one core. The same effect applies to CPU cache usage.
Weird thing #4: Adding "-N2" makes *everything* slow down a few percent. In particular, Task Manager shows only one CPU core in use.
It's easy - your algorithm isn't really parallel.
Fails to explain why the parallel version is faster than the sequential one (even with no parallelism), or why the sequential algorithm should slow down with more threads. (Surely the extra threads just sit idle?)
There are management overheads. With multiple worker threads, you have many OS threads fighting for the right to execute a single Haskell thread :))
There are many subtle effects that make optimization much more interesting than following simple schemas ;) It's why I like it so much :))
Well, based on the results I've seen so far, it seems that parallelism is a complete waste of time because it doesn't gain you anything. And that doesn't make a lot of sense...
I made the world's fastest compression program using multithreading, so the problem is definitely on the other side ;)

--
Best regards,
Bulat  mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Andrew,
Saturday, April 19, 2008, 7:50:30 PM, you wrote:
The data to be sorted is generated using a trivial LCG PRNG.
If you don't generate new data for each sorting run, this means that data generation is much slower than sorting. Don't forget about GHC's laziness :)
...which is why the unsorted data is written to disk *before* I attempt to sort it. This is to ensure it's fully evaluated, and to give me some idea of how long it takes to write to disk. The [first] sorting stage is about 10x slower than the initial writing stage - and yet, subsequent sorting is 10x faster than the initial unsorted write.
Not with -N1.
Are you sure? :) AFAIK, the -threaded RTS uses a dedicated I/O thread regardless of the -N setting (which controls only the number of threads running *user* code).
Interesting. Without consulting a GHC RTS expert, I guess there's no way to know. Certainly I doubt my program is I/O-bound. It only writes about 11 MB of data to a text file. I would think converting Word32 -> String is the slow part.
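(That hypothesis is cheap to check in isolation. A sketch, reusing time from the Time module and dump from Main - evaluate comes from Control.Exception, and forcing the length of the string performs the whole Word32 -> String conversion without touching the disk:)

import Control.Exception (evaluate)

-- Time the conversion alone, with no file I/O involved.
timeDump :: [Word32] -> IO Integer
timeDump xs = time (evaluate (length (dump xs)) >> return ())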
Again, with -N1, it is *still* only using 1 CPU core.
The parallel version is different from the sequential one, and it processes data in a different order.
Yeah, perhaps it improves RAM usage or something...
Fails to explain why the parallel version is faster than the sequential one (even with no parallelism), or why the sequential algorithm should slow down with more threads. (Surely the extra threads just sit idle?)
There are management overheads. With multiple worker threads, you have many OS threads fighting for the right to execute a single Haskell thread :))
As I understand it, sparked work is placed into a FIFO queue, and a set of worker threads polls this queue, removing the first item and executing it. There is no "fighting"; once a work item has been picked up by a thread, it runs to completion. Now, if there are more OS threads than physical CPU cores, they will fight it out at the OS level over who runs first... ;-)

OK, well I now have so much data sitting in front of me I don't even know *what* I'm seeing any more. I have made several significant discoveries though... Consider the following:

msort [] = []
msort [x] = [x]
msort xs =
  let
    xs0 = msort (split0 xs)
    xs1 = msort (split1 xs)
  in merge xs0 xs1

This takes roughly 14 seconds to sort a list of one million Word32 values. If I now change the final line to read

  in seqList rwhnf xs0 `seq` seqList rwhnf xs1 `seq` merge xs0 xs1

it now takes 8 seconds to do the same job. Notice that this is still completely sequential execution. It's just executing in a different order. (And, at first glance, doing slightly more work.) Of all the benchmarks I've performed, I have yet to find anything that goes faster than this.

If I make it so that xs0 is computed in parallel with xs1 instead of in series, then it goes at roughly the same speed (but with more variation) if the number of real threads is 1. If you add more real threads, execution slows down. (Go figure!) I was expecting running parallel at just the top few levels and then switching to pure sequential for the lower levels to give the best performance. But the numbers I have seem to say that more parallel = slower, with 100% sequential giving me the fastest time of all.

The next insight happens when you look at the GC statistics. Both the unmarked and the explicitly sequential program are giving me roughly 55% GC time and 45% user time. (!!) Obviously this is a Very Bad Thing. I discovered that simply by adding -H200m to the command line, I can make both programs speed up by about 20% or so. (They then drop down to roughly 25% GC time. Adding more RAM doesn't seem to make any difference.)

I had assumed that the explicitly sequential program was faster because it was somehow demanding less GC time, but that doesn't appear to be the case - both GC time and user time are lower for the explicitly sequential version. And adding more heap space doesn't make the (large) speed difference go away. Is the strictness of the seq operator making GHC come up with a different Core implementation for this function or something? I have no idea.

With the extra heap space, the speed difference between the sequential and parallel programs becomes smaller. The sequential version *is* still faster, however. I have no explanation for why that might be. Adding more heap also seems to make the runtimes more variable. (I run each test 8 times. In one test, the fastest run was 7 seconds and the slowest was 11 seconds. That's quite a variation. The sequential algorithm only varies by a few milliseconds each time.)

In short, it seems my little sorting algorithm test is *actually* just stressing out the GC engine, and I'm "really" benchmarking different GC settings. :-(

Questions:

1. Does running the GC force all threads to stop? I know some GC designs do this, but I have no idea how the one GHC implements works.
2. Is the GC still single-threaded? (GHC 6.8.2 here.)
3. Is there any way for a running Haskell program to find out how much heap space is currently allocated / used / free? I know you can find out how much wall time and CPU time you've used, but I couldn't find anything for RAM usage.
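(Regarding the "parallel at just the top few levels" idea above: a depth-limited variant of msortP would look something like this - a sketch only; the cutoff d is a hypothetical tuning parameter, and nothing in the thread says which value, if any, was tried:)

-- Spark parallel evaluation only for the top d levels of the
-- recursion, then fall back to the purely sequential msort.
msortPD :: Ord a => Int -> [a] -> [a]
msortPD _ [] = []
msortPD _ [x] = [x]
msortPD 0 xs = msort xs
msortPD d xs =
  let
    xs0 = msortPD (d - 1) (split0 xs)
    xs1 = msortPD (d - 1) (split1 xs)
  in seqList rwhnf xs0 `par` (seqList rwhnf xs1 `seq` merge xs0 xs1)

(With d = 3 this creates only a handful of sparks - at most 2^3 - 1 = 7 - each covering a substantial chunk of the list, rather than sparking tiny pieces of work.)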

On Apr 20, 2008, at 15:41 , Andrew Coppin wrote:
1. Does running the GC force all threads to stop? I know some GC designs do this, but I have no idea how the one GHC implements works.
2. Is the GC still single-threaded? (GHC 6.8.2 here.)
Full GC is single-threaded and stops the entire program, yes. IIRC GHC's runtime uses a generational collector, so most collections touch only the nursery and a full GC is comparatively rare.
3. Is there any way for a running Haskell program to find out how much heap space is currently allocated / used / free? I know you can find out how much wall time and CPU time you've used, but I couldn't find anything for RAM usage.
You're looking for "heap profiling" in the GHC manual.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

On Apr 20, 2008, at 15:51 , Brandon S. Allbery KF8NH wrote:
On Apr 20, 2008, at 15:41 , Andrew Coppin wrote:
3. Is there any way for a running Haskell program to find out how much heap space is currently allocated / used / free? I know you can find out how much wall time and CPU time you've used, but I couldn't find anything for RAM usage.
You're looking for "heap profiling" in the GHC manual.
Wait, no, I misread. I don't know if or how a running program could introspect its own heap usage.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

Brandon S. Allbery KF8NH wrote:
On Apr 20, 2008, at 15:41 , Andrew Coppin wrote:
1. Does running the GC force all threads to stop? I know some GC designs do this, but I have no idea how the one GHC implements works.
2. Is the GC still single-threaded? (GHC 6.8.2 here.)
Full GC is single-threaded and stops the entire program, yes. IIRC GHC's runtime uses a generational collector, so most collections touch only the nursery and a full GC is comparatively rare.
My brain is telling me I've read something somewhere that had in-depth information about GHC's GC implementation - but I can't remember where I saw it... (Maybe the developer wiki? I'll go look there anyway, they might have some interesting goodies.)

Anyway, the sequential version of the program clearly indicates that the thing is doing *a lot* of GC. If the GC sometimes stops the entire program, that could sure limit parallelism real fast...

I suppose the only way to know for sure is to turn the heap size up absurdly high so that GC is *never* required, and see what that does. FWIW, I do see higher CPU usage figures with a larger heap...

Hello Andrew, Monday, April 21, 2008, 12:05:28 AM, you wrote:
My brain is telling me I've read something somewhere that had in-depth information about GHC's GC implementation - but I can't remember where I saw it... (Maybe the developer wiki? I'll go look there anyway, they might have some interesting goodies.)
Look at the end of the Arrays wiki page and the references from there. In particular, play with -A too.

--
Best regards,
Bulat  mailto:Bulat.Ziganshin@gmail.com
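(For concreteness, these are ordinary GHC RTS options; -H200m is the value already tried earlier in the thread, while the -A figure below is only an illustrative size, not a recommendation:)

Main.exe +RTS -H200m -sstderr    (one big heap, plus GC/time statistics)
Main.exe +RTS -A10m -sstderr     (a larger allocation area / nursery instead)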

Hello Andrew, Sunday, April 20, 2008, 11:41:52 PM, you wrote:

Yes, GC behavior has a significant impact in any new language (I mean Java, C#, F# and so on).
1. Does running the GC force all threads to stop? I know some GC designs do this, but I have no idea how the one GHC implements works.
Yes.
2. Is the GC still single-threaded? (GHC 6.8.2 here.)
Yes. A multi-threaded GC is planned for the next GHC version, AFAIR.
3. Is there any way for a running Haskell program to find out how much heap space is currently allocated / used / free?
I think it's possible by asking the internal RTS vars. SM once suggested adding a library to GHC that provides an official way to ask for this info, but no volunteer ever appeared :)

--
Best regards,
Bulat  mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
3. Is there any way for a running Haskell program to find out how much heap space is currently allocated / used / free?
I think it's possible by asking the internal RTS vars. SM once suggested adding a library to GHC that provides an official way to ask for this info, but no volunteer ever appeared :)
The RTS can spit out aggregate data just with a CLI switch (and it doesn't appear to affect runtime noticeably). You don't even need to compile with profiling enabled. This seems to indicate that the data is easy to collect; there's just no path for accessing it yet. I'm no GHC developer, but from the outside it "appears" to be a fairly simple problem. If I knew anything about the RTS, I'd volunteer myself. But I suspect this is one of those jobs that requires knowledge of C... :-(

Bulat Ziganshin wrote:
Yes. A multi-threaded GC is planned for the next GHC version, AFAIR.
To be clear, it'll be a parallel GC, not a concurrent one. The former still stops all threads, but runs the collector on multiple cores at once. The latter performs collection while mutator threads are still running, and is a lot trickier to implement. (For a fine knee-slapping time, try reading a Java GC tuning guide.)

On Apr 19, 2008, at 11:50 , Andrew Coppin wrote:
Bulat Ziganshin wrote:
There are plenty of reasons. First, -threaded makes I/O overlap with calculations.
Not with -N1.
Depending on how it's implemented (I'm not a GHC guru), possibly even with -N1, as long as it's using the thread-capable runtime. (Note that "make -j2" is known to be optimal on single-processor machines, specifically because I/O tends to overlap with CPU.)
Second, the parallel version may exhibit better CPU cache behavior - such as processing all data in cache before sending it back to memory.
Again, with -N1, it is *still* only using 1 CPU core.
And again, this may well be an effect of using the thread-*capable* runtime. You can't generally multiplex memory accesses in SMP, so you may well want to delay and batch operations to/from main memory as much as possible to reduce lock contention for memory access.
Well, based on the results I've seen so far, it seems that parallelism is a complete waste of time because it doesn't gain you anything. And that doesn't make a lot of sense...
Easy parallelism is still an unsolved problem; naive parallelism generally is no better than sequential and often worse, because naive parallelism generally fails to account for lock / resource contention. (Note that resource locking will be done by the threaded runtime even with only one thread, so you will see some slowdowns especially in I/O-related code.) Haskell can only help you so much with this; you need to design your algorithms to parallelize properly.

In addition, laziness can result in naive parallelism being a no-op, because the only thing parallelized is some operation that trivially returns a lazy thunk that is later forced in the main thread. Careful strictness analysis is necessary in non-strict languages to make sure you are actually parallelizing what you want to.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH
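(A concrete illustration of that last point, as a sketch built from the msort/split0/split1/merge definitions posted earlier. The first version sparks only the WHNF of xs0 - for a lazy merge sort, that evaluates only as far as the first element - while the second, which is the pattern msortP uses, forces the whole list inside the spark:)

-- Mostly a no-op: `par` evaluates the sparked expression to WHNF,
-- so this spark computes only the head of the sorted sub-list.
msortBad xs =
  let
    xs0 = msort (split0 xs)
    xs1 = msort (split1 xs)
  in xs0 `par` merge xs0 xs1

-- Useful: the sparked expression forces every element of xs0,
-- so real work can happen on the other core before merge demands it.
msortGood xs =
  let
    xs0 = msort (split0 xs)
    xs1 = msort (split1 xs)
  in seqList rwhnf xs0 `par` (seqList rwhnf xs1 `seq` merge xs0 xs1)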

Hello Brandon, Saturday, April 19, 2008, 8:24:03 PM, you wrote:
contention. (Note that resource locking will be done by the threaded runtime even with only one thread, so you will see some slowdowns especially in I/O-related code.)
Yes, I forgot about this. Simon once wrote that locking decreases performance by a few percent compared to the single-threaded runtime.

--
Best regards,
Bulat  mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Brandon,
Saturday, April 19, 2008, 8:24:03 PM, you wrote:
contention. (Note that resource locking will be done by the threaded runtime even with only one thread, so you will see some slowdowns especially in I/O-related code.)
Yes, I forgot about this. Simon once wrote that locking decreases performance by a few percent compared to the single-threaded runtime.
...which is why I'm so damned surprised that the threaded RTS is *faster* than the monoprocessor RTS [which presumably lacks such locking overhead].

I can't offer definite answers to your questions, but I can suggest a few issues you should consider:

1. Merge sort doesn't parallelize all that well - when the blocks are small, the parallelization overhead is large in comparison with the productive work to be done, and when the blocks get large, the amount of parallelization possible is not great. Quicksort and quickersort, of course, suffer from the same issue. The end result is that your timings will be heavily dependent on your hardware, software, and the properties of the particular data set you use for testing.

2. You need to account for I/O buffering (not only by your operating system in RAM, but by your disk controller) - after the first set of I/O operations, your data may be in buffers, so subsequent uses may retrieve data from buffers rather than from the disk itself. Similarly, you also have to take into account paging and cache issues, which could make the first run much slower than immediate subsequent runs.

3. A better benchmark would be provided by a counting sort, which does parallelize well (O(n * (n/k)), where k is the number of processors and n is the number of elements to be sorted). A major advantage of using a counting sort for benchmarking is that it runs slowly enough to make it relatively easy to compare sequential and parallel timings.

4. Depending on your system defaults, there may also be memory allocation issues that need to be taken into account (which could also easily cause the first run to be considerably slower than subsequent runs made immediately after the first).

Murray Gross
Brooklyn College
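(A sketch of the sequential kernel of the counting sort Murray describes, assuming keys are known to lie in [0, k). The parallel version would have each core build a count array for its own chunk of the input and then sum the arrays element-wise, which is the part that scales:)

import Data.Array (accumArray, assocs)

-- O(n + k) sequentially: tally each key, then emit each key as many
-- times as it was seen. Only valid for keys in the range [0, k).
countingSort :: Int -> [Int] -> [Int]
countingSort k xs = concat [replicate c v | (v, c) <- assocs counts]
  where
    counts = accumArray (+) 0 (0, k - 1) [(x, 1) | x <- xs]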

On Apr 19, 2008, at 11:53 , Murray Gross wrote:
2. You need to account for I/O buffering (not only by your operating system in RAM, but by your disk controller) - after the first set of I/O operations, your data may be in buffers, so subsequent uses may retrieve data from buffers rather than from the disk itself. Similarly, you also have to take into account paging and cache issues, which could make the first run much slower than immediate subsequent runs.
Note also that, unless you use SCSI or very high-end SATA drives, they ignore requests to disable buffering. (References on request; you can probably find them by poking around http://www.pdl.cmu.edu/. Short summary: consumer drives are optimized for benchmarks, not for data safety. This is why early 32-bit Windows releases often lost data on shutdown, until the shutdown was modified to sleep for 10-15 seconds.)

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

On Apr 19, 2008, at 9:56 AM, Andrew Coppin wrote:
Weird thing #3: Adding the "-threaded" compiler option makes *everything* run a few percent faster. Even with only 1 OS thread.
I had a similar thing happen to me once. (http://geekrant.wordpress.com/2007/11/29/holy-shmoly-ghc-does-some-magic-all... ) It bothered me at the time, but as Simon Marlow said in the comments:
I wouldn’t believe these figures too much. I couldn’t repeat the same effects here, but since this is such a tiny fragment of code, small effects are magnified, and differences can sometimes appear and disappear depending on the day of the week.
- Jake McArthur

Okay, here are my thoughts: On Apr 19, 2008, at 9:56 AM, Andrew Coppin wrote:
Weird thing #1: The first time you sort the data, it takes a few seconds. The other 7 times, it takes a split second - roughly 100x faster. Wuh?
This looks like standard memoization to me. I know, I know, GHC doesn't automagically memoize… it still has some behaviors I personally would label as a sort of primitive memoization, and this is one of them. I learned about this by losing an argument. ;)
Weird thing #2: The parallel version runs *faster* than the sequential one in all cases - even with SMP disabled! (We're only talking a few percent faster, but still.)
Weird thing #3: Adding the "-threaded" compiler option makes *everything* run a few percent faster. Even with only 1 OS thread.
I think these are not noteworthy. Weird things happen in benchmarks (which is why I have learned not to trust them).
Weird thing #4: Adding "-N2" makes *everything* slow down a few percent. In particular, Task Manager shows only one CPU core in use.
Then your algorithm must not truly be parallel. That is the only explanation I can think of. - Jake McArthur

Andrew Coppin wrote:
The results of this little benchmark utterly defy comprehension. Allow me to enumerate:
Weird thing #1: The first time you sort the data, it takes a few seconds. The other 7 times, it takes a split second - roughly 100x faster. Wuh?
The test data was a CAF. I changed it to a regular function, and this behaviour vanished. Now *all* runs of a given test take approximately the same amount of time (to within a few ms anyway).
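(Presumably a change along these lines - a sketch; mkTest1m is a hypothetical name. As a top-level CAF, test1m is computed once and the result is retained for the life of the program, so runs 2..8 never pay for list generation again; GHC's full-laziness transformation can produce similar one-off sharing for expressions floated out of a loop:)

-- Before: a CAF - evaluated once, then kept alive and shared.
test1m = ("1M", take 1000000 random)

-- After: a function - the list is rebuilt on every call.
mkTest1m :: () -> (String, [Word32])
mkTest1m () = ("1M", take 1000000 random)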
Weird thing #2: The parallel version runs *faster* than the sequential one in all cases - even with SMP disabled! (We're only talking a few percent faster, but still.)
Now the parallel version is only faster with 1 worker thread. With more (even just 2) it becomes *slower* than the sequential version. Interestingly, it *does* now seem to be using more than 50% CPU. So it seems to actually be doing more work, just less efficiently. My first guess would be that it's creating data in a different order and thus stressing the GC more or something. Or maybe it's just that the algorithm sparks millions of really *tiny* items, which waste a lot of time. (It's a very simple implementation, so I was somewhat expecting this.) I'll try tuning further...
Weird thing #3: Adding the "-threaded" compiler option makes *everything* run a few percent faster. Even with only 1 OS thread.
Still true.
Weird thing #4: Adding "-N2" makes *everything* slow down a few percent. In particular, Task Manager shows only one CPU core in use.
Adding -N2 does still slow everything down, but not by very much. (Except the truely parallel stuff - that shows quite a big slowdown.) Task Manager does now show about 60% CPU usage instead of 50%. (I have exactly 2 physical cores.)
participants (7):

- Andrew Coppin
- Brandon S. Allbery KF8NH
- Bryan O'Sullivan
- Bulat Ziganshin
- Denis Bueno
- Jake McArthur
- Murray Gross