While working on the Shootout, I noticed the following benchmarks:
The same program becomes almost 4 times slower when compiled with -threaded and run with +RTS -N5, even though the multi-core benchmark really only ever uses one processor.
Other languages seem to have found a way to arrange these threads so that parallelism actually happens. As it stands, though, compiling this benchmark without -threaded makes Haskell competitive with the genuinely parallel entries in other languages, which is unusual in itself.
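For anyone who wants to poke at this without the benchmark source at hand, here is a minimal sketch of the kind of program I have in mind. It is an assumption on my part that a token-passing ring is representative: the names (worker, runRing) and the ring size and hop count are made up for illustration and are not taken from the Shootout entry. Only one thread is ever runnable at a time, so there is no real parallelism for extra cores to exploit.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM)

-- Hypothetical worker: wait on the inbox, decrement the token, pass it on.
-- When the token reaches 0, report this worker's index and stop.
worker :: MVar Int -> Int -> MVar Int -> MVar Int -> IO ()
worker done ix inbox outbox = loop
  where
    loop = do
      n <- takeMVar inbox
      if n == 0
        then putMVar done ix
        else putMVar outbox (n - 1) >> loop

-- Build a ring of `size` threads (size >= 1), inject a token worth `hops`,
-- and return the index of the thread that receives the token at 0.
runRing :: Int -> Int -> IO Int
runRing size hops = do
  first <- newEmptyMVar
  rest  <- replicateM (size - 1) newEmptyMVar
  done  <- newEmptyMVar
  let boxes = first : rest
      outs  = rest ++ [first]  -- each thread's outbox is the next inbox
  forM_ (zip3 [1 ..] boxes outs) $ \(ix, inbox, outbox) ->
    forkIO (worker done ix inbox outbox)
  putMVar first hops
  takeMVar done

main :: IO ()
main = runRing 503 100000 >>= print
```

To compare the two configurations, build it once with plain ghc -O2 and once with ghc -O2 -threaded, then run the threaded binary with +RTS -N5. Recent GHC versions also need -rtsopts at link time before the binary will accept -N. I have not measured this sketch, so treat it as a starting point for discussion rather than a reproduction of the numbers above.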
I wanted to throw this out for discussion, because I'd like to see it improved. In the meantime, I'm going to submit a version that asks not to be compiled with -threaded (and has a few other improvements).