
On 8/22/07, Brandon Michael Moore
Automatic threading is inherently limited by data dependencies. You can't run a function that branches on an argument in parallel with the computation producing that argument. Even with arbitrarily many processors and no communication overhead there is a limit to how much parallelism you can squeeze from any given program.
Yes. Its worse than that in fact, because many real-world problems will involve functions that have side-effects, but we know the side-effects dont matter. The parallelisation algo of course doesnt know they dont matter (unless we tell it). Example: imagine we want to copy files from one machine to five others. Copying a file has a clear side-effect, but since we're copying to 5 independent machines, we can copy to each machine in parallel. The algo doesnt know that this is ok.
You should read "Feedback Directed Implicit Parallelism" http://research.microsoft.com/~tharris/papers/2007-fdip.pdf and perhaps "Limits to Implicit Parallelism in Functional Applications" http://www.detreville.org/papers/Limits.pdf
Ok
In short, with zero overhead and an oracle for scheduling you can get a factor of at most 2x to 32x by implicitly parallelizing existing Haskell code. In practice, with execution overhead it's a gain of perhaps 10% to 80% on a 4-core system.
This is a good argument that it's not enough to prototype on a 4 core system, but we really need some way to simulate a 1024 core system to carry out meaningful benchmarks.
You can do a lot better if you expect people to rewrite code, but "automatic threading" suggests something completely effortless.
Yes, I tend to overuse the word "automatic" ;-)
I think you can get much better results if you work on the programming style in connection with a better runtime. You can think of data parallel Haskell as a new programming style with more implicit parallelims, and the runtime support to exploit it.
Yes, you're right.
If you have cores to waste, you might try rewrites like
f x => case x of C1 a1 a2 -> f (C1 a1 a2) C2 b -> f (C2 b) C3 -> f C3
and then speculatively execute several of the case branches. If you don't throw away too much type information you should even be able to do it at runtime.
That's a good observation. Sortof anti-laziness :-D