
On 20.01.2016 at 13:16, Serguey Zefirov wrote:
You are unnecessarily, overly pessimistic; let me show you some things you probably have not thought of or heard about.
Okaaaay...
A demonstration from the industry, albeit not quite hardware industry:
http://www.disneyanimation.com/technology/innovations/hyperion - "Hyperion handles several million light rays at a time by sorting and bundling them together according to their directions. When the rays are grouped in this way, many of the rays in a bundle hit the same object in the same region of space. This similarity of ray hits then allows us – and the computer – to optimize the calculations for the objects hit."
Sure. If you have a few gazillion identical computations, you can parallelize over them. That's the reason why 3D cards even took off: the graphics pipeline grew processing capabilities and evolved into the (rather restricted) GPU core model. So it's not necessarily impossible to build something useful, merely very unlikely.
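As an aside, the sort-and-bundle step Disney describes can be sketched in a few lines. Ray, dirCell, and payload here are made-up names, and a real renderer would quantize directions far more carefully; this only shows the grouping idea:

```haskell
import Data.List (sortOn, groupBy)
import Data.Function (on)

-- Hypothetical ray record: the direction is quantized into a coarse
-- integer cell; the payload stands in for the rest of the ray state.
data Ray = Ray { dirCell :: Int, payload :: Double } deriving Show

-- Sort rays by direction cell, then group them, so that each bundle
-- tends to hit the same objects and can share scene data in cache.
bundle :: [Ray] -> [[Ray]]
bundle = groupBy ((==) `on` dirCell) . sortOn dirCell
```

Rays in the same bundle can then be shaded together, which is where the similarity pays off.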
Then, let me bring up an old idea of mine: https://mail.haskell.org/pipermail/haskell-cafe/2009-August/065327.html
Basically, we can group identical closures into vectors, ready for SIMD instructions to operate over them. The "vectors" should work just like Data.Vector.Unboxed - instead of a vector of tuples of arguments, there should be a tuple of vectors with the individual arguments (and result slots to update for lazy evaluation).
Combine this with sorting of addresses in the case of references, and you can get a lot of speedup by doing... not much.
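A minimal, base-only sketch of that structure-of-arrays layout; Batch, evalBatch, and the field names are illustrative only, and a real implementation would use Data.Vector.Unboxed instead of lists so the compiler has a chance to emit SIMD code:

```haskell
-- Structure-of-arrays sketch: a batch of suspended applications of one
-- known function is stored as parallel argument lists, not as a list
-- of closures.  (Swap the lists for Data.Vector.Unboxed vectors to get
-- the unboxed, SIMD-friendly representation described above.)
data Batch = Batch { arg1 :: [Double], arg2 :: [Double] }

-- Evaluate the whole batch in one pass over the argument arrays.
evalBatch :: (Double -> Double -> Double) -> Batch -> [Double]
evalBatch f (Batch a b) = zipWith f a b
```

With all arguments of one kind laid out contiguously, the evaluation loop touches memory sequentially, which is exactly the access pattern vector units want.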
Heh. Such stuff could work - *provided* that you can really make the case that there is enough similar work.

Still, I'd build a model of that on GPGPU hardware first. Two advantages: 1) No hardware investment. 2) You can see what the low-hanging fruit are and get a rough first idea of how much parallelization really buys you.

The other approach: see what you can get out of a Xeon with really many cores (14, or even more). Compare the single-GPGPU vs. multi-GPGPU speedup with the single-CPU-core vs. multi-CPU-core speedup. That might provide insight into how much the interconnects and cache coherence protocols interfere with the multicore speedup.

Why am I so focused on multicore? Because that's where hardware is going: clocks aren't going to get much higher, but people will still want better performance. Actually, I think single-core improvements aren't going to be very important. First on my list would be exploiting multicore, second cache locality. There's more to be gained from those than from specialized hardware, IMVHO.

Regards,
Jo