
Albert Y. C. Lai wrote:
If you worry that the sum thread and the length thread are not synchronized and therefore there is still no bound on the list prefix kept in memory, I'm sure you can improve it by one of the chunking strategies.
I'm more worried that data now has to go through tiny CPU <--> RAM busses rather than zipping along at the CPU's core frequency as it would if only 1 CPU were involved. I would think the cache issues make the sequential version a fair bit faster. (Not to mention that no GC is required.)
As our hardware grows more cores but not more gigahertz, it may become detrimal to fuse. Fusion implies entangling, an anti-thesis to parallelism. One day we may call this an optimisation: unfuse code hand-fused by programmers, then parallelize. Optimisation people on the imperative side are already doing this (they have more hand-fused code than we do), though targetting at older hardware like vector machines and SIMD, not yet multiple cores.
The key to parallel processing is splitting a task into *independent* tasks so you can do them in parallel. If they're still interdependent [like the above], I suspect you're not gaining much.