
Hugh Perkins wrote:
- parallelism must be quite coarse to offset overheads (which I think is the problem with expecting things like map and fold to be parallelised automagically; they're just too fine-grained for it to be worthwhile)
Someone else said that. I don't understand what you mean.
If you do 1+2+3+4, it's not very efficient to spawn a new thread, have that compute 1+2, compute 3+4 in the current thread, synchronise, and do the final addition. You waste more time starting and stopping threads than doing useful work. If you're going to spawn a new thread, it needs to do "lots" of work for it to be worth it. And this is kinda what makes this stuff so tricky - between laziness and the Halting Problem, it's hard to work out what is "worth" sparking a thread for...
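To make the overhead concrete, here's a base-only sketch of the 1+2+3+4 example, using forkIO and an MVar rather than GHC's actual spark machinery (which is cheaper, but the shape of the problem is the same): the fork and the synchronisation dwarf the three additions they coordinate.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Computing 1+2+3+4 "in parallel": the fork, the MVar allocation and
-- the blocking takeMVar each cost far more than the three additions
-- they exist to coordinate.
parSum :: IO Int
parSum = do
  box <- newEmptyMVar
  _ <- forkIO (putMVar box (1 + 2))   -- spawn a thread for one addition
  let right = 3 + 4                   -- do the other half here
  left <- takeMVar box                -- synchronise
  return (left + right)               -- final addition

main :: IO ()
main = parSum >>= print   -- prints 10
```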
Imagine I have a map like this:
map f [1..100000]
... and I have 64 cores..
So I divide the 100000 elements in my list by 64, and give each thread a chunk of 1500-ish elements. Easy, I think?
...you realise that that's a linked-list you have there, right? ;-) Now, if you had an *array*, and you know that "f" does a constant amount of work for all elements, and that all array elements will eventually be "needed"...
Note that NDP is doing this too, *but NDP is trying to do this for asymmetric elements at compile-time, which is generally impossible*. I propose doing it at runtime, inside the VM, which is trivial and optimal.
Parallel and trivial in the same sentence? Hmm, I wonder... how did this problem not get solved 20 years ago? Oh, wait, maybe because it's harder than it looks. ;-) It certainly looks *easier* to partition the work at run-time than at compile-time. But trivial? I doubt it. (For a start, any autosensing magic you add mustn't slow the program down by more than the parallelism it's supposed to be generating speeds it up!)
It's trivial to improve on NDP at runtime. For example:
- we have a map of 64 elements, so we give one element to each thread
- all of the pieces but one finish really quickly (say, within 1 second)
- so we start to look at what is happening within the remaining piece
- it turns out it contains another map, so we split that map amongst our idle threads
- and so on
Easy, flexible.
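A toy model of that idea, under heavy simplifying assumptions (a shared stack of list chunks, a fixed "still big" threshold, four worker threads; none of these choices is how NDP or any real VM actually works): workers pull chunks, and a chunk that still looks big is split and pushed back for idle workers instead of being computed whole.

```haskell
import Control.Concurrent
import Control.Monad (replicateM_)

-- Toy runtime work-splitting: the split decision is made while the
-- program runs, by whichever worker holds the chunk, not at compile
-- time. Threshold and worker count are arbitrary illustrative values.
main :: IO ()
main = do
  queue   <- newMVar [[1..10000]]     -- shared stack of pending chunks
  pending <- newMVar (1 :: Int)       -- chunks created but not finished
  total   <- newMVar (0 :: Integer)   -- accumulated result
  done    <- newEmptyMVar

  let worker = do
        task <- modifyMVar queue $ \q -> case q of
                  []     -> return ([], Nothing)
                  (t:ts) -> return (ts, Just t)
        case task of
          Nothing -> do                       -- queue empty: done, or wait
            n <- readMVar pending
            if n == 0 then return () else yield >> worker
          Just chunk
            | length chunk > 1000 -> do       -- still big: split, don't compute
                let (a, b) = splitAt (length chunk `div` 2) chunk
                modifyMVar_ pending (return . (+1))      -- one task became two
                modifyMVar_ queue (return . (a:) . (b:))
                worker
            | otherwise -> do                 -- small enough: just compute
                modifyMVar_ total (return . (+ sum (map fromIntegral chunk)))
                modifyMVar_ pending (return . subtract 1)
                worker

  replicateM_ 4 (forkIO (worker >> putMVar done ()))
  replicateM_ 4 (takeMVar done)
  readMVar total >>= print   -- prints 50005000 (== sum [1..10000])
```

Even this toy version shows where the overhead hides: every split and every steal goes through shared synchronised state, which is exactly the "monitor" cost mentioned below.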
Seems to require running some sort of "monitor" process which can somehow "see" what operation(s) a thread is trying to do, so it can split up the work and share it around. That's going to add overhead. Now that's OK so long as the gains outweigh the overhead...
- lazy evaluation makes it harder
Not at runtime ;-)
O RLY? The problem with things being lazy is that you don't know whether some data is "needed" until the moment you actually come to need it. OTOH, ideally you'd want to know way in advance, so that you can compute it in parallel and it will already be "ready" when you come to use it. But with laziness, you can't [easily] do that. Doing the testing at run-time isn't helping you here...
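A one-line illustration of that point (f here is just a stand-in): the program below applies f to exactly one element, so eagerly farming the whole map out to 64 cores would speculate 99999 results nobody asked for; and if the list were infinite, the eager version would never finish at all.

```haskell
-- Laziness means "needed" is only discovered at the last moment:
-- head forces ONE application of f, not 100000. A runtime scheme that
-- eagerly parallelises the whole map has no way to know that without
-- predicting the future of the consumer.
f :: Int -> Int
f x = x * x

main :: IO ()
main = print (head (map f [1..100000]))   -- prints 1; evaluates only f 1
```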
Basically, it's harder than it sounds.
No ;-) The more I talk about this, the more confident I feel that this is an easy, optimal solution.
How does the saying go? "Anything is easy for the man who does not have to implement it..."