
On Feb 11, 2010, at 9:41 PM, Johann Höchtl wrote:
In a presentation of Guy Steele for ICFP 2009 in Edinburgh: http://www.vimeo.com/6624203 he "considers foldl and foldr" harmful as they hinder parallelism because of "Process first element, then the rest" Instead he proposes a divide and merge aproach, especially in the light of going parallel.
I think I first heard about that issue in the 80s. Let me just take an Erlang perspective on this for a few minutes. Ignoring types, the classic foldl in Erlang (which is strict) is foldl(F, A, [X|Xs]) -> foldl(F, F(X, A), Xs); foldl(_, A, []) -> A. In a strict language, you have to wait for F to finish before you can go on to the next element. In a non-strict language, you don't. The interesting question is how far F can get just given X, before it has to look at A. Suppose we can factor F(X, A) into G(H(X), A) where H is rather time-consuming, but G is cheap (maybe it's an add). So we write foldl_map(G, H, A, [X|Xs]) -> foldl_map(G, H, G(H(X), A), Xs); foldl_map(_, _, A, []) -> A. Now we can parallelise it. We'll spark off a separate thread for each call of H. pfoldl_map(G, H, A, Xs) -> Outer = self(), Pids = [spawn(fun () -> Outer ! {self(),H(X)} end || X <- Xs], foldl(G, A, [receive {Pid,Val} -> Val end | Pid <- Pids]). If N is the length of the list and G is O(1) we have O(N) time to traverse the list and O(N) time to collect and fold the results. The H calculations can take place in parallel. Provided each H calculation is expensive enough, we may not _care_ about the O(N) bits. In fact, if they aren't, we probably shouldn't be worrying about parallelism here. The problem exists when we *can't* factor F(X, A) into G(H(X), A) with a cheap G and costly H. Then divide-and-parallel-conquer using k-way trees gives us a computation depth of $\log_k N$ calls of G waiting for results to come back. As I say, I met this stuff in the 80s. It's hardly news. Indeed, if I remember correctly, back when Occam was hot stuff people were worried about PAR i = 0 for n ... which forks n processes. Doing that one at a time in the parent process takes O(n) time. I believe something was done about making this work by recursive bisection.