Re: [Haskell-cafe] optimising for vector units

MR K P SCHUPKE
As far as I understand it, Haskell's lazy evaluation resembles dataflow computation, i.e. values are computed on demand.
This problem has already been solved in hardware; you end up with what is termed a superscalar architecture.
Instructions read from memory fill a re-order buffer. As computations complete, value slots in the re-order buffer are filled with the result values. When an instruction has all its values it can be executed, and is assigned to the next available execution unit.
I am not too sure of the internal architecture of GHC, but I can imagine that functions waiting to be run have slots for arguments that are filled in when values become available.
No. A function may not use all of its arguments, so there's no point in evaluating all arguments---you have to find out which ones are actually needed. Terminology note: GHC uses the STG language internally. In STG, evaluation is always triggered by a case expression, which evaluates precisely one expression (not necessarily an argument of anything, either!).
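A small sketch of why this matters (the function name is invented for illustration): eagerly evaluating every argument would force work, possibly infinite work, that the function never demands.

```haskell
-- A function that never touches its second argument.
constFirst :: Int -> Int -> Int
constFirst x _ = x

main :: IO ()
main = print (constFirst 42 (length [1 ..]))
-- Under lazy evaluation this prints 42: the non-terminating
-- computation in the second argument is never demanded.
-- Sparking a thread to evaluate it eagerly would spin forever.
```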
The whole point is that a function with three parameters can spark three threads
This is actually a bad idea: the point of non-strict languages and lazy evaluation is to allow lots of values which are (a) unused and (b) expensive to evaluate. If you evaluate everything (even in parallel), there's a lot of wasted work if the programmer is making optimal use of the non-strictness of the language. Furthermore, if de-forestation fails anywhere, you've likely got another problem. Consider the following program:
n `choose` m = product $ take m $ [n, n-1..]
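For concreteness, the definition above runs as-is; for example `5 `choose` 2` takes the first two elements of [5,4..] and multiplies them:

```haskell
choose :: Integer -> Int -> Integer
n `choose` m = product $ take m [n, n-1 ..]

main :: IO ()
main = print (5 `choose` 2)  -- take 2 [5,4..] = [5,4]; product = 20
```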
Which translates (after some optimisation) to STG equivalent to:
downFrom n = let n1 = (-) n 1 in
             let xn1 = downFrom n1 in
             let xn2 = n : xn1 in
             xn2

choose n m = let xn1 = downFrom n in
             let xn2 = take m xn1 in
             product xn2
Now, (:) cannot `spark' a thread to evaluate xn1 in downFrom, because that would end up sparking infinitely many threads---which may be possible, but nothing you've said suggests that it is.
to calculate each parameter without worrying about the side effects of each function. No locking is necessary because you only spark one thread for each argument.
This is true for tree reduction, but not for graph reduction. Consider, e.g., the fact that the call to product in choose and the recursive call to downFrom walk down (and hence evaluate) the same list. Since the cons cells making up this list are evaluated by two different functions, some sort of locking is indeed necessary to ensure that you don't end up with two threads for each of the (infinitely many) cons cells in the list.
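The sharing described here can be made visible with Debug.Trace (a sketch; the trace messages go to stderr): two folds traverse the same list, yet each cell is forced exactly once, which is precisely the shared state a parallel evaluator would have to protect.

```haskell
import Debug.Trace (trace)

-- A shared top-level list whose elements announce (on stderr)
-- when they are forced.
xs :: [Int]
xs = map (\i -> trace ("forcing " ++ show i) i) [1, 2, 4]

main :: IO ()
main = do
  print (sum xs)      -- first traversal forces the cells: traces fire here
  print (product xs)  -- same cells, already evaluated: no new traces
```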
I suspect I haven't fully understood the difficulty in doing this, care to enlighten me?
Certainly. See above.

Jon Cast

Just wanted to explore a few points you made: Jon Cast wrote:
No. A function may not use all of its arguments, so there's no point in evaluating all arguments---you have to find out which ones are actually needed. Terminology note: GHC uses the STG language internally. In STG, evaluation is always triggered by a case expression, which evaluates precisely one expression (not necessarily an argument of anything, either!)
If a function never uses an argument, you don't need to evaluate it... remember in dataflow we start from the end of the program working backwards, so we know all future demands on function arguments...
Now, (:) cannot `spark' a thread to evaluate xn1 in downFrom, because that would end up sparking infinitely many threads---which may be possible, but nothing you've said suggests that it is.
Again, think of lazy lists in their 'stream' interpretation. We model them as a FIFO queue and can set high and low water marks in the FIFO. We can use hysteresis to ensure work gets done in reasonably large chunks. I guess my description didn't include all the details of the model I had in mind... you obviously don't want to unroll all recursive functions.

I was thinking along the lines of expressions like

    let a = c*d + e*f

where you can obviously execute the two multiplies in parallel (the multiplies being the arguments to the (+) function). Also, it is obvious you don't need a new thread for one of the arguments (a function of one argument would use the same thread)...

In the general recursive case:

    fna x y z = fna (x+1) (y-1) (z*3)

again we can do the three simple arithmetic operations in parallel, but have to wait for all of them to finish before the call to (fna) can be made. So there isn't an infinite expansion of threads: there is one thread that does the recursion and one of the arithmetic operations, and two that are forked and reaped each iteration.
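The fork-two-and-wait-for-both pattern described here can be sketched with plain forkIO and MVars from base (GHC's actual spark mechanism is `par` from the parallel package, so this is only an illustration; `parAdd` is an invented name):

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)

-- Sketch: evaluate c*d and e*f in two forked threads, then combine,
-- as in  let a = c*d + e*f.  ($!) forces the multiply in the child
-- thread rather than handing back an unevaluated thunk.
parAdd :: Int -> Int -> Int -> Int -> IO Int
parAdd c d e f = do
  m1 <- newEmptyMVar
  m2 <- newEmptyMVar
  _ <- forkIO (putMVar m1 $! c * d)
  _ <- forkIO (putMVar m2 $! e * f)
  x <- takeMVar m1        -- wait for both arguments, as in the text
  y <- takeMVar m2
  return (x + y)

main :: IO ()
main = parAdd 2 3 4 5 >>= print  -- 2*3 + 4*5 = 26
```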
This is true for tree reduction, but not for graph reduction. Consider, e.g., the fact that the call to product in choose and the recursive call to downFrom walk down (and hence evaluate) the same list. Since the cons cells making up this list are evaluated by two different functions, some sort of locking is indeed necessary to ensure that you don't end up with two threads for each of the (infinitely many) cons cells in the list.
I suspect I haven't fully understood the difficulty in doing this, care to enlighten me?
The list is read only (this is a declarative language); no locking is necessary on read-only data structures... though I guess in this case I can see where locking would be necessary...

A final point someone made about the cost of starting threads... Surely the obvious approach is to start one OS thread per execution unit, and do all the thread starting with the very lightweight Haskell threads...

Keean
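This "many lightweight threads on a few OS threads" model is in fact what GHC's runtime provides; a sketch (the counter scheme here is just for the demonstration) showing that forking thousands of Haskell green threads is cheap:

```haskell
import Control.Concurrent (forkIO, newMVar, newEmptyMVar,
                           modifyMVar, putMVar, takeMVar, readMVar)
import Control.Monad (forM_, when)

main :: IO ()
main = do
  let n = 10000 :: Int
  count    <- newMVar 0
  finished <- newEmptyMVar
  -- Fork n lightweight Haskell threads; the RTS multiplexes them
  -- onto one OS thread per execution unit (capability).
  forM_ [1 .. n] $ \_ -> forkIO $ do
    c <- modifyMVar count (\k -> let k' = k + 1 in return (k', k'))
    when (c == n) (putMVar finished ())
  takeMVar finished         -- block until the last thread has run
  readMVar count >>= print  -- 10000
```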

MR K P SCHUPKE wrote:
A final point someone made about the cost of starting threads... Surely the obvious approach is to start one OS thread per execution unit, and do all the thread starting with the very lightweight haskell threads...
That was me. I think you're underestimating the cost of starting threads even in this very lightweight world. Again, tens of instructions between startup, shutdown, and (very important) synchronization to make sure that other threads see the results which were produced. It *can* be done without locks, but it often can't be done without memory fences of some sort. If you actually have to start OS threads, the cost is thousands or tens of thousands of cycles. Even with very lightweight thread creation, you probably don't want to do a multiplication in a separate thread---unless it's, say, an Integer multiply with a minimum of a thousand digits or so. -Jan-Willem Maessen
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (3)
- Jan-Willem Maessen - Sun Labs East
- Jon Cast
- MR K P SCHUPKE