Hi Clark,

The question of sequential loops in Accelerate has come up a few times in the past. The main sticking point is knowing how to implement them in a way that encourages efficient programs: avoiding irregular arrays (per-element iteration depths), discouraging scalar versions of the collective combinators, etc. Basically, we need to avoid thread divergence wherever possible, else the GPU's SIMD hardware won't be well utilised.

I've experimented a little with this. If you check the generated code of the mandelbrot program [1], you'll see it has recovered the _fixed_ iteration depth into a scalar loop.

I'll hack up something like a while loop for value iteration and see what happens. Tentative proposal (perhaps just the second):

  iterate :: Exp Int          -- fixed iteration count (maybe just a regular Int)
          -> (Exp a -> Exp a) -- function to repeatedly apply
          -> Exp a            -- initial (seed) value
          -> Exp a

  while :: (Exp a -> Exp Bool) -- loop continuation condition
        -> (Exp a -> Exp a)    -- function to repeatedly apply
        -> Exp a               -- initial (seed) value
        -> Exp a

It would be good if we could collect some additional concrete applications and see if there are better ways to map certain classes of sequential iteration to efficient GPU code.

Additional thoughts/comments welcome!

Cheers,
-Trev

On 12/12/2012, at 4:34 PM, Clark Gaebel <cgaebel@uwaterloo.ca> wrote:

Hi Trevor (and cafe),
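For reference, here is what the two proposed combinators would compute, written as ordinary Haskell on plain values. The names `iterateN` and `while` here are hypothetical stand-ins; the real Accelerate versions would operate on `Exp` terms and be compiled down to a scalar loop in the kernel:

```haskell
-- Plain-value reference semantics for the proposed combinators.
-- (Hypothetical names; not the Accelerate API.)

-- Apply f a fixed number of times to the seed value.
iterateN :: Int -> (a -> a) -> a -> a
iterateN n f x
  | n <= 0    = x
  | otherwise = iterateN (n - 1) f (f x)

-- Apply f while the predicate holds, returning the first
-- value for which it fails.
while :: (a -> Bool) -> (a -> a) -> a -> a
while p f x
  | p x       = while p f (f x)
  | otherwise = x
```

The fixed-count form is the easy case for the GPU (every thread runs the same number of steps, so no divergence); the predicate-driven form is where divergence can bite.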
I've been playing more and more with accelerate, and I find it quite annoying that there are no loops. It makes implementing many algorithms much harder than it should be.
For example, I would love to submit a patch to fix issue #52 [0] on GitHub by implementing MWC64X [1], but it's very hard to port the OpenCL code on that page when it's impossible to write kernel expressions with loops. It also means that none of the high-level combinators I'm used to from sequential code (such as map and fold) are available inside an accelerate CUDA kernel.
As a nice strawman example, how would one implement the following kernel in accelerate, assuming 'rand_next', 'rand_get', and 'rand_skip' can all be implemented cheaply?
#include <stdint.h>

typedef uint64_t rand_state;
__device__ rand_state rand_next(rand_state s);
__device__ uint32_t rand_get(rand_state s);
__device__ rand_state rand_skip(rand_state s, uint64_t distance);
__device__ uint32_t round_to_next_pow2(uint32_t n);
// Fills an array with random numbers given a random seed,
// a maximum random number to generate, and an output
// array to put the result in. The output will be in the range
// [0, rand_max).
__global__ void fill_random(rand_state start_state, uint32_t rand_max, uint32_t* out) {
    rand_state current_state = start_state;
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    // assumes we skip fewer than 1 million times per element...
    current_state = rand_skip(current_state, (uint64_t)i * 1000000ull);
    uint32_t mask = round_to_next_pow2(rand_max) - 1;
    uint32_t result;
    do {
        result = rand_get(current_state) & mask;
        current_state = rand_next(current_state);
    } while(result >= rand_max);
    out[i] = result;
} // note: code was neither debugged, run, nor compiled.
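For comparison, the rejection loop at the heart of that kernel can be modelled in plain Haskell. Here a simple 64-bit LCG (Knuth's MMIX constants) stands in for rand_next/rand_get, since the actual MWC64X definitions live on the linked page [1]; the recursive 'loop' is exactly the do/while that a 'while' combinator would have to express:

```haskell
import Data.Bits ((.&.), shiftR)
import Data.Word (Word32, Word64)

-- Stand-in generator: a 64-bit LCG, NOT MWC64X (see [1] for the real thing).
randNext :: Word64 -> Word64
randNext s = s * 6364136223846793005 + 1442695040888963407

-- Extract a 32-bit sample (the high bits of the LCG state).
randGet :: Word64 -> Word32
randGet s = fromIntegral (s `shiftR` 32)

roundToNextPow2 :: Word32 -> Word32
roundToNextPow2 n = go 1
  where go p | p >= n    = p
             | otherwise = go (p * 2)

-- Rejection sampling: mask the sample down to the next power of two,
-- and redraw until it lands in [0, randMax).
fillOne :: Word64 -> Word32 -> Word32
fillOne seed randMax = loop seed
  where
    mask = roundToNextPow2 randMax - 1
    loop s
      | r < randMax = r
      | otherwise   = loop (randNext s)
      where r = randGet s .&. mask
```

The iteration count here is data-dependent per thread, which is exactly the divergence problem: some threads redraw more often than their warp-mates.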
Thanks,
- Clark
[0] https://github.com/AccelerateHS/accelerate/issues/52
[1] http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe