
Hi Trevor (and cafe),

I've been playing more and more with accelerate, and I find it quite
annoying that there are no loops. This makes implementing many
algorithms much harder than it should be. For example, I would love to
submit a patch to fix issue #52 [0] on github by implementing MWC64X
[1], but it's very hard to port the OpenCL code on that page when it's
impossible to write kernel expressions with loops. It also means that
none of the high-level combinators I'm used to from my sequential code
(such as map and fold) are available inside an accelerate CUDA kernel.

As a nice strawman example, how would one implement the following
kernel in accelerate, assuming 'rand_next', 'rand_get', and
'rand_skip' can all be implemented cheaply?

typedef uint64_t rand_state;

__device__ rand_state rand_next(rand_state s);
__device__ uint32_t   rand_get(rand_state s);
__device__ rand_state rand_skip(rand_state s, uint64_t distance);
__device__ uint32_t   round_to_next_pow2(uint32_t n);

// Fills an array with random numbers, given a starting state,
// a maximum random number to generate, and an output array to
// put the result in. The output will be in the range [0, rand_max).
__global__ void fill_random(rand_state start_state,
                            uint32_t rand_max,
                            uint32_t* out) {
    int i = blockDim.x*blockIdx.x + threadIdx.x;

    // assumes each element draws fewer than 1 million numbers...
    rand_state current_state =
        rand_skip(start_state, (uint64_t)i * 1000000);

    // Rejection sampling: mask each raw draw down to the next
    // power of two, and retry until it falls below rand_max.
    uint32_t mask = round_to_next_pow2(rand_max) - 1;
    uint32_t result;
    do {
        result = rand_get(current_state) & mask;
        current_state = rand_next(current_state);
    } while (result >= rand_max);

    out[i] = result;
}

// note: code was neither debugged, run, nor compiled.

Thanks,
  - Clark

[0] https://github.com/AccelerateHS/accelerate/issues/52
[1] http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
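P.S. To make the MWC64X reference concrete, here is roughly what the
helpers above might look like, ported from the OpenCL code in [1] to
CUDA. Packing the MWC value x into the low word and the carry c into
the high word of the uint64_t state is my own choice, not something
the page prescribes; the multiplier A and the x ^ c output come from
[1]. rand_skip (skip-ahead via modular exponentiation of A, also given
on that page) is omitted. Like the kernel above, this is a sketch:
neither debugged, run, nor compiled.

#define MWC64X_A 4294883355ULL

// One step of the MWC64X generator: the low 32 bits of the state
// hold the current value x, the high 32 bits hold the carry c, and
// the full 64-bit result of A*x + c is exactly the new (x, c) pair.
__device__ rand_state rand_next(rand_state s) {
    uint32_t x = (uint32_t)s;
    uint32_t c = (uint32_t)(s >> 32);
    return MWC64X_A * x + c;
}

// The MWC64X output function: x xor c.
__device__ uint32_t rand_get(rand_state s) {
    return (uint32_t)s ^ (uint32_t)(s >> 32);
}

// Standard bit-smearing trick: round n up to the next power of two
// (returns n unchanged when it is already a power of two).
__device__ uint32_t round_to_next_pow2(uint32_t n) {
    n -= 1;
    n |= n >> 1;  n |= n >> 2;  n |= n >> 4;
    n |= n >> 8;  n |= n >> 16;
    return n + 1;
}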