
Hi Trevor (and cafe),

I've been playing more and more with accelerate, and I find it quite
annoying that there are no loops. This makes implementing many
algorithms much harder than it should be. For example, I would love to
submit a patch to fix issue #52 [0] on github by implementing MWC64X
[1], but it's very hard to port the OpenCL code on that page when it's
impossible to write kernel expressions with loops. It also means that
none of the high-level combinators I'm used to from my sequential code
(such as map and fold) are available inside an accelerate CUDA kernel.

As a nice strawman example, how would one implement the following
kernel in accelerate, assuming 'rand_next', 'rand_get', and
'rand_skip' can all be implemented cheaply?

typedef uint64_t rand_state;

__device__ rand_state rand_next(rand_state s);
__device__ uint32_t   rand_get(rand_state s);
__device__ rand_state rand_skip(rand_state s, uint64_t distance);
__device__ uint32_t   round_to_next_pow2(uint32_t n);

// Fills an array with random numbers, given a starting state,
// a maximum random number to generate, and an output array to
// put the result in. The output will be in the range [0, rand_max).
__global__ void fill_random(rand_state start_state,
                            uint32_t rand_max,
                            uint32_t* out) {
    int i = blockDim.x*blockIdx.x + threadIdx.x;

    // assumes each element draws fewer than 1 million numbers...
    rand_state current_state =
        rand_skip(start_state, (uint64_t)i * 1000000);

    // Rejection sampling: mask each raw draw down to the next
    // power of two, and retry until it falls below rand_max.
    uint32_t mask = round_to_next_pow2(rand_max) - 1;
    uint32_t result;
    do {
        result = rand_get(current_state) & mask;
        current_state = rand_next(current_state);
    } while (result >= rand_max);

    out[i] = result;
}

// note: code was neither debugged, run, nor compiled.

Thanks,
  - Clark

[0] https://github.com/AccelerateHS/accelerate/issues/52
[1] http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
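P.S. To make the MWC64X reference concrete, here is roughly what the
helpers above might look like, ported from the OpenCL code in [1] to
CUDA. Packing the MWC value x into the low word and the carry c into
the high word of the uint64_t state is my own choice, not something
the page prescribes; the multiplier A and the x ^ c output come from
[1]. rand_skip (skip-ahead via modular exponentiation of A, also given
on that page) is omitted. Like the kernel above, this is a sketch:
neither debugged, run, nor compiled.

#define MWC64X_A 4294883355ULL

// One step of the MWC64X generator: the low 32 bits of the state
// hold the current value x, the high 32 bits hold the carry c, and
// the full 64-bit result of A*x + c is exactly the new (x, c) pair.
__device__ rand_state rand_next(rand_state s) {
    uint32_t x = (uint32_t)s;
    uint32_t c = (uint32_t)(s >> 32);
    return MWC64X_A * x + c;
}

// The MWC64X output function: x xor c.
__device__ uint32_t rand_get(rand_state s) {
    return (uint32_t)s ^ (uint32_t)(s >> 32);
}

// Standard bit-smearing trick: round n up to the next power of two
// (returns n unchanged when it is already a power of two).
__device__ uint32_t round_to_next_pow2(uint32_t n) {
    n -= 1;
    n |= n >> 1;  n |= n >> 2;  n |= n >> 4;
    n |= n >> 8;  n |= n >> 16;
    return n + 1;
}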