Hi Clark,

The question of sequential loops in Accelerate has come up a few times in the past. The main sticking point is knowing how to implement them in a way that encourages efficient programs: avoiding irregular arrays (per-element iteration depths), discouraging scalar versions of the collective combinators, etc. Basically, we need to avoid thread divergence wherever possible, else the GPU's SIMD hardware won't be well utilised.

I've experimented a little with this. If you check the generated code of the mandelbrot program [1], you'll see it has recovered the _fixed_ iteration depth into a scalar loop.

I'll hack up something like a while loop for value iteration and see what happens. Tentative proposal (perhaps just the second):

  iterate :: Exp Int          -- fixed iteration count (maybe just a regular Int)
          -> (Exp a -> Exp a) -- function to repeatedly apply
          -> Exp a            -- initial (seed) value
          -> Exp a

  while :: (Exp a -> Exp Bool) -- loop continuation condition
        -> (Exp a -> Exp a)    -- function to repeatedly apply
        -> Exp a               -- initial (seed) value
        -> Exp a

It would be good if we could collect some additional concrete applications and see if there are better ways to map certain classes of sequential iteration to efficient GPU code.

Additional thoughts/comments welcome!

Cheers,
-Trev

On 12/12/2012, at 4:34 PM, Clark Gaebel <cgaebel@uwaterloo.ca> wrote:

Hi Trevor (and cafe),
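For reference, here is what the two proposed combinators would compute, written as ordinary Haskell on plain values. The names `iterateN` and `while` here are hypothetical stand-ins; the real Accelerate versions would operate on `Exp` terms and be compiled down to a scalar loop in the kernel:

```haskell
-- Plain-value reference semantics for the proposed combinators.
-- (Hypothetical names; not the Accelerate API.)

-- Apply f a fixed number of times to the seed value.
iterateN :: Int -> (a -> a) -> a -> a
iterateN n f x
  | n <= 0    = x
  | otherwise = iterateN (n - 1) f (f x)

-- Apply f while the predicate holds, returning the first
-- value for which it fails.
while :: (a -> Bool) -> (a -> a) -> a -> a
while p f x
  | p x       = while p f (f x)
  | otherwise = x
```

The fixed-count form is the easy case for the GPU (every thread runs the same number of steps, so no divergence); the predicate-driven form is where divergence can bite.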
I've been playing more and more with accelerate, and I find it quite annoying that there are no loops. It makes implementing many algorithms much harder than it should be.
For example, I would love to submit a patch to fix issue #52 [0] on GitHub by implementing MWC64X [1], but it's very hard to port the OpenCL code on that page when it's impossible to write kernel expressions with loops. It also means that none of the high-level combinators I'm used to from sequential code (such as map and fold) are available inside an accelerate CUDA kernel.
As a nice strawman example, how would one implement the following kernel in accelerate, assuming 'rand_next', 'rand_get', and 'rand_skip' can all be implemented cheaply?
#include <stdint.h>

typedef uint64_t rand_state;
__device__ rand_state rand_next(rand_state s);
__device__ uint32_t rand_get(rand_state s);
__device__ rand_state rand_skip(rand_state s, uint64_t distance);
__device__ uint32_t round_to_next_pow2(uint32_t n);
// Fills an array with random numbers given a random seed,
// a maximum random number to generate, and an output
// array to put the result in. The output will be in the range
// [0, rand_max).
__global__ void fill_random(rand_state start_state, uint32_t rand_max, uint32_t* out) {
    rand_state current_state = start_state;
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    // assumes we skip fewer than 1 million times per element...
    current_state = rand_skip(current_state, (uint64_t)i * 1000000ull);
    uint32_t mask = round_to_next_pow2(rand_max) - 1;
    uint32_t result;
    do {
        result = rand_get(current_state) & mask;
        current_state = rand_next(current_state);
    } while(result >= rand_max);
    out[i] = result;
} // note: code was neither debugged, run, nor compiled.
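For comparison, the rejection loop at the heart of that kernel can be modelled in plain Haskell. Here a simple 64-bit LCG (Knuth's MMIX constants) stands in for rand_next/rand_get, since the actual MWC64X definitions live on the linked page [1]; the recursive 'loop' is exactly the do/while that a 'while' combinator would have to express:

```haskell
import Data.Bits ((.&.), shiftR)
import Data.Word (Word32, Word64)

-- Stand-in generator: a 64-bit LCG, NOT MWC64X (see [1] for the real thing).
randNext :: Word64 -> Word64
randNext s = s * 6364136223846793005 + 1442695040888963407

-- Extract a 32-bit sample (the high bits of the LCG state).
randGet :: Word64 -> Word32
randGet s = fromIntegral (s `shiftR` 32)

roundToNextPow2 :: Word32 -> Word32
roundToNextPow2 n = go 1
  where go p | p >= n    = p
             | otherwise = go (p * 2)

-- Rejection sampling: mask the sample down to the next power of two,
-- and redraw until it lands in [0, randMax).
fillOne :: Word64 -> Word32 -> Word32
fillOne seed randMax = loop seed
  where
    mask = roundToNextPow2 randMax - 1
    loop s
      | r < randMax = r
      | otherwise   = loop (randNext s)
      where r = randGet s .&. mask
```

The iteration count here is data-dependent per thread, which is exactly the divergence problem: some threads redraw more often than their warp-mates.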
Thanks,
- Clark
[0] https://github.com/AccelerateHS/accelerate/issues/52
[1] http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe