Hello,

DPH seems to build parallel vectors at the level of scalar elements (doubles, say).  Is this a design decision aimed at targeting GPUs?  If I am filtering an hour's worth of multichannel data (an array of (Vector Double)), then off the top of my head I would think that optimal efficiency would be achieved on n CPU cores with each core filtering one channel, rather than trying to do anything fancy with processing the vectors themselves in parallel.
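
To make that concrete, this is roughly what I have in mind: plain per-channel parallelism with Control.Parallel.Strategies, one spark per channel.  (Just a sketch -- filterChannel below is a stand-in moving average, not my real filter.)

import Control.Parallel.Strategies (parMap, rdeepseq)
import qualified Data.Vector.Unboxed as V

-- Stand-in per-channel filter: a 5-point moving average.
filterChannel :: V.Vector Double -> V.Vector Double
filterChannel xs = V.generate n smooth
  where
    n = V.length xs
    smooth i = V.sum (V.slice lo (hi - lo + 1) xs) / fromIntegral (hi - lo + 1)
      where lo = max 0 (i - 2)
            hi = min (n - 1) (i + 2)

-- One spark per channel; the runtime spreads the channels over the cores.
filterAll :: [V.Vector Double] -> [V.Vector Double]
filterAll = parMap rdeepseq filterChannel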

I say this because filters (which could be assembled from arrow structures) feed back across (regions of) a vector.  Do GPUs have some sort of shift-operation optimisation?  In other words, if I have a (constant) matrix A, my filter, and a data stream x where x_i(t+1) = x_{i-1}(t), can a GPU perform Ax in O(length(x))?
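
On the CPU side at least, I think the shift structure already answers half of this: A is banded (Toeplitz), so Ax is a sliding dot product and linear in length(x) -- O(k * n) for k taps rather than a dense O(n^2) multiply.  A sketch of what I mean, taking the simpler FIR (no feedback) case; fir and taps are just my names:

import qualified Data.Vector.Unboxed as V

-- If A is the banded (Toeplitz) matrix of a k-tap FIR filter, then Ax is
-- a sliding dot product: O(k * length x) instead of a dense O(n^2) multiply.
fir :: V.Vector Double   -- filter taps, i.e. the non-zero band of A
    -> V.Vector Double   -- input stream x
    -> V.Vector Double
fir taps xs = V.generate (V.length xs) row
  where
    row i = V.sum (V.imap (\j a -> a * sample (i - j)) taps)
    sample j | j < 0     = 0        -- zero-pad before the start of the stream
             | otherwise = xs V.! j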

Otherwise, given the cost of moving data to and from the GPU, I would guess that one sequential algorithm per core (Concurrent Haskell) is faster, and that there is a granularity barrier.
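
By "one sequential algorithm per core" I mean something like the following (compiled with -threaded and run with +RTS -N; filterConcurrently is just an illustrative name):

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import qualified Data.Vector.Unboxed as V

-- One explicit thread per channel, each running an ordinary sequential filter.
filterConcurrently :: (V.Vector Double -> V.Vector Double)
                   -> [V.Vector Double]
                   -> IO [V.Vector Double]
filterConcurrently f channels = do
    boxes <- mapM spawn channels
    mapM takeMVar boxes
  where
    spawn ch = do
      box <- newEmptyMVar
      _   <- forkIO (putMVar box $! f ch)
      return box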

Cheers,

Vivian