
if you want to understand writing matrix algorithms for the GPU,
http://icl.cs.utk.edu/magma/pubs/index.html has lots of reading resources.
The cost model for memory and computation on a GPU is different from the
CPU cost model. And http://icl.cs.utk.edu/magma/software/ has a bunch of
high-performance GPU kernels for a variety of architectures.
accelerate is good to look at, at minimum because it manages to focus on a
fragment of functional array programs that are moderately easy to optimize
for the GPU.
On Mon, Mar 16, 2015 at 11:27 PM, Anatoly Yakovenko wrote:
you mean by hand? or using accelerate? I am not sure GPUs would do that
much better at matrix multiplication. According to that paper it's pretty
cache-bound, so having a bigger L1/L2 would have a higher impact on
performance than more cores or wider SIMD.
either way, I am mostly trying to understand how the parallelization
techniques in Repa work, what use cases they are applicable in, and how to
pick between the sequential and parallel versions of functions.
Anatoly
On Mon, Mar 16, 2015 at 6:20 PM, KC wrote:

You want to throw your parallelizable matrix operations to the GPU cores.
MATLAB can now do this, and I believe it is starting to be built into R so
that R can use the GPU cores.
On Mon, Mar 16, 2015 at 5:11 AM, Anatoly Yakovenko <aeyakovenko@gmail.com> wrote:
hmm, so i was wrong
https://gist.github.com/aeyakovenko/0af788390ee9d980c1d6
the first version performed the best, even when running with -N1 against
the sequential version.
On Sun, Mar 15, 2015 at 8:04 PM, Carter Schonwald wrote:

you're getting on the right track! :)
the idea you're reaching for is "parallel work depth". E.g., if instead of
foldl' (which has O(n) work depth) you had a "parallel" fold that kinda
looks like a recursive split-and-merge version of the fold operation, you'd
have O(log n) work depth (and that'd likely be faster!). But you'd notice
that below some threshold it's better to compute sequentially, because the
overhead of parallelization is too big. Etc etc. (The point I'm trying to
get at is that effective parallelization requires a pretty rich
understanding of your application / software / hardware cost model.)
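(A minimal sketch of that split-and-merge fold, using plain lists and
Control.Parallel purely for illustration; the name parSum and the threshold
are made up, and a real version would work over an unboxed vector rather
than a list:)

    import Control.Parallel (par, pseq)
    import Data.List (foldl')

    -- recursive split-and-merge sum: O(log n) work depth instead of
    -- the O(n) of foldl'. below the threshold we fall back to the
    -- sequential fold, because spark overhead would swamp any win.
    parSum :: Int -> [Double] -> Double
    parSum n xs
      | n <= 10000 = foldl' (+) 0 xs          -- sequential base case
      | otherwise  = l `par` (r `pseq` l + r) -- evaluate halves in parallel
      where
        half   = n `div` 2
        (a, b) = splitAt half xs
        l      = parSum half a
        r      = parSum (n - half) b

(Call it as parSum (length xs) xs, compile with -threaded, and run with
+RTS -N.)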
likewise, REPA is really only going to shine on workloads that look
"pointwise" or "flat", at least in its current iteration. It's a good idea
to look at the various example codes available for repa and accelerate,
because you'll notice that the codes which are especially performant have
that "flat" style of parallelism.
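(As a sketch of the kind of code that fits that mold, a pointwise zip
followed by a parallel reduction; dotP is just an illustrative name:)

    import Data.Array.Repa as R

    -- a "flat" workload repa is good at: each output element depends
    -- only on the corresponding input elements, so the index space can
    -- simply be carved into one chunk per capability
    dotP :: Monad m => Array U DIM1 Double -> Array U DIM1 Double -> m Double
    dotP xs ys = sumAllP (R.zipWith (*) xs ys)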
On Sun, Mar 15, 2015 at 7:16 PM, Anatoly Yakovenko wrote:

Ok, got it. I picked the wrong function to try to understand how Repa
parallelizes :)
So what's a good heuristic for choosing between the parallel and sequential
versions of Repa functions?
Do the internals try to parallelize every element, or do they fuse the work
into some small number of parallel tasks?
So just based on my observations, given

    f (Z :. r :. c) = r * c

this:

    a <- computeP (fromFunction sh f)
    a `deepSeqArray` sumAllP a

should be faster than:

    let a = computeS (fromFunction sh f)
    a `deepSeqArray` sumAllP a

but probably slower than:

    sumAllS (fromFunction sh f)

since in the last version the intermediate array is never even computed.
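(For concreteness, a self-contained sketch of the three variants; the shape
sh, the sizes, and the variant names are made up for illustration:)

    import Data.Array.Repa as R

    sh :: DIM2
    sh = Z :. 1000 :. 1000

    f :: DIM2 -> Int
    f (Z :. r :. c) = r * c

    -- parallel compute, parallel reduce: materializes an unboxed
    -- array across all capabilities, then sums it in parallel
    variantP :: IO Int
    variantP = do
      a <- computeP (fromFunction sh f) :: IO (Array U DIM2 Int)
      a `deepSeqArray` sumAllP a

    -- sequential compute, parallel reduce
    variantSP :: IO Int
    variantSP = do
      let a = computeS (fromFunction sh f) :: Array U DIM2 Int
      a `deepSeqArray` sumAllP a

    -- fully fused sequential version: sumAllS consumes the delayed
    -- array element by element, so no intermediate array is allocated
    variantS :: Int
    variantS = sumAllS (fromFunction sh f)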
Thanks, Anatoly
On Sun, Mar 15, 2015 at 1:41 PM, Carter Schonwald wrote:

Read that paper I linked. Anything else I say will be a rehash of that
paper. :)
On Mar 15, 2015 4:21 PM, "Anatoly Yakovenko" <aeyakovenko@gmail.com> wrote:

> Ok, so what's the difference between the sequential and parallel
> versions? Does the parallel one contain a thunk for every element in
> the output?
>
> On Sun, Mar 15, 2015 at 12:44 PM, Carter Schonwald wrote:
>
>> Read what I linked. You are benchmarking repa for exactly the pessimal
>> workload that it is bad at.
>>
>> Repa is for pointwise-parallel and local-convolution-parallel programs.
>> The way repa can express matrix multiplication is exactly the worst way
>> to implement a parallel matrix mult. Like, pretty pessimal wrt a memory
>> traffic / communication complexity metric of performance.
>>
>> Benchmark something like image blur algorithms and repa will really
>> shine.
>>
>> Right now your benchmark is the repa equivalent of noticing that random
>> access on singly linked lists is slow :)
>>
>> On Mar 15, 2015 2:44 PM, "Anatoly Yakovenko" wrote:
>>
>>> I am not really focusing on matrix multiply specifically. The real
>>> problem is that the implementation using parallelized functions is
>>> slower than the sequential one, and adding more threads makes it
>>> barely as fast as the sequential one.
>>>
>>> So why would I ever use the parallelized versions?
>>>
>>> On Sat, Mar 14, 2015 at 9:24 AM, Carter Schonwald wrote:
>>>
>>>> http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf
>>>> this paper (among many others by the blis project) articulates some
>>>> of the ideas i allude to pretty well (with pictures!)
>>>>
>>>> On Sat, Mar 14, 2015 at 12:21 PM, Carter Schonwald wrote:
>>>>
>>>>> dense matrix product is not an algorithm that makes sense in repa's
>>>>> execution model. in a square matrix multiply of two N x N matrices,
>>>>> each result entry depends on 2N values total across the two input
>>>>> matrices. even then, that's actually the wrong way to parallelize
>>>>> dense matrix product! it's worth reading the papers about goto blas
>>>>> and then probably the more recent blis project. a high-performance
>>>>> dense matrix multiply winds up needing to do some nested array
>>>>> parallelism with mutable updates to have efficient sharing of sub
>>>>> computations!
>>>>>
>>>>> On Fri, Mar 13, 2015 at 9:03 PM, Anatoly Yakovenko wrote:
>>>>>
>>>>>> you think the backend would make any difference? this seems like a
>>>>>> runtime issue to me. how are the threads scheduled by the ghc
>>>>>> runtime?
>>>>>>
>>>>>> On Fri, Mar 13, 2015 at 4:58 PM, KC wrote:
>>>>>>
>>>>>>> How is the LLVM?
>>>>>>>
>>>>>>> --
>>>>>>> Sent from an expensive device which will be obsolete in a few
>>>>>>> months! :D
>>>>>>>
>>>>>>> Casey
>>>>>>>
>>>>>>> On Mar 13, 2015 10:24 AM, "Anatoly Yakovenko" wrote:
>>>>>>>
>>>>>>>> https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8
>>>>>>>>
>>>>>>>> so i am seeing basically results with -N4 that are as good as
>>>>>>>> using sequential computation on my macbook for the matrix
>>>>>>>> multiply algorithm. any idea why?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anatoly
--
Sent from an expensive device which will be obsolete in a few months! :D
Casey
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe