
https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8

So I am seeing results with -N4 that are basically as good as the sequential computation on my MacBook for the matrix multiply algorithm. Any idea why?

Thanks,
Anatoly

How is the LLVM?
--
Sent from an expensive device which will be obsolete in a few months! :D
Casey
On Mar 13, 2015 10:24 AM, "Anatoly Yakovenko" wrote:
> https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8 [...]

You think the backend would make any difference? This seems like a
runtime issue to me; how are the threads scheduled by the GHC runtime?
On Fri, Mar 13, 2015 at 4:58 PM, KC wrote:
> How is the LLVM? [...]

Dense matrix product is not an algorithm that makes sense in repa's
execution model. In a square matrix multiply of two N x N matrices, each
result entry depends on 2N values total across the two input matrices.
Even then, that's actually the wrong way to parallelize dense matrix
product! It's worth reading the papers about GotoBLAS and the more recent
BLIS project. A high performance dense matrix multiply winds up needing to
do some nested array parallelism with mutable updates to have efficient
sharing of sub-computations!
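
To give the flavour of that last point, here is a minimal single-threaded sketch of the blocking-plus-mutable-accumulation idea in plain Haskell (unboxed vectors, row-major n x n matrices; the function name and block size are assumptions for illustration, and this is nowhere near a real GotoBLAS/BLIS kernel, which adds packing, a SIMD micro-kernel and parallelism on top):

import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as VM
import Control.Monad (forM_)
import Control.Monad.ST (runST)

-- Multiply two n x n row-major matrices, walking the i/k index space in
-- blocks and accumulating into a mutable result, so each slice of A and B
-- is reused while it is still hot in cache.
mmultBlocked :: Int -> Int -> V.Vector Double -> V.Vector Double -> V.Vector Double
mmultBlocked n blk a b = runST $ do
  c <- VM.replicate (n * n) 0
  forM_ [0, blk .. n - 1] $ \ii ->
    forM_ [0, blk .. n - 1] $ \kk ->
      forM_ [ii .. min (ii + blk) n - 1] $ \i ->
        forM_ [kk .. min (kk + blk) n - 1] $ \k -> do
          let aik = a V.! (i * n + k)
          forM_ [0 .. n - 1] $ \j -> do
            cij <- VM.read c (i * n + j)
            VM.write c (i * n + j) (cij + aik * (b V.! (k * n + j)))
  V.freeze c

The arithmetic is identical to the naive triple loop; only the traversal order and the in-place updates change, and that traversal order is exactly what the GotoBLAS/BLIS line of work is about.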
On Fri, Mar 13, 2015 at 9:03 PM, Anatoly Yakovenko wrote:
> You think the backend would make any difference? This seems like a runtime issue to me [...]

http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf: this paper
(among many others by the BLIS project) articulates some of the ideas I
allude to pretty well (with pictures!).

On Sat, Mar 14, 2015 at 12:21 PM, Carter Schonwald <carter.schonwald@gmail.com> wrote:
> Dense matrix product is not an algorithm that makes sense in repa's execution model [...]

I am not really focusing on matrix multiply specifically. So the real
problem is that the implementation using the parallelized functions is
slower than the sequential one, and adding more threads makes it
barely as fast as the sequential one.
So why would I ever use the parallelized versions?
On Sat, Mar 14, 2015 at 9:24 AM, Carter Schonwald wrote:
> http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf: this paper (among many others by the BLIS project) [...]

Read what I linked.
You are benchmarking repa on exactly the pessimal workload, the one it is
bad at.
Repa is for pointwise parallel and local convolution parallel programs.
The way repa can express matrix multiplication is exactly the worst way to
implement a parallel matrix multiply: pretty much pessimal with respect to
a memory traffic / communication complexity metric of performance.
Benchmark something like image blur algorithms and repa will really shine.
Right now your benchmark is the repa equivalent of noticing that random
access on singly linked lists is slow :)
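
To make "pointwise parallel" concrete, a minimal sketch (the function name and the choice of 1-D unboxed arrays are assumptions, not code from the thread): every output element depends only on the matching input elements, so repa can hand disjoint chunks of the array to each capability with nothing shared between threads.

import Data.Array.Repa as R

-- elementwise alpha * x + y over unboxed 1-D arrays
saxpy :: Monad m
      => Double -> Array U DIM1 Double -> Array U DIM1 Double
      -> m (Array U DIM1 Double)
saxpy alpha xs ys = computeP (R.zipWith (\x y -> alpha * x + y) xs ys)

Built with -threaded and run with +RTS -N2, computeP evaluates the chunks on both capabilities; this flat shape is the workload repa's computeP is designed for.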
On Mar 15, 2015 2:44 PM, "Anatoly Yakovenko" wrote:
> I am not really focusing on matrix multiply specifically. [...]

Ok, so what's the difference between the sequential and parallel
versions? Does the parallel one contain a thunk for every element in
the output?
On Sun, Mar 15, 2015 at 12:44 PM, Carter Schonwald wrote:
> Read what I linked. You are benchmarking repa on exactly the pessimal workload [...]

The likely issue is cache thrash.
On Mar 15, 2015 4:21 PM, "Anatoly Yakovenko" wrote:
> Ok, so what's the difference between the sequential and parallel versions? [...]

Read that paper I linked. Anything else I say will be a rehash of that
paper. :)
On Mar 15, 2015 4:21 PM, "Anatoly Yakovenko" wrote:
> Ok, so what's the difference between the sequential and parallel versions? [...]

Ok, got it. I picked the wrong function to try to understand how Repa
parallelizes :)
So what's a good heuristic for choosing the parallel versions vs. the
sequential ones in Repa?
Do the internals try to parallelize every element? Or does it fuse
them into some small number of parallelized tasks?
Just based on my observations, with

f (Z :. r :. c) = r * c

this:

a <- computeP (fromFunction f)
a `deepSeqArray` sumAllP a

should be faster than:

let a = computeS $ fromFunction f
a `deepSeqArray` sumAllP a

but probably slower than:

sumAllS $ computeS $ fromFunction f

since an intermediate array is not even computed.
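
Spelled out as a runnable sketch (the extent passed to fromFunction and the Int element type are assumptions here, since the snippets above leave them out), the three variants look roughly like this:

import Data.Array.Repa as R

f :: DIM2 -> Int
f (Z :. r :. c) = r * c

ext :: DIM2
ext = Z :. 1000 :. 1000   -- assumed extent; fromFunction needs one

-- parallel compute into a manifest array, then parallel sum
variantPP :: IO Int
variantPP = do
  a <- computeP (fromFunction ext f) :: IO (Array U DIM2 Int)
  a `deepSeqArray` sumAllP a

-- sequential compute, then parallel sum
variantSP :: IO Int
variantSP = do
  let a = computeS (fromFunction ext f) :: Array U DIM2 Int
  a `deepSeqArray` sumAllP a

-- sequential sum taken straight off the delayed array; note that with
-- computeS in between, an intermediate array would in fact be materialised,
-- so summing the delayed array directly is what avoids it
variantSS :: Int
variantSS = sumAllS (fromFunction ext f)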
Thanks,
Anatoly
On Sun, Mar 15, 2015 at 1:41 PM, Carter Schonwald wrote:
> Read that paper I linked. Anything else I say will be a rehash of that paper. :) [...]

You're getting on the right track! :)
The idea you're reaching for is "parallel work depth". E.g., if instead of
foldl' (which has O(n) work depth) you had a "parallel" fold that looks
like a recursive split-and-merge version of the fold operation, you'd have
O(log n) work depth (and that'd likely be faster!). But then you'd notice
that below some threshold it's better to compute sequentially, because the
overhead of parallelization is too big.
Etc., etc. (The point I'm trying to reach for is that effective
parallelization requires a pretty rich understanding of your application /
software / hardware cost model.)
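
A minimal sketch of that split-and-merge fold with a sequential cutoff (the names, the cutoff and the use of unboxed vectors are assumptions for illustration; this uses plain par/pseq from the parallel package, not repa):

import Control.Parallel (par, pseq)
import qualified Data.Vector.Unboxed as V

-- Recursive split-and-merge sum: O(log n) depth instead of foldl's O(n).
-- Below the cutoff we fall back to a plain sequential sum, because
-- sparking tiny pieces costs more than the parallelism buys back.
parSum :: Int -> V.Vector Double -> Double
parSum cutoff v
  | V.length v <= cutoff = V.sum v
  | otherwise =
      let (l, r) = V.splitAt (V.length v `div` 2) v
          sl     = parSum cutoff l
          sr     = parSum cutoff r
      in  sl `par` (sr `pseq` (sl + sr))

Compile with -threaded and run with +RTS -Nx for the sparks to actually run in parallel; picking the cutoff is exactly the cost-model judgement described above.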
Likewise, REPA is really only going to shine on workloads that look
"pointwise" or "flat", at least in its current iteration. It's probably a
good idea to look at the various example codes that are available for repa
and accelerate, because you'll notice that the codes which are especially
performant have that "flat" style of parallelism.
On Sun, Mar 15, 2015 at 7:16 PM, Anatoly Yakovenko wrote:
> Ok, got it. I picked the wrong function to try to understand how Repa parallelizes :) [...]

Hmm, so I was wrong:
https://gist.github.com/aeyakovenko/0af788390ee9d980c1d6
The first version performed the best, even when running with -N1
against the sequential version.
On Sun, Mar 15, 2015 at 8:04 PM, Carter Schonwald wrote:
> You're getting on the right track! :) [...]

You want to throw your parallelizable matrix operations to the GPU cores.
MATLAB can now do this, and I believe it is starting to be built into R so
that R can use the GPU cores.
On Mon, Mar 16, 2015 at 5:11 AM, Anatoly Yakovenko wrote:
> Hmm, so I was wrong [...]

--
Sent from an expensive device which will be obsolete in a few months! :D
Casey

You mean by hand? Or using accelerate? I am not sure GPUs would do
that much better at matrix multiplication. According to that paper
it's pretty cache bound, so having a bigger L1/L2 would have a higher
impact on performance than more cores, or wider SIMD.
Either way, I am mostly trying to understand how the parallelization
techniques work in Repa, what use cases they are applicable in, and how
to pick the sequential or parallel versions of functions.
Anatoly
On Mon, Mar 16, 2015 at 6:20 PM, KC wrote:
> You want to throw your parallelizable matrix operations to the GPU cores. [...]

If you want to understand writing matrix algorithms for the GPU,
http://icl.cs.utk.edu/magma/pubs/index.html has lots of reading resources.
The cost model for memory and computation on a GPU is different from the
CPU cost model.
And http://icl.cs.utk.edu/magma/software/ has a bunch of high performance
GPU kernels for a bunch of architectures.
accelerate is good to look at, at minimum because it manages to focus on a
fragment of functional array programs that are moderately easy to optimize
for the GPU.
On Mon, Mar 16, 2015 at 11:27 PM, Anatoly Yakovenko wrote:
> You mean by hand? Or using accelerate? [...]

Anatoly: I also ran your benchmark, and cannot reproduce your findings.

Note that GHC does not make effective use of hyperthreads
(https://ghc.haskell.org/trac/ghc/ticket/9221#comment:12), so don't use -N4
when you have only a dual core machine. Maybe that's why you were getting
bad results? I also notice a `NaN` in one of your timing results. I don't
know how that is possible, or if it affected your results. Could you try
running your benchmark again, but this time with -N2?

On Sat, Mar 14, 2015 at 5:21 PM, Carter Schonwald <carter.schonwald@gmail.com> wrote:
> Dense matrix product is not an algorithm that makes sense in repa's execution model

Matrix multiplication is the first example in the first repa paper:
http://benl.ouroborus.net/papers/repa/repa-icfp2010.pdf. Look at figures 2
and 7: "we measured very good absolute speedup, ×7.2 for 8 cores, on
multicore hardware".

Doing a quick experiment with 2 threads (my laptop doesn't have more cores):

$ cabal install repa-examples   # I did not bother with `-fllvm`
...
$ ~/.cabal/bin/repa-mmult -random 1024 1024 -random 1024 1204
elapsedTimeMS = 6491
$ ~/.cabal/bin/repa-mmult -random 1024 1024 -random 1024 1204 +RTS -N2
elapsedTimeMS = 3393

This is with GHC 7.10.3 and repa-3.4.0.1 (and dependencies from
http://www.stackage.org/snapshot/lts-3.22).
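
For anyone who would rather rerun this as a criterion benchmark than through repa-examples, a sketch along these lines should work (it assumes repa-algorithms' Data.Array.Repa.Algorithms.Matrix for mmultP/mmultS and criterion for the harness; the matrix size and contents are arbitrary):

import Criterion.Main (bench, defaultMain, whnf, whnfIO)
import Data.Array.Repa
import Data.Array.Repa.Algorithms.Matrix (mmultP, mmultS)

main :: IO ()
main = do
  let n  = 1024
      sh = Z :. n :. n :: DIM2
      a  = fromListUnboxed sh [fromIntegral i | i <- [0 .. n * n - 1]] :: Array U DIM2 Double
  defaultMain
    -- whnf is enough to force the work: both results are manifest unboxed arrays
    [ bench "mmultP (parallel, scales with +RTS -N)" (whnfIO (mmultP a a))
    , bench "mmultS (sequential)"                    (whnf (mmultS a) a)
    ]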

To avoid any confusion, this was a reply to the following email:
On Fri, Mar 13, 2015 at 6:23 PM, Anatoly Yakovenko wrote:
> https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8 [...]

Not sure what changed, but after rerunning it I get expected results:
anatolys-MacBook:rbm anatolyy$ dist/build/proto/proto +RTS -N2
benchmarking P
time 1.791 s (1.443 s .. 2.304 s)
0.991 R² (0.974 R² .. 1.000 R²)
mean 1.803 s (1.750 s .. 1.855 s)
std dev 90.06 ms (0.0 s .. 90.90 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking S
time 3.225 s (2.685 s .. 3.837 s)
0.996 R² (0.985 R² .. 1.000 R²)
mean 3.033 s (2.857 s .. 3.142 s)
std dev 165.0 ms (0.0 s .. 188.7 ms)
variance introduced by outliers: 19% (moderately inflated)
perf log written to dist/perf-mmult.html
anatolys-MacBook:rbm anatolyy$ dist/build/proto/proto +RTS -N4
benchmarking P
time 1.851 s (1.326 s .. 2.316 s)
0.990 R² (0.964 R² .. 1.000 R²)
mean 1.784 s (1.693 s .. 1.901 s)
std dev 106.3 ms (0.0 s .. 119.8 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking S
time 3.329 s (3.041 s .. 3.944 s)
0.996 R² (0.993 R² .. 1.000 R²)
mean 3.173 s (3.100 s .. 3.244 s)
std dev 119.6 ms (0.0 s .. 121.9 ms)
variance introduced by outliers: 19% (moderately inflated)
perf log written to dist/perf-mmult.html
anatolys-MacBook:rbm anatolyy$ dist/build/proto/proto +RTS -N
benchmarking P
time 1.717 s (1.654 s .. 1.830 s)
0.999 R² (0.999 R² .. 1.000 R²)
mean 1.717 s (1.701 s .. 1.728 s)
std dev 16.64 ms (0.0 s .. 19.20 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking S
time 3.127 s (3.079 s .. 3.222 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 3.105 s (3.094 s .. 3.116 s)
std dev 18.12 ms (543.9 as .. 18.50 ms)
variance introduced by outliers: 19% (moderately inflated)
perf log written to dist/perf-mmult.html
On Thu, Jan 14, 2016 at 11:22 AM Thomas Miedema wrote:
> To avoid any confusion, this was a reply to the following email [...]

Anatoly Yakovenko wrote:
> https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8 [...]
Hi Anatoly,

repa is good for things that live on a grid, e.g. forward / backward Euler,
Crank-Nicolson, convolution (e.g. of images), and multi-grid methods, where
each cell is updated based on local information (so we are in the world of
comonads). I imagine it would also be good for Ising models (but maybe using
Swendsen-Wang or Wolff). It is not good where the update is based on global
information, e.g. simulating the solar system.

You might compare your results in repa against yarr:
https://hackage.haskell.org/package/yarr

Here are some examples of repa / yarr that could be of use:

https://idontgetoutmuch.wordpress.com/2014/02/10/laplaces-equation-in-haskell-using-a-dsl-for-stencils-3/
https://idontgetoutmuch.wordpress.com/2013/08/06/planetary-simulation-with-excursions-in-symplectic-manifolds-6/
https://idontgetoutmuch.wordpress.com/2013/02/10/parallelising-path-dependent-options-in-haskell-2/
https://readerunner.wordpress.com/2014/04/30/multigrid-methods-with-repa/

If I knew how to cross-post this to
https://mail.haskell.org/cgi-bin/mailman/listinfo/numeric when using gmane I
would do so.

Dominic.
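
As a concrete example of the "local information on a grid" case, a rough sketch of a 3x3 box blur using repa's stencil support (written from memory of Data.Array.Repa.Stencil / Data.Array.Repa.Stencil.Dim2; the exact boundary constructor and module split may differ between repa versions):

{-# LANGUAGE QuasiQuotes #-}
import Data.Array.Repa as R
import Data.Array.Repa.Stencil
import Data.Array.Repa.Stencil.Dim2

-- Each output pixel is the average of its 3x3 neighbourhood: a purely
-- local update on a grid, which is the workload shape repa is built for.
blur :: Array U DIM2 Double -> IO (Array U DIM2 Double)
blur img = do
  summed <- computeP (mapStencil2 BoundClamp box img) :: IO (Array U DIM2 Double)
  computeP (R.map (/ 9) summed)
  where
    box = [stencil2| 1 1 1
                     1 1 1
                     1 1 1 |]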
participants (5)
- Anatoly Yakovenko
- Carter Schonwald
- Dominic Steinitz
- KC
- Thomas Miedema