
Read that paper I linked. Anything else I say will be a rehash of that
paper. :)
On Mar 15, 2015 4:21 PM, "Anatoly Yakovenko" wrote:
Ok, so what's the difference between the sequential and parallel versions? Does the parallel one contain a thunk for every element in the output?
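(For context: in repa 3 both variants force the same delayed array into a manifest unboxed array, so neither leaves a thunk per output element; the parallel one just evaluates across the worker gang. A minimal sketch, with illustrative names:)

    import Data.Array.Repa as R

    -- Both variants force the same delayed (D) array into a manifest
    -- unboxed (U) array; neither leaves a thunk per output element.
    doubleSeq :: Array U DIM1 Double -> Array U DIM1 Double
    doubleSeq xs = computeS (R.map (* 2) xs)   -- evaluated on one thread

    doublePar :: Monad m => Array U DIM1 Double -> m (Array U DIM1 Double)
    doublePar xs = computeP (R.map (* 2) xs)   -- evaluated across the gang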
On Mar 15, 2015, Carter Schonwald wrote:
Read what I linked. You are benchmarking repa on exactly the pessimal workload, the one it is bad at.

Repa is for pointwise-parallel and local-convolution-parallel programs. The way repa expresses matrix multiplication is just about the worst way to implement a parallel matrix multiply: pretty pessimal with respect to a memory-traffic / communication-complexity metric of performance.

Benchmark something like an image blur algorithm and repa will really shine (see the sketch after this message).

Right now your benchmark is the repa equivalent of noticing that random access on singly linked lists is slow :)
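(To make the blur suggestion concrete, a minimal sketch of the local-convolution style of program repa is built for, assuming repa 3's stencil modules; the 3x3 box blur here is illustrative:)

    {-# LANGUAGE QuasiQuotes #-}
    import Data.Array.Repa                as R
    import Data.Array.Repa.Stencil        as R
    import Data.Array.Repa.Stencil.Dim2   as R

    -- One pass of a 3x3 box blur. Each output pixel depends only on a
    -- small local neighbourhood, so repa can fuse the stencil with the
    -- division and split the image across threads with no communication.
    blur :: Monad m => Array U DIM2 Double -> m (Array U DIM2 Double)
    blur img = computeP
             $ R.map (/ 9)
             $ mapStencil2 (BoundConst 0)
                 [stencil2| 1 1 1
                            1 1 1
                            1 1 1 |] img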
On Mar 15, 2015 2:44 PM, "Anatoly Yakovenko" wrote:
I am not really focusing on matrix multiply specifically. The real problem is that the implementation using the parallelized functions is slower than the sequential one, and adding more threads barely makes it as fast as the sequential one.

So why would I ever use the parallelized versions?
On Sun, Mar 15, 2015 at 12:44 PM, Carter Schonwald wrote:
This paper (among many others by the blis project) articulates some of the ideas I allude to pretty well (with pictures!)

On Sat, Mar 14, 2015 at 9:24 AM, Carter Schonwald wrote:
http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf
On Sat, Mar 14, 2015 at 12:21 PM, Carter Schonwald wrote:
Dense matrix product is not an algorithm that makes sense in repa's execution model: in a square matrix multiply of two N x N matrices, each result entry depends on 2N values total across the two input matrices. Even then, computing each output entry independently is actually the wrong way to parallelize dense matrix product! It's worth reading the papers about Goto BLAS and the more recent BLIS project. A high-performance dense matrix multiply winds up needing to do some nested array parallelism with mutable updates to have efficient sharing of sub-computations! (A much-simplified sketch of the blocking idea follows this message.)
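(A deliberately simplified, sequential sketch of the blocking idea, using plain unboxed mutable vectors rather than repa; matMulBlocked and the b x b tile size are hypothetical names, and a real Goto/BLIS kernel adds packing, vectorized micro-kernels, and the nested parallelism described above:)

    import Control.Monad (forM_)
    import qualified Data.Vector.Unboxed         as V
    import qualified Data.Vector.Unboxed.Mutable as MV

    -- C = A * B for n x n row-major matrices, computed b x b tiles at a
    -- time so the tiles of A, B and C being reused stay resident in cache.
    matMulBlocked :: Int -> Int -> V.Vector Double -> V.Vector Double
                  -> V.Vector Double
    matMulBlocked n b a bMat = V.create $ do
      c <- MV.replicate (n * n) 0
      forM_ [0, b .. n - 1] $ \ii ->              -- tile row of C
        forM_ [0, b .. n - 1] $ \kk ->            -- tile of shared dimension
          forM_ [0, b .. n - 1] $ \jj ->          -- tile column of C
            forM_ [ii .. min (ii + b) n - 1] $ \i ->
              forM_ [kk .. min (kk + b) n - 1] $ \k -> do
                let aik = a V.! (i * n + k)
                forM_ [jj .. min (jj + b) n - 1] $ \j -> do
                  -- mutable accumulation: each C tile is updated in place
                  cij <- MV.read c (i * n + j)
                  MV.write c (i * n + j) (cij + aik * bMat V.! (k * n + j))
      return c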
On Fri, Mar 13, 2015 at 9:03 PM, Anatoly Yakovenko wrote:
You think the backend would make any difference? This seems like a runtime issue to me. How are the threads scheduled by the GHC runtime?
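(For reference, a minimal sketch of how such a benchmark is typically built and run; the real code is in the gist linked below. -fllvm selects the LLVM backend being discussed, and +RTS -N4 sets how many capabilities the GHC runtime schedules repa's worker gang across:)

    -- Build:  ghc -O2 -threaded -fllvm Bench.hs
    -- Run:    ./Bench +RTS -N4 -s
    --
    -- -N4 gives the runtime four capabilities for repa's gang of worker
    -- threads; -s prints GC and productivity statistics, which usually
    -- show whether the parallel version is actually using the cores.
    import Data.Array.Repa as R

    main :: IO ()
    main = do
      let n  = 1024
          xs = fromListUnboxed (Z :. n :. n)
                 [fromIntegral i | i <- [0 .. n * n - 1]] :: Array U DIM2 Double
      total <- sumAllP (R.map (* 2) xs)   -- forced in parallel across the gang
      print total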
On Fri, Mar 13, 2015 at 4:58 PM, KC wrote:
How is the LLVM backend?

Sent from an expensive device which will be obsolete in a few months! :D

Casey

On Mar 13, 2015 10:24 AM, "Anatoly Yakovenko" wrote:
https://gist.github.com/aeyakovenko/bf558697a0b3f377f9e8

So I am seeing results with -N4 that are basically as good as using sequential computation on my MacBook for the matrix multiply algorithm. Any idea why?

Thanks,
Anatoly

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe