DpH/repa cache locality

Hey everyone,

Just out of curiosity, what work is being done in the data parallel haskell / repa projects regarding cache locality? The reason I am asking is that, as I understand it, the biggest bottleneck on today's processors is cache misses, and the reason optimized platform-specific linear algebra libraries perform well is that they divide the data into chunks sized to fit the cache, in order to maximize the number of operations performed per memory access. I wouldn't expect data parallel haskell/repa to automatically know the perfect chunking strategy for each platform, but are there any plans to do something like this?

(To be explicit, this isn't meant as a criticism; I'm just curious and interested in seeing discussion on this topic by those more knowledgeable than I. :-) )

Thanks!
Greg
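[Editorial note: the cache-blocking idea Greg describes can be sketched in a few lines of plain Haskell. This is a hypothetical illustration, not Repa or DPH code; `blockedOrder`, `rowMajorOrder`, and the tile size `b` are made-up names for the example. The point is that a blocked algorithm visits the same index set as a row-major one, just in an order that reuses a cache-sized tile while it is still hot.]

```haskell
-- Illustrative sketch (not Repa/DPH API): the order in which a tiled
-- ("blocked") algorithm visits the elements of an n x n matrix, with
-- tile size b chosen so one tile fits in cache.  A blocked traversal
-- reuses each b x b tile while it is hot, instead of streaming whole
-- rows through the cache.
blockedOrder :: Int -> Int -> [(Int, Int)]
blockedOrder n b =
  [ (i, j)
  | bi <- [0, b .. n - 1]             -- tile row origin
  , bj <- [0, b .. n - 1]             -- tile column origin
  , i  <- [bi .. min (bi + b) n - 1]  -- rows within the tile
  , j  <- [bj .. min (bj + b) n - 1]  -- columns within the tile
  ]

-- The ordinary row-major traversal, for comparison.
rowMajorOrder :: Int -> [(Int, Int)]
rowMajorOrder n = [ (i, j) | i <- [0 .. n - 1], j <- [0 .. n - 1] ]
```

Both traversals cover exactly the same indices; only the visiting order (and hence the cache behaviour) differs, which is why the right `b` is so platform-specific.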

Hi,
2010/7/13 Gregory Crosswhite
> Just out of curiosity, what work is being done in the data parallel haskell / repa projects regarding cache locality?

I'm not knowledgeable at all about this, but for a technical introduction to DPH, I found the following talk excellent. Locality is a recurring theme in it, IIRC. (Sorry for the double reply.)

http://www.youtube.com/watch?v=NWSZ4c9yqW8

Dominique

The difficulty with optimising for cache effects is that you're effectively introducing sequential dependencies between elements of the array you're processing. To say this another way: if you can evaluate the array elements in any order, then you can evaluate them in parallel. Adding restrictions like "element i must be processed before element i+1" can improve cache usage, but it also restricts the evaluation order and makes it less obvious how to parallelise the code.

For Repa, I think we'll end up providing primitive operators for doing things like 2D image convolutions. Moving from a linear image convolution, where all the pixels in one row are processed before moving on to the next, to a block-based convolution is really a change in algorithm -- and not something I'd expect a general purpose compiler optimisation to do.

Ben.

On 13/07/2010, at 9:49, Gregory Crosswhite wrote:

> Just out of curiosity, what work is being done in the data parallel haskell / repa projects regarding cache locality? [...]
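[Editorial note: Ben's point about evaluation order can be sketched in plain Haskell. This is a minimal illustration of my own, not Repa code: a map has no dependencies between elements, so any evaluation order (and hence any parallel schedule) gives the same result, while a scan forces element i to exist before element i+1.]

```haskell
-- Illustrative sketch (not Repa code) of the evaluation-order point.
-- No element of the result depends on any other: the elements can be
-- computed in any order, so the work parallelises trivially.
independent :: [Int] -> [Int]
independent = map (* 2)

-- Here element i+1 depends on element i (a running sum), which pins
-- down the evaluation order and hides the parallelism -- the kind of
-- dependency that cache-friendly orderings tend to introduce.
dependent :: [Int] -> [Int]
dependent = scanl1 (+)
```

The blocked convolution Ben mentions sits in between: elements within a block are still independent, but the block schedule itself is a genuine algorithmic choice, which is why it belongs in a primitive operator rather than a compiler optimisation.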
participants (3)

- Ben Lippmeier
- Dominique Devriese
- Gregory Crosswhite