How to make Repa do stencils unboxed?

Hello all,

TL;DR: Repa appears to be calculating my stencil operation using boxed arithmetic, which is very slow. How do I make it do unboxed arithmetic?

I am currently working on a project where I need to simulate a 2D environment that exhibits wave behavior. I'm accomplishing this by directly discretizing the wave equation and turning it into a cellular automaton. The math here is relatively simple. Let's specialize the example to the physical system of a rubber sheet, which exhibits wave behavior.

To simulate a rubber sheet, you divide it up into a grid of cells. Then, you keep track of the velocity and height of each cell. At each iteration of the simulation, you update the velocity of each cell based on the force exerted upon it by its neighboring cells. This is simply a linear multiple of the distance between the cell and its neighbors (i.e. how stretched the sheet is). For example, if cell x is at height X, and its neighbors a, b, c, d are at heights A, B, C, D, the tension on x is, for some constant L,

    T = L*((A-X)+(B-X)+(C-X)+(D-X)) = L*(A+B+C+D-4X)

You can calculate this efficiently via a matrix convolution, in particular by a convolution with the stencil

     0  1  0
     1 -4  1
     0  1  0

You can be more physically accurate by including the diagonals (weighted down a bit). A popular stencil for such purposes is

     1  2   1
     2 -12  2
     1  2   1

Technically the off-diagonal elements should be something like sqrt(2), but this is sufficient for most purposes.

After calculating the tension matrix, you divide it by the mass of each cell. This is the acceleration on each cell. Multiply it by the length of your time step, and you get the change in velocity of each cell. Add this to the velocity matrix. Then, multiply the velocity matrix by the time step, and that's the change in height. Add this to the height matrix. Boom, you've simulated a single time step of a rubber sheet. Keep doing this over and over to get the time-evolution of the rubber sheet.
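For concreteness, the per-cell update described above can be sketched as a plain Haskell function. This is a minimal illustration, not code from the gist; the names stepCell, k, m, and dt are assumptions made up for the example:

```haskell
-- Hypothetical single-cell update for the rubber-sheet time step.
-- 'k' is the tension constant L, 'm' the cell mass, 'dt' the time step.
stepCell :: Double -> Double -> Double
         -> Double                            -- current height X
         -> Double                            -- current velocity
         -> (Double, Double, Double, Double)  -- neighbour heights A, B, C, D
         -> (Double, Double)                  -- (new height, new velocity)
stepCell k m dt x v (a, b, c, d) = (x', v')
  where
    tension = k * (a + b + c + d - 4 * x)  -- the 5-point stencil above
    accel   = tension / m                  -- F = ma, rearranged
    v'      = v + accel * dt               -- update velocity first
    x'      = x + v' * dt                  -- then integrate height
```

Applying this to every cell of the height/velocity matrices is exactly what the stencil convolution does in bulk.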
My code to do this is as follows: https://gist.github.com/wyager/8c468c9d18ad62aff8bc9738aa947ea4

The imports are messy; sorry about that. You'll notice I'm using the Dimensional library, which provides statically typed physical dimensions over any numeric value using a newtype wrapper with type-level literals. Hopefully this isn't slowing things down somehow.

According to profiling via

    ghc -Odph -rtsopts -threaded -fno-liberate-case -funfolding-use-threshold1000 -funfolding-keeness-factor1000 -optlo-O3 -fforce-recomp -prof -fprof-auto -fexternal-interpreter RepaSim.hs

(whew), fully 80% of the time and 84% of the allocation comes from the stencil operation. The simulation starts with a rubber sheet with a smooth bump in the middle, simulates 100ms of time steps, saves a .png file of the sheet, and then repeats n times (n is read from standard input).

How can I make Repa do this stencil unboxed, and are there other things I should be doing to get better performance?

Thanks,
Will

On 2 Jul 2016, at 1:34 PM, William Yager wrote: My code to do this is as follows: https://gist.github.com/wyager/8c468c9d18ad62aff8bc9738aa947ea4
1) Put INLINE pragmas on all the leaf functions, especially ‘kernel’. If the compiler does not inline these functions they won’t fuse. This is a key problem with the Repa approach to fusion.

2) The ‘dimensional’ package wraps a data type around all those values. I’m not convinced the simplifier will be able to undo the wrapping / unwrapping. You’ll need to inspect the core code to check.

Ben.
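A sketch of the first suggestion, with a made-up stand-in for the real ‘kernel’ in the gist (the body here is illustrative only):

```haskell
-- Hypothetical leaf function standing in for 'kernel' in the gist.
-- The INLINE pragma asks GHC to substitute the body at every call site,
-- which is what allows it to fuse with the surrounding Repa traversal;
-- without it, the stencil loop may make an out-of-line call per element.
kernel :: Double -> Double -> Double
kernel height velocity = height + velocity * 0.001
{-# INLINE kernel #-}
```

Whether the pragma actually took effect can only be confirmed by inspecting the simplified Core (e.g. compiling with -ddump-simpl) and checking that no boxed values survive in the inner loop.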

OK, some interesting things:
1. INLINE pragmas seem to have no substantial effect.
2. Regrettably, using Dimensional does seem to have some negative effect on
performance. It's about 1.5x slower with Dimensional. The fragility of our
currently used fusion techniques renders empty the promise of
"overhead-free" newtype abstractions.
3. I got a *huge* performance boost by calculating `outputs` through
`runIdentity` rather than treating it as an IO action. Several times
faster. This makes sense, but I'm surprised the results are so drastic.
At this point, `mapStencil2
On Sat, Jul 2, 2016 at 3:46 AM, Ben Lippmeier wrote: [quoted text elided]

Oops, sent prematurely.
`mapStencil2` seems to be allocating one extra machine word per stencil
computation. So not entirely unboxed, but much better.
Latest code is at
https://gist.github.com/wyager/1c5879660a61c5cd6afaceb3e928d889
Can anyone explain to me how the Identity monad manages to guarantee the
sequencing/non-nesting requirements of computeP? As far as I can tell, it
should be reduced to plain old function application, which is the same as
nesting computeP, which is what we were trying to prevent.
In other words, what effect does the Identity monad have over not having a
monad at all? Its bind definition has no sequencing effects or anything,
so I can't imagine that it actually accomplishes anything.
Cheers,
Will
On Sat, Jul 2, 2016 at 3:46 PM, William Yager wrote: [quoted text elided]

On 3 Jul 2016, at 8:46 AM, William Yager wrote:
2. Regrettably, using Dimensional does seem to have some negative effect on performance. It's about 1.5x slower with Dimensional. The fragility of our currently used fusion techniques renders empty the promise of "overhead-free" newtype abstractions.
Yes, this is a massive problem with the Repa approach to array fusion. The compilation method (and resulting runtime performance) depends on the GHC simplifier acting in a certain way — yet there is no specification of exactly what the simplifier should do, and no easy way to check that it did what was expected other than eyeballing the intermediate code.

We really need a different approach to program optimisation, such as a working supercompiler, instead of just “a lot of bullets in the gun” that might hit the target or not. The latter is fine for general purpose code optimisation, but not for “compile by transformation”, where we really depend on the transformations doing what they’re supposed to. The use of syntactic witnesses of type equality in the core language doesn’t help either, as they tend to get in the way of other program transformations.

I think the way the Repa API is *set up* is fine, but expecting a general purpose compiler to always successfully fuse such code is too optimistic. We had the same sort of problems in DPH. Live and learn.
3. I got a *huge* performance boost by calculating `outputs` through `runIdentity` rather than treating it as an IO action. Several times faster. This makes sense, but I'm surprised the results are so drastic.
If a source level change causes some syntactic wibble in the intermediate code that helps fusion, then the result will be drastic.
Can anyone explain to me how the Identity monad manages to guarantee the sequencing/non-nesting requirements of computeP?
It doesn’t. I wrapped computeP in an identity monad to discourage people from trying to perform a computeP in the worker function of another parallel operator (like map).
As far as I can tell, it should be reduced to plain old function application, which is the same as nesting computeP, which is what we were trying to prevent.
Yes. Doing this would create a nested parallel computation, which Repa does not support (as you mentioned). If you were to use ‘runIdentity’ to discharge the identity constructor and then create a nested parallel computation anyway, you’d get a runtime warning on the console.
In other words, what effect does the Identity monad have over not having a monad at all? Its bind definition has no sequencing effects or anything, so I can't imagine that it actually accomplishes anything.
Its purpose is psychological. It forces the user to question what is really going on. Ben.
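The point that Identity adds nothing operationally can be seen in a small self-contained sketch (the names double and chained are illustrative, not from the gist): Identity’s bind is plain function application in a wrapper.

```haskell
import Data.Functor.Identity (Identity (..), runIdentity)

-- A stand-in for a computeP-like step: monadic in form only.
double :: Int -> Identity Int
double n = return (n * 2)

-- Bind in Identity is just function application, so this
-- computes 2 * (2 * 10) with no sequencing effects added.
chained :: Int
chained = runIdentity (double 10 >>= double)
```

So the monadic interface imposes an ordering on how the programmer writes the pipeline, but erases to ordinary application at runtime — which is exactly why its value here is psychological rather than semantic.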
participants (2): Ben Lippmeier, William Yager