OK, some interesting things:
1. INLINE pragmas seem to have no substantial effect.
2. Regrettably, using Dimensional does seem to have some negative effect on performance. It's about 1.5x slower with Dimensional. The fragility of our currently used fusion techniques renders empty the promise of "overhead-free" newtype abstractions.
3. I got a *huge* performance boost by calculating `outputs` through `runIdentity` rather than treating it as an IO action. Several times faster. This makes sense, but I'm surprised the results are so drastic.
At this point, `mapStencil2