If you are serious about arrow optimizations, it might be worthwhile to modify the original preprocessor to output a richer GADT (with cases for things like ArrSwap) with all the information it knows, so that user code can experiment with optimizations. 

Godspeed on this interesting problem.  Please publish :-)

Thanks for the tips Luke. I didn't think of that. But would doing these optimizations at runtime be as efficient as using rewrite rules? I know Yampa uses these GADTs for optimization.

No I won't be able to make such a big optimizer myself - but Paul Liu and Paul Hudak are working on something like that I think (see the Causal Commutative Arrows paper). But my code made use of a non-default *** and I found it odd that it was never called, so I hoped it would be easy to add a couple of simple rewrite rule so that *** was generated.