Sounds good. I am not familiar with the current implementation of CAS and the trade-offs but in principle moving to a CallishMachOp sounds right. You can look at previous examples of this that Johan and I have done in the past to support things like popcnt efficiently:
2906db6c3a3f1000bd7347c7d8e45e65eb2806cb
2d0438f329ac153f9e59155f405d27fac0c43d65
Implementing the change shouldn't be that hard, mostly it'll probably just be a challenge getting familiar with GHC if you aren't already, but a change like this has good documentation and prior examples to work from. A CallishMachOp should allow you to probably just use the existing implementation to start but with the code gen now knowing more info. This as a stepping stone will ensure correctness, then you can look at implementing backend specific versions that are optimized and produce direct code so they are inlined.
Cheers,
David