ok, could you add those comments (about additional operations to consider) to the ticket?
Relatedly: if we want these atomic ops to use their sequential analogues when we're not using the threaded runtime system, does that mean we need a symbol / constant exposed in the RTS we link in, so that the inline code branches on a link-time constant (something like "isThreadedRTS :: Bool") or some analogue thereof?
One nice thing about doing so is that if link-time optimization is added at some point, the branch would go away! On the other hand, it could be argued that a call to the CAS primops in their current form isn't much more expensive than such a branch.
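
To make the idea concrete, here is a rough sketch in C of what branching on such a constant might look like. The names `is_threaded_rts` and `cas_word` are hypothetical and do not exist in the RTS today. In a real build the constant would be declared `extern` and defined once by whichever RTS flavour gets linked in; it is defined locally here only so the sketch compiles on its own.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical link-time constant (an assumption, not an existing RTS symbol).
 * A threaded RTS would define it as true, a non-threaded RTS as false.
 * Defined here so the sketch is self-contained. */
const bool is_threaded_rts = false;

/* Compare-and-swap on one word; returns the value seen before the operation.
 * With the threaded RTS it uses a real atomic CAS, otherwise the cheaper
 * sequential analogue (plain load, compare, store). */
uintptr_t cas_word(atomic_uintptr_t *p, uintptr_t expected, uintptr_t desired)
{
    if (is_threaded_rts) {
        /* On failure, `expected` is overwritten with the current value;
         * on success it still holds the old value. Either way it is the
         * value that was in *p before the call. */
        atomic_compare_exchange_strong(p, &expected, desired);
        return expected;
    } else {
        /* No other OS thread can interleave, so relaxed accesses suffice. */
        uintptr_t old = atomic_load_explicit(p, memory_order_relaxed);
        if (old == expected) {
            atomic_store_explicit(p, desired, memory_order_relaxed);
        }
        return old;
    }
}
```

Without LTO the branch costs a load of the symbol plus a (well-predicted) conditional jump on every call. With LTO the constant could be propagated and the untaken arm dropped, which is the case described above.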