For Johan's primops to work, each primop must act as a full memory fence that is respected by the architecture and by both compilers (GHC and LLVM).  Since I don't think GHC is a problem, let's talk about LLVM.  We need to verify that LLVM knows not to move regular loads and stores across one of its own atomic instructions, in either direction.  If that is the case (even without anything being marked "volatile"), then I think we are in OK shape, right?

Clarification -- this assumes we're using the "SequentiallyConsistent" ordering in the LLVM backend to get a full fence on each op, which is the ordering that corresponds to the gcc-compatible __sync_* builtins:

   http://llvm.org/docs/Atomics.html#sequentiallyconsistent
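
To make the question concrete, here is a minimal C sketch of the pattern I have in mind, written against the __sync_* builtins mentioned above. The names (payload, ready, publish, try_consume) are hypothetical and only for illustration; this is not the primop implementation itself. The plain accesses to payload are the "regular loads and stores" in question, and the program is only correct across threads if the compiler keeps each of them on its side of the atomic op:

```c
/* Hypothetical shared state: `payload` is accessed with plain,
   non-volatile loads and stores; `ready` is only ever touched
   through __sync_* builtins. */
static int payload = 0;
static int ready = 0;

/* Writer side: the plain store to `payload` must not sink below
   the atomic op that publishes it. */
void publish(int value) {
    payload = value;                  /* regular store */
    __sync_fetch_and_add(&ready, 1);  /* atomic RMW, intended as a full fence */
}

/* Reader side: the plain load of `payload` must not be hoisted
   above the atomic op that checks the flag.
   Returns 1 and fills *out if a value was published, else 0. */
int try_consume(int *out) {
    /* Adding 0 is an atomic read of `ready` with the same fence semantics. */
    if (__sync_fetch_and_add(&ready, 0) == 0)
        return 0;
    *out = payload;                   /* regular load */
    return 1;
}
```

With one thread calling publish and another polling try_consume, a reader that sees ready != 0 should always see the published payload. If LLVM were allowed to move the plain store or load across the atomic instruction, that guarantee would break, which is exactly the thing we need to rule out.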