I think so. Atomically reading and writing a single memory location
(which CAS does) is just a very simple transaction. But using a CAS
instruction should be more efficient, since STM has to maintain a
transaction log and commit transactions, which creates some overhead.

Ah, I see. In that case, it may be worthwhile to implement the CAS instruction in terms of STM as well and measure the performance difference this makes for the final data structure. After all, STM is a lot more compositional than CAS, so I'd like to know whether the loss of expressiveness is worth the gain in performance.

There's one annoying hitch with doing apples-to-apples comparisons here.

The problem is that CAS falls short of being a hardware-accelerated version of a simple transaction (read TVar, (==) against expected value, conditionally update TVar), because CAS is based on pointer equality rather than true equality. (eq? rather than equal? for any Schemers out there.)

For this reason our "Fake" version of CAS -- for older GHCs and for performance comparison -- has to use reallyUnsafePointerEquality#:

http://hackage.haskell.org/packages/archive/IORefCAS/0.2/doc/html/Data-CAS-Internal-Fake.html

But we do provide a "CASable" type class in that package which is precisely for comparing the performance of:

Hardware CAS
Fake CAS -- atomicModifyIORef + ptrEquality
Foreign CAS -- gcc intrinsic + function call

Cheers,

-Ryan