Just to keep you all up to date...  I'm adding the primops in question and validating the individual commits before putting them here:

    https://github.com/rrnewton/ghc/commits/atomicPrimOps

The basic idea for using these extensions is:
  • the atomic-primops library will work in 7.6 or 7.7+.  It will use ifdefs to decide whether to use its own primops or GHC's built-in ones
  • future versions will simply get faster, as Carter replaces the out-of-line primops that *also* use C calls with inline primops / LLVM equivalents
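
A minimal sketch of the ifdef idea, assuming the switch is made on the compiler version via the standard `__GLASGOW_HASKELL__` CPP macro (707 corresponds to 7.7).  The name `casBackend` and the strings are hypothetical and only stand in for the real bindings the library would choose between:

```haskell
{-# LANGUAGE CPP #-}
module Main (main) where

-- Hypothetical illustration: pick an implementation at compile time
-- based on the GHC version.  The real library would bind primops here
-- rather than strings.
casBackend :: String
#if __GLASGOW_HASKELL__ >= 707
casBackend = "GHC built-in primops"
#else
casBackend = "primops shipped with the library"
#endif

main :: IO ()
main = putStrLn ("CAS backend: " ++ casBackend)
```

Client code would see the same interface either way; only the definition behind it changes with the compiler.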
Shall I stick a patch on a ticket, or will someone volunteer to pull?  What's the protocol for requesting commit access anyway?  (By the way, can someone share the reason that pull-requests to the github ghc mirror are such a no-no?  They seem no worse than a patch in an email which the big warning sign recommends.)

Best,
  -Ryan

P.S. FYI, I'm periodically getting these: 

   0 caused framework failures
   0 unexpected passes
   1 unexpected failures

     Unexpected failures:
        perf/compiler  T1969 [stat not good enough] (normal)

Can that just be because of running on a loaded machine?  How narrow are these windows?


On Thu, Aug 1, 2013 at 12:32 PM, Ryan Newton <rrnewton@gmail.com> wrote:
On Sun, Jul 21, 2013 at 3:32 AM, Carter Schonwald <carter.schonwald@gmail.com> wrote:
ok, could you add those comments (about additional operations to consider) to the ticket?

Sure.  Just did that.
 
relatedly: if we want these atomic ops to use the sequential analogues when we're not using the threaded runtime system, does that mean we need to have a symbol / constant variable exposed in the RTS we link in, so that the inline code branches on a link-time constant value / symbol (something like "isThreadedRTS :: Bool") or some analogue thereof?

I think it will take some care to mimic the semantics perfectly.  Why not just leave the real atomic ops even in non-threaded mode, at least at first?  Later we can optimize it if we find that people are using concurrent data structures heavily in non-threaded mode ;-). 
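
As a rough illustration of "leave the real atomic ops in both modes", here is a sketch that uses `atomicModifyIORef'` from base as a stand-in for the CAS primops under discussion (it is not the atomic-primops API).  `runCounter` is a hypothetical helper; `rtsSupportsBoundThreads` from `Control.Concurrent` is only used to report which RTS the program was linked against:

```haskell
module Main (main) where

import Control.Concurrent (forkIO, rtsSupportsBoundThreads)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM_)
import Data.IORef (atomicModifyIORef', newIORef, readIORef)

-- Hypothetical helper: fork `workers` threads that each increment a
-- shared counter `perWorker` times, always via the atomic operation.
runCounter :: Int -> Int -> IO Int
runCounter workers perWorker = do
  counter <- newIORef (0 :: Int)
  done    <- newEmptyMVar
  forM_ [1 .. workers] $ \_ -> forkIO $ do
    replicateM_ perWorker $
      atomicModifyIORef' counter (\n -> (n + 1, ()))
    putMVar done ()
  -- Wait for every worker to signal completion.
  replicateM_ workers (takeMVar done)
  readIORef counter

main :: IO ()
main = do
  total <- runCounter 4 10000
  putStrLn ("threaded RTS: " ++ show rtsSupportsBoundThreads)
  putStrLn ("total: " ++ show total)
```

The same code gives a total of 40000 whether or not it is linked with -threaded, with no branch on the RTS flavour in the program itself.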
 
one nice thing about doing so is that if at some point link-time optimization is added, the branch would go away! On the other hand, it could be argued that the call to the CAS primops in their current form isn't much more expensive than such a branch. 

Indeed, I'm much more concerned about performance in the threaded case and making sure they're correct.