Hi Carter,
Yes, SMP.h is where I've copy-pasted the duplicated functionality from (since I can't presently rely on linking against those symbols).
Your proposal for the LLVM backend sounds *great*. But it is also going to impose additional constraints on getting "atomic-primops" right.
The goal of atomic-primops is to be a stable Haskell-level interface to the relevant CAS and fetch-and-add functionality. This is important because one has to be very careful to defeat the GHC optimizer in all the relevant places and make pointer equality a reliable property. I would like atomic-primops to work reliably on 7.4, 7.6 [and 7.8], with more "native" support in future GHC releases, where the foreign primops might become unnecessary. (They are a pain and have already exposed one blocking cabal bug, fixed in the upcoming 1.17.)
A couple additional suggestions for the proposal in ticket #7883:
- we should use more distinctive symbol names than "cas", especially for this rewriting trick. How about "ghc_cas" or something?
- it would be great to get at least fetch-and-add in addition to CAS and barriers
- if we reliably provide this set of special symbols, libraries like atomic-primops can call them from their .cmm code and benefit from the Cmm->LLVM substitutions
- if we include all the primops I need in GHC proper the previous bullet will stop applying ;-)
Cheers,
-Ryan
P.S. Just as a bit of motivation, here are some recent performance numbers. We often wonder how close our "pure values in a box" approach comes to efficient lock-free structures. Well, here are some numbers comparing a proper unboxed counter in the Haskell heap against an IORef Int updated with atomicModifyIORef': up to a 100X performance difference on some platforms, for microbenchmarks that hammer a counter:
And here are the performance and scaling advantages of a ChaseLev deque (based on atomic-primops) over a traditional pure-in-a-box structure (an IORef holding a Data.Seq). The following are timings for ChaseLev and the traditional structure, respectively, on a 32-core Westmere:
fib(42) 1 threads: 21s
fib(42) 2 threads: 10.1s
fib(42) 4 threads: 5.2s (100%prod)
fib(42) 8 threads: 2.7s - 3.2s (100%prod)
fib(42) 16 threads: 1.28s
fib(42) 24 threads: 1.85s
fib(42) 32 threads: 4.8s (high variance)
(hive) fib(42) 1 threads: 41.8s (95% prod)
(hive) fib(42) 2 threads: 25.2s (66% prod)
(hive) fib(42) 4 threads: 14.6s (27% prod, 135GB alloc)
(hive) fib(42) 8 threads: 17.1s (26% prod)
(hive) fib(42) 16 threads: 16.3s (13% prod)
(hive) fib(42) 24 threads: 21.2s (30% prod)
(hive) fib(42) 32 threads: 29.3s (33% prod)
And that is WITH the inefficiency of doing a "ccall" on every single atomic operation.
Notes on parfib performance are here: