Hi Carter,
Yes, SMP.h is where I've copy-pasted the duplicated functionality from (since I can't presently rely on linking against those symbols).
Your proposal for the LLVM backend sounds *great*. But it is also going to impose additional constraints on getting "atomic-primops" right.
The goal of atomic-primops is to be a stable Haskell-level interface to the relevant CAS and fetch-and-add functionality. This is important because one has to be very careful to defeat the GHC optimizer in all the relevant places and make pointer equality a reliable property. I would like atomic-primops to work reliably on 7.4, 7.6 [and 7.8], with more "native" support in future GHC releases, where the foreign primops might become unnecessary. (They are a pain and have already exposed one blocking cabal bug, fixed in the upcoming 1.17.)
A couple additional suggestions for the proposal in ticket #7883:
- we should use more distinctive symbol names than "cas", especially for this rewriting trick. How about "ghc_cas" or something?
- it would be great to get at least fetch-and-add in addition to CAS and barriers
- if we reliably provide this set of special symbols, libraries like atomic-primops can call them from their .cmm code and benefit from the Cmm->LLVM substitutions
- if we include all the primops I need in GHC proper the previous bullet will stop applying ;-)
Cheers,
-Ryan
P.S. Just as a bit of motivation, here are some recent performance numbers. We often wonder how close our "pure values in a box" approach comes to efficient lock-free structures. Well, here are some numbers comparing a proper unboxed counter in the Haskell heap against an IORef Int updated with atomicModifyIORef': up to a 100X performance difference on some platforms, for microbenchmarks that hammer a counter:
And here are the performance and scaling advantages of a ChaseLev deque (based on atomic-primops) over a traditional pure-in-a-box structure (an IORef holding a Data.Seq). The following are timings for ChaseLev and the traditional structure, respectively, on a 32-core Westmere:
fib(42) 1 threads: 21s
fib(42) 2 threads: 10.1s
fib(42) 4 threads: 5.2s (100%prod)
fib(42) 8 threads: 2.7s - 3.2s (100%prod)
fib(42) 16 threads: 1.28s
fib(42) 24 threads: 1.85s
fib(42) 32 threads: 4.8s (high variance)
(hive) fib(42) 1 threads: 41.8s (95% prod)
(hive) fib(42) 2 threads: 25.2s (66% prod)
(hive) fib(42) 4 threads: 14.6s (27% prod, 135GB alloc)
(hive) fib(42) 8 threads: 17.1s (26% prod)
(hive) fib(42) 16 threads: 16.3s (13% prod)
(hive) fib(42) 24 threads: 21.2s (30% prod)
(hive) fib(42) 32 threads: 29.3s (33% prod)
And that is WITH the inefficiency of doing a "ccall" on every single atomic operation.
Notes on parfib performance are here: