Proposal: provide cas and barrier symbols even without -threaded

The "atomic-primops" library depends on symbols such as store_load_barrier and "cas", which are defined in SMP.h. Thus the result is that if the program is linked WITHOUT "-threaded", the user gets a linker error about undefined symbols. The specific place it's used is in the 'foreign "C"' bits of this .cmm code: https://github.com/rrnewton/haskell-lockfree-queue/blob/87e63b21b2a6c375e93c... I'm trying to explore hacks that will enable me to pull in those functions during compile time, without duplicating a whole bunch of code from the RTS. But it's a fragile business. It seems to me that some of these routines have general utility. In future versions of GHC, could we consider linking in those routines irrespective of "-threaded"? -Ryan

I want to note one thing: if we did link in cas/store_load_barrier, then your lockfree queue would always be synchronized, even if you didn't compile with -threaded. Perhaps this is not a big deal, but it is generally nice not to pay the cost of synchronization when it is unnecessary. So it would be better if there were threaded/non-threaded variants which you could use instead. How does that sound? (Fortunately, you are not inlining the functions, so it's entirely possible to do this; we'd have a tougher row to hoe if you needed to inline them.)

Edward
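To make the suggestion concrete, one way to expose both behaviours is simply two symbols, one synchronized and one not, and let the caller pick. A minimal sketch with made-up names (nothing below exists in GHC or atomic-primops):

    /* Hypothetical pair of variants; the aprim_ names are invented here. */
    #include <stdint.h>

    typedef uintptr_t StgWord;

    /* Synchronized variant, for programs linked with -threaded. */
    StgWord aprim_cas_sync(volatile StgWord *p, StgWord old, StgWord new_)
    {
        return __sync_val_compare_and_swap(p, old, new_);
    }

    /* Unsynchronized variant: a plain read-modify-write, which is enough
       when only one OS thread can ever be running Haskell code. */
    StgWord aprim_cas_unsync(volatile StgWord *p, StgWord old, StgWord new_)
    {
        StgWord prev = *p;
        if (prev == old) { *p = new_; }
        return prev;
    }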
The "atomic-primops" library depends on symbols such as store_load_barrier and "cas", which are defined in SMP.h. Thus the result is that if the program is linked WITHOUT "-threaded", the user gets a linker error about undefined symbols.
The specific place it's used is in the 'foreign "C"' bits of this .cmm code:
https://github.com/rrnewton/haskell-lockfree-queue/blob/87e63b21b2a6c375e93c...
I'm trying to explore hacks that will enable me to pull in those functions during compile time, without duplicating a whole bunch of code from the RTS. But it's a fragile business.
It seems to me that some of these routines have general utility. In future versions of GHC, could we consider linking in those routines irrespective of "-threaded"?
-Ryan

Edward,

This makes sense to me, especially because eliding synchronization is already the convention followed in SMP.h, where, for example, write_barrier becomes a no-op if !THREADED_RTS. All I would need are linkable symbols for those no-ops (a la Inlines.c, https://github.com/ghc/ghc/blob/master/rts/Inlines.c), not just the #defines that are currently in SMP.h.

I think providing these symbols *reliably* would be complementary to Carter's proposal to handle them better in the LLVM backend. In fact, Carter's proposal is more motivation for me to be using the "official" versions in my .cmm "ccalls".

Right now I've literally copy-pasted the relevant code from SMP.h into C functions called "DUP_cas", "DUP_write_barrier", etc. (yuck). And these duplicated versions would be missed by the CMM->LLVM conversion Carter has proposed.
-Ryan
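Concretely, the copy-pasted shims would look something like this (a sketch only; the DUP_ names are the ones mentioned above, and the bodies are lightly adapted from the x86 branch of SMP.h quoted later in this thread):

    /* Sketch of the copy-paste workaround: duplicate the RTS code so the
       symbols exist even when the program is linked without -threaded. */
    typedef unsigned long StgWord;
    typedef volatile StgWord *StgVolatilePtr;

    StgWord DUP_cas(StgVolatilePtr p, StgWord o, StgWord n)
    {
        __asm__ __volatile__ (
            "lock; cmpxchg %3,%1"
            : "=a" (o), "+m" (*p)
            : "0" (o), "r" (n)
            : "cc", "memory");
        return o;
    }

    /* x86 stores are not reordered with other stores, so the write barrier
       only has to stop compiler reordering. */
    void DUP_write_barrier(void)
    {
        __asm__ __volatile__ ("" : : : "memory");
    }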

We should make the non-THREADED versions EXTERN_INLINE too, so that there will be (empty) functions to call in rts/Inlines.c. Want to submit a patch?

A better solution would be to make them into primops. You don't really want to be calling out to a C function to implement a memory barrier. We have this for write_barrier(), but none of the others so far. Of course that's a larger change.

Cheers,
Simon
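For context, the C-level trick Simon is pointing at is roughly the following (an illustrative sketch using plain C99 inline; GHC spells it with its EXTERN_INLINE macro and rts/Inlines.c, and the details there differ):

    /* barriers.h -- sketch of the extern-inline pattern.  Ordinary callers
       inline the (empty) body; one designated translation unit also emits a
       real, linkable symbol, so that callers which cannot inline -- such as
       'foreign "C"' calls from .cmm code -- still have something to link to. */
    #ifndef BARRIERS_H
    #define BARRIERS_H

    #ifndef EXTERN_INLINE
    #define EXTERN_INLINE inline      /* C99: inline definition only, no symbol */
    #endif

    EXTERN_INLINE void write_barrier(void) { /* no-op in the non-threaded RTS */ }

    #endif /* BARRIERS_H */

    /* inlines.c -- the one file that forces out-of-line copies to exist
       (this is the role rts/Inlines.c plays in GHC). */
    #define EXTERN_INLINE extern inline   /* C99: emit the external definition */
    #include "barriers.h"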

I guess I should find the time to finish the CAS primop work I volunteered to do, then. I'll look into it in a few days.

-Carter

Yes, I'd absolutely rather not suffer C call overhead for these functions (or the CAS functions). But isn't that how it's done currently for the casMutVar# primop?

https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts...

To avoid the overhead, is it necessary to make each primop in-line rather than out-of-line, or just to get rid of the "ccall"?

Another reason it would be good to package these with GHC is that I'm having trouble building robust libraries of foreign primops that work under all "ways" (e.g. GHCi). For example, this bug:

https://github.com/rrnewton/haskell-lockfree-queue/issues/10

If I write .cmm code that depends on RTS functionality like stg_MUT_VAR_CLEAN_info, it seems to work fine in compiled mode (with/without threading, profiling), but I get link errors from GHCi, where these symbols aren't defined.

I've got a draft of the relevant primops here:

https://github.com/rrnewton/haskell-lockfree-queue/blob/master/AtomicPrimops...

which includes:

- variants of CAS for MutableArray# and MutableByteArray#
- fetch-and-add for MutableByteArray#

Also, there are some tweaks to support the new "ticketed" interface for safer CAS:

http://hackage.haskell.org/packages/archive/atomic-primops/0.3/doc/html/Data...

I started adding some of these primops to GHC proper (still as out-of-line), but not all of them. I had gone with the foreign primop route instead:

https://github.com/rrnewton/ghc/commits/master

-Ryan

P.S. Where is the write barrier primop? I don't see it listed in prelude/primops.txt...
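As a rough illustration of the inline-versus-ccall question (GCC/Clang builtins are used below purely for contrast; this is not what GHC's code generator emits):

    #include <stdint.h>

    typedef uintptr_t StgWord;

    /* Out-of-line style: every operation pays a full C call into the RTS. */
    extern StgWord cas(StgWord *p, StgWord o, StgWord n);

    StgWord bump_via_ccall(StgWord *ctr)
    {
        StgWord old, new_;
        do {                                    /* call + retry loop per increment */
            old  = *ctr;
            new_ = old + 1;
        } while (cas(ctr, old, new_) != old);
        return new_;
    }

    /* Inline style: the compiler emits a single locked instruction here
       (lock xadd on x86), with no call overhead at all. */
    StgWord bump_inline(StgWord *ctr)
    {
        return __sync_add_and_fetch(ctr, 1);
    }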

By the way, if anyone knows the proper way to fix this bug, it would be greatly appreciated. I don't know the right way to make sure a library gets linked against RTS symbols like "stg_MUT_VAR_CLEAN_info" even when it is loaded by GHCi.
For example, this bug:
https://github.com/rrnewton/haskell-lockfree-queue/issues/10

Ryan,

If you look at line 270 there, you'll see the CAS is a C call:

https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts...

What Simon is alluding to is some work I started (but need to finish); http://ghc.haskell.org/trac/ghc/ticket/7883 is the relevant ticket, and I'll need to sort out doing the same on the native code gen too.

There are no write-barrier primops; they're baked into the CAS machinery in GHC's RTS.

Ryan, the relevant machinery on the C side is here, in ./includes/stg/SMP.h (unless I'm missing something):

https://github.com/ghc/ghc/blob/7cc8a3cc5c2970009b83844ff9cc4e27913b8559/inc...
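For readers following along, the x86 branch of those barriers looks roughly like this (a paraphrase, not the verbatim header; other architectures sit behind their own #ifdefs):

    /* Rough paraphrase of the x86 barriers in includes/stg/SMP.h. */

    /* Stores are not reordered with other stores on x86, so a write
       barrier only needs to stop the compiler from reordering. */
    static inline void write_barrier(void) {
        __asm__ __volatile__ ("" : : : "memory");
    }

    /* Loads are likewise not reordered with other loads. */
    static inline void load_load_barrier(void) {
        __asm__ __volatile__ ("" : : : "memory");
    }

    /* Store-load reordering does happen, so this one needs a real fence.
       SMP.h uses a locked add (see the question about mfence later in this
       thread); mfence is shown for the 64-bit case only to keep the sketch
       simple and compilable. */
    static inline void store_load_barrier(void) {
    #if defined(__i386__)
        __asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory");
    #else
        __asm__ __volatile__ ("mfence" : : : "memory");
    #endif
    }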

Hi Carter,

Yes, SMP.h is where I've copy-pasted the duplicate functionality from (since I can't presently rely on linking the symbols).

Your proposal for the LLVM backend sounds great. But it is also going to provide additional constraints for getting "atomic-primops" right. The goal of atomic-primops is to be a stable Haskell-level interface to the relevant CAS and fetch-and-add stuff. The reason this is important is that one has to be very careful to defeat the GHC optimizer in all the relevant places and make pointer equality a reliable property. I would like to get atomic-primops to work reliably in 7.4, 7.6 [and 7.8] and have more "native" support in future GHC releases, where maybe the foreign primops would become unnecessary. (They are a pain and have already exposed one blocking cabal bug, fixed in the upcoming 1.17.)

A couple of additional suggestions for the proposal in ticket #7883:

- We should use more unique symbols than "cas", especially for this rewriting trick. How about "ghc_cas" or something?
- It would be great to get at least fetch-and-add in addition to CAS and barriers.
- If we reliably provide this set of special symbols, libraries like atomic-primops may use them in their .cmm and benefit from the CMM->LLVM substitutions.
- If we include all the primops I need in GHC proper, the previous bullet will stop applying ;-)

Cheers,
-Ryan

P.S. Just as a bit of motivation, here are some recent performance numbers. We often wonder how close our "pure values in a box" approach comes to efficient lock-free structures. Here are some numbers for using a proper unboxed counter in the Haskell heap vs. an IORef Int with atomicModifyIORef': up to a 100X performance difference on some platforms for microbenchmarks that hammer a counter:

https://github.com/rrnewton/haskell-lockfree-queue/blob/fb12d1121690553e4f73...

And here are the performance and scaling advantages of using ChaseLev (based on atomic-primops) over a traditional pure-in-a-box structure (IORef Data.Seq). The following are timings of ChaseLev and the traditional structure, respectively, on a 32-core Westmere:

fib(42) 1 threads: 21s
fib(42) 2 threads: 10.1s
fib(42) 4 threads: 5.2s (100% prod)
fib(42) 8 threads: 2.7s - 3.2s (100% prod)
fib(42) 16 threads: 1.28s
fib(42) 24 threads: 1.85s
fib(42) 32 threads: 4.8s (high variance)

(hive) fib(42) 1 threads: 41.8s (95% prod)
(hive) fib(42) 2 threads: 25.2s (66% prod)
(hive) fib(42) 4 threads: 14.6s (27% prod, 135GB alloc)
(hive) fib(42) 8 threads: 17.1s (26% prod)
(hive) fib(42) 16 threads: 16.3s (13% prod)
(hive) fib(42) 24 threads: 21.2s (30% prod)
(hive) fib(42) 32 threads: 29.3s (33% prod)

And that is WITH the inefficiency of doing a "ccall" on every single atomic operation.

Notes on parfib performance are here:

https://github.com/rrnewton/haskell-lockfree-queue/blob/d6d3e9eda2a487a5f055...
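A minimal sketch of the "more unique symbols" idea (purely illustrative: ghc_cas is just the name floated above, and the other names and wrapper bodies are invented for this sketch and exist nowhere in GHC):

    /* Hypothetical uniquely-named entry points.  A library's .cmm code would
       ccall these, and a CMM->LLVM pass could recognize the ghc_ prefix and
       substitute the corresponding LLVM atomic instruction. */
    #include <stdint.h>

    typedef uintptr_t StgWord;

    StgWord ghc_cas(volatile StgWord *p, StgWord old, StgWord new_)
    {
        return __sync_val_compare_and_swap(p, old, new_);
    }

    StgWord ghc_fetch_add(volatile StgWord *p, StgWord incr)
    {
        return __sync_fetch_and_add(p, incr);
    }

    void ghc_store_load_barrier(void)
    {
        __sync_synchronize();
    }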

Ryan, could you explain what you want more precisely? Specifically, what do you want in terms of exposed primops, using the terminology/vocabulary in http://llvm.org/docs/LangRef.html#ordering and http://llvm.org/docs/Atomics.html?

I'll first do the work for just the LLVM backend, and I'll likely need some active guidance/monitoring for the native codegen analogues.

(I also asked this on the ticket, for documentation purposes.)

Sorry, "rewrite" was too overloaded a term to use here. I was just referring to the proposal to "substitute the cas funcall with the right llvm operation". That is, the approach would pattern match for the CMM code "ccall cas" or "foreign "C" cas" (I'm afraid I don't know the difference between those) and replace it with the equivalent LLVM op, right? I think the assumption there is that the native codegen would still have to suffer the funcall overhead and use the C versions. I don't know exactly what the changes would look like to make barriers/CAS all proper inline primops, because it would have to reproduce in the code generator all the platform-specific #ifdef'd C code that is currently in SMP.h. Which I guess is doable, but probably only for someone who knows the native GHC codegen properly... On Sat, Jul 20, 2013 at 2:30 AM, Carter Schonwald < carter.schonwald@gmail.com> wrote:

Ryan, you misunderstand (or maybe I'm not understanding quite). It is 3:30 am after all! (I might be better at explaining tomorrow afternoon.)

The idea is to provide CMM/Haskell-level primops, not to "pattern match on the ccall". I leave updating any cmm code to use such intrinsics as a distinct task to be done subsequently :)

If you look at the example patches for popcount that David Terei referred me to, https://github.com/ghc/ghc/commit/2d0438f329ac153f9e59155f405d27fac0c43d65 (for the native code gen) and https://github.com/ghc/ghc/commit/2906db6c3a3f1000bd7347c7d8e45e65eb2806cb (for the llvm code gen), the pattern is pretty clear: adding new "first class" primops.

Point being, don't worry about that right now (it's 3am after all). What I want from you is a clear description of the CMM / Haskell level PrimOps you want for making your life easier in supporting great parallelism in GHC, in terms of those LLVM operations and their semantics that I've referred you to.

What the final names of these will be can be bikeshedded some other time; it doesn't matter currently. For now, please read my ticket and the LLVM links when you have the bandwidth, and lay out what you'd want primop-wise!

thanks
-Carter

Ah, I see. There are several ways this could be done. With the "substitute the cas funcall" line I thought you were going for an intermediate solution that would help the LLVM backend but not the native codegen. I was thinking you would leave the out-of-line primop definition for, e.g., casMutVar#, but fix the ccall to "cas" within that primop, so that you don't need a C function call sequence. But it sounds like you are going whole hog and going right for inline primops! Great.

Actually, there are some places where I am ignorant of what optimizations the backend(s) can do (and I haven't been able to learn the answer from the commentary yet: http://ghc.haskell.org/trac/ghc/wiki/Commentary/PrimOps). For example, I assume calls to C are never inlinable, but *are "out of line" primops inlinable*? You alluded to the double call overhead -- first for the out-of-line casMutVar# and then to the C function "cas". Does that mean "no", they are not inlinable? (There is one sentence in the commentary that makes it sound like "no": *This also changes the code generator to push the continuation of any follow-on code onto the stack.*)

One thing that I now understand, looking at Tibbe's patches, is that going to inline primops does NOT necessarily mean forgoing FFI calls. That patch still uses emitForeignCall within emitPopCntCall. Is that what you were planning to do for the atomic primops? The alternative, which seemed laborious, is to take code like this:

    cas(StgVolatilePtr p, StgWord o, StgWord n)
    {
    #if i386_HOST_ARCH || x86_64_HOST_ARCH
        __asm__ __volatile__ (
            "lock\ncmpxchg %3,%1"
            :"=a"(o), "=m" (*(volatile unsigned int *)p)
            :"0" (o), "r" (n));
        return o;
    #elif powerpc_HOST_ARCH
        ....

and embed its logic within the codegen for the inline primops.

-----------------------------------------------------------------------------

Anyway, to answer your question about which primops I'd like to see:

- CAS on MutVar#, MutableArray#, and MutableByteArray#
- fetch-and-add on MutableByteArray#
- barriers / memory fences

Drafts of the .cmm for these can be found here:

https://github.com/rrnewton/haskell-lockfree-queue/blob/master/AtomicPrimops...

Note that *only* casMutVar# is currently shipped with GHC. These are the ones I'm using currently, but there's no reason we shouldn't aim for a fairly "complete set". For example, why not have fetch-and-sub and the other "atomicrmw" variants?

Relating these to the LLVM atomics and memory orderings, they become:

- CAS variants = LLVM cmpxchg with SequentiallyConsistent ordering
- fetch-and-X variants = LLVM atomicrmw with SequentiallyConsistent ordering
- store_load_barrier = an LLVM FenceInst with SequentiallyConsistent ordering
- write_barrier and load_load_barrier = I *think* these are both covered by a FenceInst with AcquireRelease ordering [1]... Someone else double-checking these would be good, since I'm not yet familiar with LLVM and am just going off the documentation you linked.

Btw, I'm not sure why SMP.h uses "lock; addl $0,0(%%esp)" instead of the mfence instruction for store_load_barrier on x86, but I believe they should be the same.

-Ryan

[1] I note that the LLVM documentation says "store-store fences are generally not exposed to IR because they are extremely difficult to use correctly."
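As a rough cross-check of that mapping, here is how the same wish-list reads in C11 atomics terms (an analogy only; GHC does not use C11 atomics here, and the correspondence to the LLVM orderings is one reading of the documents linked above):

    #include <stdatomic.h>
    #include <stdint.h>

    /* CAS with sequentially consistent ordering (cf. LLVM cmpxchg seq_cst).
       Returns the value actually observed, like the RTS cas() convention. */
    uintptr_t cas_seq_cst(_Atomic uintptr_t *p, uintptr_t expected, uintptr_t desired)
    {
        atomic_compare_exchange_strong_explicit(
            p, &expected, desired,
            memory_order_seq_cst, memory_order_seq_cst);
        return expected;
    }

    /* Fetch-and-add with sequentially consistent ordering (cf. atomicrmw add). */
    uintptr_t fetch_add_seq_cst(_Atomic uintptr_t *p, uintptr_t n)
    {
        return atomic_fetch_add_explicit(p, n, memory_order_seq_cst);
    }

    /* A store-load barrier needs a full fence (cf. LLVM fence seq_cst);
       on x86 this is where mfence, or SMP.h's locked add, comes in. */
    void store_load_fence(void)
    {
        atomic_thread_fence(memory_order_seq_cst);
    }

    /* Weaker barriers (cf. fence acq_rel); on x86 these typically compile
       to nothing more than a compiler barrier. */
    void acq_rel_fence(void)
    {
        atomic_thread_fence(memory_order_acq_rel);
    }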
Ryan, you misunderstand (or maybe i'm not understanding quite). It is 330 am after all! (I might be better at explaining tomorrow afternoon)
the idea is to provide CMM/haskell level primops, not to "pattern match on the ccall". I leave the updating of any cmm code to use such intrinsics as distinct task to be done subsequently :)
If you look at the example patches for pop count that David Terei referred me to, https://github.com/ghc/ghc/commit/2d0438f329ac153f9e59155f405d27fac0c43d65(f... the native code gen) and https://github.com/ghc/ghc/commit/2906db6c3a3f1000bd7347c7d8e45e65eb2806cbfo... the llvm code gen, the pattern is pretty clear, adding new "first class" primiops
Point being, dont' worry about that right now, (its 3am after all)
What I want from you is a clear description of the CMM / Haskell level PrimOps you want for making your life easier in supporting great parallelism in GHC, in terms of those LLVM operations and their semantics that I've referred you to.
what the final names of these will be can be bike shedded some other time, doesn't matter currently. For now, please read my ticket and the llvm links when you have the bandwidth, and layout what you'd want primop wise!
thanks -Carter
On Sat, Jul 20, 2013 at 2:47 AM, Ryan Newton
wrote: Sorry, "rewrite" was too overloaded a term to use here. I was just referring to the proposal to "substitute the cas funcall with the right llvm operation".
That is, the approach would pattern match for the CMM code "ccall cas" or "foreign "C" cas" (I'm afraid I don't know the difference between those) and replace it with the equivalent LLVM op, right?
I think the assumption there is that the native codegen would still have to suffer the funcall overhead and use the C versions. I don't know exactly what the changes would look like to make barriers/CAS all proper inline primops, because it would have to reproduce in the code generator all the platform-specific #ifdef'd C code that is currently in SMP.h. Which I guess is doable, but probably only for someone who knows the native GHC codegen properly...
On Sat, Jul 20, 2013 at 2:30 AM, Carter Schonwald < carter.schonwald@gmail.com> wrote:
Ryan, could you explain what you want more precisely? Specifically what you want in terms of exposed primops using the terminology / vocabulary in http://llvm.org/docs/LangRef.html#ordering and http://llvm.org/docs/Atomics.html ?
I'll first do the work for just the LLVM backend, and I'll likely need some active guidance / monitoring for the native codegen analogues
(also asked this on ticket for documentation purposes)
On Sat, Jul 20, 2013 at 2:18 AM, Ryan Newton
wrote: Hi Carter,
Yes, SMP.h is where I've copy pasted the duplicate functionality from (since I can't presently rely on linking the symbols).
Your proposal for the LLVM backend sounds **great**. But it also is going to provide additional constraints for getting "atomic-primops" right.
The goal of atomic-primops is to be a stable Haskell-level interface into the relevant CAS and fetch-and-add stuff. The reason this is important is that one has to be very careful to defeat the GHC optimizer in all the relevant places and make pointer equality a reliable property. I would like to get atomic-primops to work reliably in 7.4, 7.6 [and 7.8] and have more "native" support in future GHC releases, where maybe the foreign primops would become unnecessary. (They are a pain and have already exposed one blocking cabal bug, fixed in the upcoming 1.17.)
A couple additional suggestions for the proposal in ticket #7883:
 - we should use more unique symbols than "cas", especially for this rewriting trick. How about "ghc_cas" or something?
 - it would be great to get at least fetch-and-add in addition to CAS and barriers
 - if we reliably provide this set of special symbols, libraries like atomic-primops may use them in the .cmm and benefit from the CMM->LLVM substitutions (sketched below)
 - if we include all the primops I need in GHC proper the previous bullet will stop applying ;-)
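For the third bullet, the Haskell side of such a .cmm-backed "foreign primop" looks roughly like this (the symbol name, module name, and wrapper are illustrative, not the actual atomic-primops code, and it only links if a matching Cmm definition is compiled in):

    {-# LANGUAGE ForeignFunctionInterface, GHCForeignImportPrim,
                 MagicHash, UnboxedTuples, UnliftedFFITypes #-}
    -- Sketch of the foreign-primop pattern; stg_fetchAddByteArrayzh is an
    -- assumed Cmm entry point supplied by the library's .cmm file.
    module FetchAddSketch (fetchAddByteArray) where

    import GHC.Exts
    import GHC.IO (IO (..))

    foreign import prim "stg_fetchAddByteArrayzh" fetchAddByteArray#
      :: MutableByteArray# RealWorld -> Int# -> Int#
      -> State# RealWorld -> (# State# RealWorld, Int# #)

    -- | Atomically add 'delta' to the Int-sized slot at index 'ix',
    -- returning the value that was there before the addition.
    fetchAddByteArray :: MutableByteArray# RealWorld -> Int -> Int -> IO Int
    fetchAddByteArray mba (I# ix) (I# delta) = IO $ \s ->
      case fetchAddByteArray# mba ix delta s of
        (# s', old #) -> (# s', I# old #)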
Cheers, -Ryan
P.S. Just as a bit of motivation, here are some recent performance numbers. We often wonder about how close our "pure values in a box" approach comes to efficient lock-free structures. Well here are some numbers about using a proper unboxed counter in the Haskell heap, vs using an IORef Int and atomicModifyIORef': Up to 100X performance difference on some platforms for microbenchmarks that hammer a counter:
https://github.com/rrnewton/haskell-lockfree-queue/blob/fb12d1121690553e4f73...
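For reference, the IORef side of that comparison is just a counter hammered with atomicModifyIORef' from a few forkIO'd threads; a minimal self-contained version is below (thread and iteration counts are arbitrary, and the unboxed-counter side, which needs atomic-primops, is omitted):

    -- Baseline only: contended counter updates through an IORef.
    import Control.Concurrent
    import Control.Monad (forM_, replicateM_)
    import Data.IORef

    main :: IO ()
    main = do
      r    <- newIORef (0 :: Int)
      done <- newEmptyMVar
      let threads = 4
          iters   = 1000000 :: Int
      forM_ [1 .. threads] $ \_ -> forkIO $ do
        replicateM_ iters (atomicModifyIORef' r (\n -> (n + 1, ())))
        putMVar done ()
      replicateM_ threads (takeMVar done)
      print =<< readIORef r   -- expect threads * iters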
And here are the performance and scaling advantages of using ChaseLev (based on atomic-primops) over a traditional pure-in-a-box structure (IORef Data.Seq). The following are timings of ChaseLev/traditional respectively on a 32-core Westmere:
fib(42) 1 threads:  21s
fib(42) 2 threads:  10.1s
fib(42) 4 threads:  5.2s (100% prod)
fib(42) 8 threads:  2.7s - 3.2s (100% prod)
fib(42) 16 threads: 1.28s
fib(42) 24 threads: 1.85s
fib(42) 32 threads: 4.8s (high variance)

(hive) fib(42) 1 threads:  41.8s (95% prod)
(hive) fib(42) 2 threads:  25.2s (66% prod)
(hive) fib(42) 4 threads:  14.6s (27% prod, 135GB alloc)
(hive) fib(42) 8 threads:  17.1s (26% prod)
(hive) fib(42) 16 threads: 16.3s (13% prod)
(hive) fib(42) 24 threads: 21.2s (30% prod)
(hive) fib(42) 32 threads: 29.3s (33% prod)
And that is WITH the inefficiency of doing a "ccall" on every single atomic operation.
Notes on parfib performance are here:
https://github.com/rrnewton/haskell-lockfree-queue/blob/d6d3e9eda2a487a5f055...
On Fri, Jul 19, 2013 at 5:05 PM, Carter Schonwald < carter.schonwald@gmail.com> wrote:
ryan, the relevant machinery on the C side is here, see ./includes/stg/SMP.h : https://github.com/ghc/ghc/blob/7cc8a3cc5c2970009b83844ff9cc4e27913b8559/inc...
(unless i'm missing something)
On Fri, Jul 19, 2013 at 4:53 PM, Carter Schonwald < carter.schonwald@gmail.com> wrote:
Ryan, if you look at line 270, you'll see the CAS is a C call https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts...
What Simon is alluding to is some work I started (but need to finish) http://ghc.haskell.org/trac/ghc/ticket/7883 is the relevant ticket, and I'll need to sort out doing the same on the native code gen too
there ARE no write barrier primops, they're baked into the CAS machinery in ghc's rts
On Fri, Jul 19, 2013 at 1:02 PM, Ryan Newton
wrote: > Yes, I'd absolutely rather not suffer C call overhead for these > functions (or the CAS functions). But isn't that how it's done currently > for the casMutVar# primop? > > > https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts... > > To avoid the overhead, is it necessary to make each primop in-line > rather than out-of-line, or just to get rid of the "ccall"? > > Another reason it would be good to package these with GHC is that > I'm having trouble building robust libraries of foreign primops that work > under all "ways" (e.g. GHCI). For example, this bug: > > https://github.com/rrnewton/haskell-lockfree-queue/issues/10 > > If I write .cmm code that depends on RTS functionality like > stg_MUT_VAR_CLEAN_info, then it seems to work fine when in compiled mode > (with/without threading, profiling), but I get link errors from GHCI where > these symbols aren't defined. > > I've got a draft of the relevant primops here: > > > https://github.com/rrnewton/haskell-lockfree-queue/blob/master/AtomicPrimops... > > Which includes: > > - variants of CAS for MutableArray# and MutableByteArray# > - fetch-and-add for MutableByteArray# > > Also, there are some tweaks to support the new "ticketed" interface > for safer CAS: > > > http://hackage.haskell.org/packages/archive/atomic-primops/0.3/doc/html/Data... > > I started adding some of these primops to GHC proper (still as > out-of-line), but not all of them. I had gone with the foreign primop > route instead... > > https://github.com/rrnewton/ghc/commits/master > > -Ryan > > P.S. Where is the write barrier primop? I don't see it listed in > prelude/primops.txt... > > > > > > On Fri, Jul 19, 2013 at 11:41 AM, Carter Schonwald < > carter.schonwald@gmail.com> wrote: > >> I guess I should find the time to finish the CAS primop work I >> volunteered to do then. Ill look into in a few days. >> >> >> On Friday, July 19, 2013, Simon Marlow wrote: >> >>> On 18/07/13 14:17, Ryan Newton wrote: >>> >>>> The "atomic-primops" library depends on symbols such as >>>> store_load_barrier and "cas", which are defined in SMP.h. Thus >>>> the >>>> result is that if the program is linked WITHOUT "-threaded", the >>>> user >>>> gets a linker error about undefined symbols. >>>> >>>> The specific place it's used is in the 'foreign "C"' bits of this >>>> .cmm code: >>>> >>>> https://github.com/rrnewton/**haskell-lockfree-queue/blob/** >>>> 87e63b21b2a6c375e93c30b98c28c1**d04f88781c/AtomicPrimops/** >>>> cbits/primops.cmmhttps://github.com/rrnewton/haskell-lockfree-queue/blob/87e63b21b2a6c375e93c... >>>> >>>> I'm trying to explore hacks that will enable me to pull in those >>>> functions during compile time, without duplicating a whole bunch >>>> of code >>>> from the RTS. But it's a fragile business. >>>> >>>> It seems to me that some of these routines have general utility. >>>> In >>>> future versions of GHC, could we consider linking in those >>>> routines >>>> irrespective of "-threaded"? >>>> >>> >>> We should make the non-THREADED versions EXTERN_INLINE too, so >>> that there will be (empty) functions to call in rts/Inlines.c. Want to >>> submit a patch? >>> >>> A better solution would be to make them into primops. You don't >>> really want to be calling out to a C function to implement a memory >>> barrier. We have this for write_barrier(), but none of the others so far. >>> Of couse that's a larger change. 
>>> >>> Cheers, >>> Simon >>> >>> >>> >>> ______________________________**_________________ >>> ghc-devs mailing list >>> ghc-devs@haskell.org >>> http://www.haskell.org/**mailman/listinfo/ghc-devshttp://www.haskell.org/mailman/listinfo/ghc-devs >>> >> >

ok, could you add those comments (about additional operations to consider) to the ticket?

relatedly: if we want these atomic ops to use the sequential analogues when we're not using the threaded runtime system, does that mean we need to have a symbol / constant variable exposed in the RTS we link in, so that the inline code branches on a link-time constant value / symbol (something like "isThreadedRTS :: Bool") or some sort of analogue thereof?

one nice thing about doing that is that if at some point link-time optimization is added, the branch would go away! On the other hand, it could be argued that the cost of the call to the CAS primops in their current form isn't that much more expensive than such a branch.

I should add that question to the ticket, but it's worth hashing out first.

thoughts? I'm probably overlooking some parts of this too

-Carter
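Worth noting: the RTS already exposes one link-time constant of exactly this kind, rtsSupportsBoundThreads, which is True precisely when the program was linked with -threaded. The sketch below only illustrates the branch-on-a-link-time-constant idea at the Haskell level (the function name is mine); whether an inline primop could branch this cheaply is the open question here:

    -- Illustrative only: fall back to a plain update when the RTS is
    -- single-threaded, use the atomic path otherwise.
    module ThreadedCheck (bumpCounter) where

    import Control.Concurrent (rtsSupportsBoundThreads)
    import Data.IORef

    bumpCounter :: IORef Int -> IO ()
    bumpCounter r
      | rtsSupportsBoundThreads = atomicModifyIORef' r (\n -> (n + 1, ()))
      | otherwise               = modifyIORef' r (+ 1)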

On Sun, Jul 21, 2013 at 3:32 AM, Carter Schonwald < carter.schonwald@gmail.com> wrote:
ok, could you add those comments (about additional operations to consider) to the ticket?
Sure. Just did that.
relatedly: if we want these atomic ops to use the sequential analogues when we're not using the threaded run time system, does that mean we need to have a symbol / constant variable exposed in the RTS we link in, so that the inline code branches on a linktime constant value / symbol (something like "isThreadedRTS:: Bool", ) or some sort of analogue thereof?
I think it will take some care to mimic the semantics perfectly. Why not just leave the real atomic ops even in non-threaded mode, at least at first? Later we can optimize it if we find that people are using concurrent data structures heavily in non-threaded mode ;-).
one nice thing about doing such, is that if at some point link time optimization is added, the branch would go away! On the other hand, it could be argued that the cost of the call to the CAS primops in their current form isn't that much more expensive than such a branch.
Indeed, I'm much more concerned about performance in the threaded case and making sure they're correct.

Just to keep you all up to date... I'm adding the primops in question and
validating the individual commits before putting them here:
https://github.com/rrnewton/ghc/commits/atomicPrimOps
The basic idea for using these extensions is:
 - the atomic-primops library will work in 7.6 or 7.7+. It will use ifdefs to decide whether to use its own primops or the GHC-builtin ones (see the sketch below)
 - future versions will simply get faster, as Carter replaces out-of-line primops that *also* use C calls with inline primops / LLVM equivalents
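A sketch of what that ifdef selection could look like (the module names are illustrative stand-ins, not the library's real layout; 707 corresponds to the 7.7 development series):

    {-# LANGUAGE CPP, MagicHash #-}
    -- Sketch of the compatibility scheme described above.
    module Data.Atomics.Compat (casMutVar#) where

    #if __GLASGOW_HASKELL__ >= 707
    -- New enough GHC: use the built-in primop.
    import GHC.Exts (casMutVar#)
    #else
    -- 7.4/7.6: fall back to the library's own .cmm-backed foreign primop
    -- (hypothetical module name).
    import Data.Atomics.Internal (casMutVar#)
    #endif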
Shall I stick a patch on a ticket, or will someone volunteer to pull? What's the protocol for requesting commit access anyway? (By the way, can someone share the reason that pull requests to the github ghc mirror are such a no-no? They seem no worse than a patch in an email, which is what the big warning sign at https://github.com/ghc/ghc recommends.)
Best,
-Ryan
P.S. FYI, I'm periodically getting these:
0 caused framework failures
0 unexpected passes
1 unexpected failures
Unexpected failures:
perf/compiler T1969 [stat not good enough] (normal)
Can that just be because of running on a loaded machine? How narrow are
these windows?

awesome! (this will also make my work easier)
ryan: github is down, could you put the branch on bitbucket or some such so I can have a look-see / clone locally?
thanks!
-Carter

nvm, github's back up, i'll have a look! :)

took a quick look, awesome! this will make it MUCH MUCH easier for me to
do my work. Thank you very much.
off hand, to prevent patch confusion, it naively seems like the nicest way to post the patches to trac is to post a *new ticket to trac* that links to the main one, plus add a comment on the main ticket linking to the new ticket for the C/cmm based versions of the primops.
At least, given that there's likely going to be a bit of discussion on your ticket, perhaps it's better to factor that into a related ticket to make it easier to keep track of?
(i'm also possibly over thinking this enormously, so i could be way off
base)

Well for new features like this (rather than bug fix), I'd prefer if I could get commit access and at least push it to a branch. I can create a new trac ticket too.

huh, did I suggest viewing it as a bug fix? my mistake! (a branch would make sense)

Do you have a branch already lined up for your LLVM-atomics work?

Nope.

Ohhh. I meant task, not branch, in the email you were replying to. Was a bit ill this past week. Sorry for my confusing remark.

also: HOLY CRAP THAT'S AWESOME performance :)
(i'll be wanting to do some cache-aware parallel work stealing in the near future, so this is really really handy for me)
On Sat, Jul 20, 2013 at 2:18 AM, Ryan Newton
Hi Carter,
Yes, SMP.h is where I've copy pasted the duplicate functionality from (since I can't presently rely on linking the symbols).
Your proposal for the LLVM backend sounds **great**. But it also is going to provide additional constraints for getting "atomic-primops" right.
The goal of atomic-primops is to be a stable Haskell-level interface into the relevant CAS and fetch-and-add stuff. The reason this is important is that one has to be very careful to defeat the GHC optimizer in all the relevant places and make pointer equality a reliable property. I would like to get atomic-primops to work reliably in 7.4, 7.6 [and 7.8] and have more "native" support in future GHC releases, where maybe the foreign primops would become unecessary. (They are a pain and have already exposed one blocking cabal bug, fixed in upcoming 1.17.)
A couple additional suggestions for the proposal in ticket #7883:
- we should use more unique symbols than "cas", especially for this rewriting trick. How about "ghc_cas" or something? - it would be great to get at least fetch-and-add in addition to CAS and barriers - if we reliably provide this set of special symbols, libraries like atomic-primops may use them in the .cmm and benefit from the CMM->LLVM substitutions - if we include all the primops I need in GHC proper the previous bullet will stop applying ;-)
Cheers, -Ryan
P.S. Just as a bit of motivation, here are some recent performance numbers. We often wonder about how close our "pure values in a box" approach comes to efficient lock-free structures. Well here are some numbers about using a proper unboxed counter in the Haskell heap, vs using an IORef Int and atomicModifyIORef': Up to 100X performance difference on some platforms for microbenchmarks that hammer a counter:
https://github.com/rrnewton/haskell-lockfree-queue/blob/fb12d1121690553e4f73...
And here are the performance and scaling advantages of using ChaseLev (based on atomic-primops), over a traditional pure-in-a-box structure (IORef Data.Seq). The following are timings of ChaseLev/traditional respectively on a 32 core westmere:
fib(42) 1 threads: 21s fib(42) 2 threads: 10.1s fib(42) 4 threads: 5.2s (100%prod) fib(42) 8 threads: 2.7s - 3.2s (100%prod) fib(42) 16 threads: 1.28s fib(42) 24 threads: 1.85s fib(42) 32 threads: 4.8s (high variance)
(hive) fib(42) 1 threads: 41.8s (95% prod) (hive) fib(42) 2 threads: 25.2s (66% prod) (hive) fib(42) 4 threads: 14.6s (27% prod, 135GB alloc) (hive) fib(42) 8 threads: 17.1s (26% prod) (hive) fib(42) 16 threads: 16.3s (13% prod) (hive) fib(42) 24 threads: 21.2s (30% prod) (hive) fib(42) 32 threads: 29.3s (33% prod)
And that is WITH the inefficiency of doing a "ccall" on every single atomic operation.
Notes on parfib performance are here:
https://github.com/rrnewton/haskell-lockfree-queue/blob/d6d3e9eda2a487a5f055...
On Fri, Jul 19, 2013 at 5:05 PM, Carter Schonwald < carter.schonwald@gmail.com> wrote:
ryan, the relevant machinery on the C side is here, see ./includes/stg/SMP.h : https://github.com/ghc/ghc/blob/7cc8a3cc5c2970009b83844ff9cc4e27913b8559/inc...
(unless i'm missing something)
On Fri, Jul 19, 2013 at 4:53 PM, Carter Schonwald < carter.schonwald@gmail.com> wrote:
Ryan, if you look at line 270, you'll see the CAS is a C call https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts...
What Simon is alluding to is some work I started (but need to finish) http://ghc.haskell.org/trac/ghc/ticket/7883 is the relevant ticket, and I'll need to sort out doing the same on the native code gen too
there ARE no write barrier primops, they're baked into the CAS machinery in ghc's rts
On Fri, Jul 19, 2013 at 1:02 PM, Ryan Newton
wrote: Yes, I'd absolutely rather not suffer C call overhead for these functions (or the CAS functions). But isn't that how it's done currently for the casMutVar# primop?
https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts...
To avoid the overhead, is it necessary to make each primop in-line rather than out-of-line, or just to get rid of the "ccall"?
Another reason it would be good to package these with GHC is that I'm having trouble building robust libraries of foreign primops that work under all "ways" (e.g. GHCI). For example, this bug:
https://github.com/rrnewton/haskell-lockfree-queue/issues/10
If I write .cmm code that depends on RTS functionality like stg_MUT_VAR_CLEAN_info, then it seems to work fine when in compiled mode (with/without threading, profiling), but I get link errors from GHCI where these symbols aren't defined.
I've got a draft of the relevant primops here:
https://github.com/rrnewton/haskell-lockfree-queue/blob/master/AtomicPrimops...
Which includes:
- variants of CAS for MutableArray# and MutableByteArray# - fetch-and-add for MutableByteArray#
Also, there are some tweaks to support the new "ticketed" interface for safer CAS:
http://hackage.haskell.org/packages/archive/atomic-primops/0.3/doc/html/Data...
I started adding some of these primops to GHC proper (still as out-of-line), but not all of them. I had gone with the foreign primop route instead...
https://github.com/rrnewton/ghc/commits/master
-Ryan
P.S. Where is the write barrier primop? I don't see it listed in prelude/primops.txt...
On Fri, Jul 19, 2013 at 11:41 AM, Carter Schonwald < carter.schonwald@gmail.com> wrote:
I guess I should find the time to finish the CAS primop work I volunteered to do then. Ill look into in a few days.
On Friday, July 19, 2013, Simon Marlow wrote:
On 18/07/13 14:17, Ryan Newton wrote:
> The "atomic-primops" library depends on symbols such as > store_load_barrier and "cas", which are defined in SMP.h. Thus the > result is that if the program is linked WITHOUT "-threaded", the user > gets a linker error about undefined symbols. > > The specific place it's used is in the 'foreign "C"' bits of this > .cmm code: > > https://github.com/rrnewton/**haskell-lockfree-queue/blob/** > 87e63b21b2a6c375e93c30b98c28c1**d04f88781c/AtomicPrimops/** > cbits/primops.cmmhttps://github.com/rrnewton/haskell-lockfree-queue/blob/87e63b21b2a6c375e93c... > > I'm trying to explore hacks that will enable me to pull in those > functions during compile time, without duplicating a whole bunch of > code > from the RTS. But it's a fragile business. > > It seems to me that some of these routines have general utility. In > future versions of GHC, could we consider linking in those routines > irrespective of "-threaded"? >
We should make the non-THREADED versions EXTERN_INLINE too, so that there will be (empty) functions to call in rts/Inlines.c. Want to submit a patch?
A better solution would be to make them into primops. You don't really want to be calling out to a C function to implement a memory barrier. We have this for write_barrier(), but none of the others so far. Of course that's a larger change.
Cheers, Simon

On 19/07/13 18:02, Ryan Newton wrote:
Yes, I'd absolutely rather not suffer C call overhead for these functions (or the CAS functions). But isn't that how it's done currently for the casMutVar# primop?
https://github.com/ghc/ghc/blob/95e6865ecf06b2bd80fa737e4fa4a24beaae25c5/rts...
To avoid the overhead, is it necessary to make each primop in-line rather than out-of-line, or just to get rid of the "ccall"?
The best thing would be to get rid of the overhead, but the native code generators need to be taught how to generate code for cas.
Another reason it would be good to package these with GHC is that I'm having trouble building robust libraries of foreign primops that work under all "ways" (e.g. GHCI). For example, this bug:
https://github.com/rrnewton/haskell-lockfree-queue/issues/10
If I write .cmm code that depends on RTS functionality like stg_MUT_VAR_CLEAN_info, then it seems to work fine when in compiled mode (with/without threading, profiling), but I get link errors from GHCI where these symbols aren't defined.
That's a bug, I'll fix it.
I've got a draft of the relevant primops here:
https://github.com/rrnewton/haskell-lockfree-queue/blob/master/AtomicPrimops...
Which includes:
* variants of CAS for MutableArray# and MutableByteArray#
* fetch-and-add for MutableByteArray#
Also, there are some tweaks to support the new "ticketed" interface for safer CAS:
http://hackage.haskell.org/packages/archive/atomic-primops/0.3/doc/html/Data...
I started adding some of these primops to GHC proper (still as out-of-line), but not all of them. I had gone with the foreign primop route instead...
Ok, will you make a ticket and attach the patches when you're ready?
-Ryan
P.S. Where is the write barrier primop? I don't see it listed in prelude/primops.txt.
It's not a primop. Perhaps it should be. It's a MachOp in Cmm; you write it as prim write_barrier;

Cheers, Simon
On Fri, Jul 19, 2013 at 11:41 AM, Carter Schonwald <carter.schonwald@gmail.com> wrote:
I guess I should find the time to finish the CAS primop work I volunteered to do then. I'll look into it in a few days.
On Friday, July 19, 2013, Simon Marlow wrote:
On 18/07/13 14:17, Ryan Newton wrote:
The "atomic-primops" library depends on symbols such as store_load_barrier and "cas", which are defined in SMP.h. Thus the result is that if the program is linked WITHOUT "-threaded", the user gets a linker error about undefined symbols.
The specific place it's used is in the 'foreign "C"' bits of this .cmm code:
https://github.com/rrnewton/haskell-lockfree-queue/blob/87e63b21b2a6c375e93c30b98c28c1d04f88781c/AtomicPrimops/cbits/primops.cmm
I'm trying to explore hacks that will enable me to pull in those functions during compile time, without duplicating a whole bunch of code from the RTS. But it's a fragile business.
It seems to me that some of these routines have general utility. In future versions of GHC, could we consider linking in those routines irrespective of "-threaded"?
We should make the non-THREADED versions EXTERN_INLINE too, so that there will be (empty) functions to call in rts/Inlines.c. Want to submit a patch?
A better solution would be to make them into primops. You don't really want to be calling out to a C function to implement a memory barrier. We have this for write_barrier(), but none of the others so far. Of course that's a larger change.
Cheers, Simon

Short version: Patch for the barriers here: http://ghc.haskell.org/trac/ghc/ticket/8077

Long version:

I started adding some of these primops to GHC proper (still as out-of-line), but not all of them. I had gone with the foreign primop route instead...

Ok, will you make a ticket and attach the patches when you're ready?
Ah, so the feeling is "foreign primops in a hackage library isn't really ideal and they should eventually come to rest in GHC"? I think I'm coming to concur with that.

Honestly, my biggest barrier as a sometimes-almost-GHC-contributor is that when I haven't touched it in a while, the dance to get GHC validating can take some doing. For example, I downloaded fresh copies just now and it failed on Mac OS and RHEL 6 but worked on Ubuntu 12.04. Anyway, I just added a couple of entries to the build troubleshooting page (http://ghc.haskell.org/trac/ghc/wiki/Building/Troubleshooting), and was able to validate with and without this patch (for the barrier KEEP_INLINES issue): https://github.com/rrnewton/ghc/commit/5cfb51303192b6722276a7848f265cfcbec56...

And I attached it to the ticket here: http://ghc.haskell.org/trac/ghc/ticket/8077

Since I'm in a good state now, I'll try to also get some validated out-of-line atomic primops in there soon for Carter to port to inline primops at his leisure.

On the topic of easy validation, I see it was discussed several years ago (http://webcache.googleusercontent.com/search?q=cache:gmaMH1TiUX0J:www.haskell.org/pipermail/glasgow-haskell-users/2009-June/017366.html+&cd=1&hl=en&ct=clnk&gl=us) that a GHC development VM might be useful. It doesn't look like that happened. But isn't it even easier nowadays? It looks like Amazon lets people just provide community VMs (https://www.fpcomplete.com/page/haskell-eval-vm) for others to use. If I validate on there maybe I can find the share/publish button...

Best,
-Ryan

P.S. For general Haskell development (not GHC development) it looks like FP Complete provides a VM: https://www.fpcomplete.com/page/haskell-eval-vm

Hi Simon,

That sounds like a good solution and I'll attempt a patch. I think the fix is only three lines. That is, replace these three lines with EXTERN_INLINE C functions:

#define write_barrier() /* nothing */
#define store_load_barrier() /* nothing */
#define load_load_barrier() /* nothing */

That would fix the -threaded/unthreaded disparity. But I still don't see how to access this stuff properly from foreign-primops in a library such that GHCI doesn't barf when trying to load the library....

-Ryan

On 20/07/13 07:28, Ryan Newton wrote:
Hi Simon,
That sounds like a good solution and I'll attempt a patch. I think the fix is only three lines. That is, replace these three lines with EXTERN_INLINE C functions:
#define write_barrier() /* nothing */
#define store_load_barrier() /* nothing */
#define load_load_barrier() /* nothing */
I think that should do it, yes.
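Something along these lines should do it (untested sketch; this would go in the !THREADED_RTS branch of includes/stg/SMP.h, so that rts/Inlines.c then emits real symbols for the non-threaded RTS too):

    /* no-op barriers for the non-threaded RTS: EXTERN_INLINE functions
       instead of empty macros, so a linkable symbol always exists */
    EXTERN_INLINE void write_barrier(void);
    EXTERN_INLINE void store_load_barrier(void);
    EXTERN_INLINE void load_load_barrier(void);

    EXTERN_INLINE void write_barrier(void)      { /* nothing */ }
    EXTERN_INLINE void store_load_barrier(void) { /* nothing */ }
    EXTERN_INLINE void load_load_barrier(void)  { /* nothing */ }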
That would fix the -threaded/unthreaded disparity. But I still don't see how to access this stuff properly from foreign-primops in a library such that GHCI doesn't barf when trying to load the library....
If you're referring to the problem with the missing stg_MUT_VAR_CLEAN_info symbol, I'll push a fix for that soon. Or is there something else?

Cheers, Simon

That would fix the -threaded/unthreaded disparity. But I still don't see how to access this stuff properly from foreign-primops in a library such that GHCI doesn't barf when trying to load the library....
If you're referring to the problem with the missing stg_MUT_VAR_CLEAN_info symbol, I'll push a fix for that soon. Or is there something else?
Ah, yes, I think that will be addressed by your fix. It's good to hear that it is considered an ok thing to depend on RTS symbols under all "ways".
participants (4):
- Carter Schonwald
- Edward Z. Yang
- Ryan Newton
- Simon Marlow