Design discussion for atomic primops to land in 7.8

There's a ticket that describes the design here:
http://ghc.haskell.org/trac/ghc/ticket/8157#comment:1
It is a fairly simple extension of the casMutVar# primop that has been in GHC since 7.2. The implementation is currently on the `atomics` branch.
Feel free to add your views either here or in that ticket's comments.
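For concreteness, here is a minimal sketch of the usual retry loop one builds on top of casMutVar#. It is illustrative only (not code from the ticket or the atomics branch), it assumes GHC 7.8-era primop types, and it assumes the convention that casMutVar# returns 0# on success, which is worth double-checking against the GHC.Prim documentation:

    {-# LANGUAGE MagicHash, UnboxedTuples #-}
    -- Illustrative only: a plain CAS retry loop on top of casMutVar#.
    module CasDemo (Counter, newCounter, incrCounter) where

    import GHC.Exts (MutVar#, RealWorld, casMutVar#, isTrue#, newMutVar#,
                     readMutVar#, (==#))
    import GHC.IO (IO (IO))

    -- A counter backed directly by a MutVar# (roughly what an IORef wraps).
    data Counter = Counter (MutVar# RealWorld Int)

    newCounter :: Int -> IO Counter
    newCounter n = IO $ \s0 ->
      case newMutVar# n s0 of
        (# s1, mv #) -> (# s1, Counter mv #)

    -- Read the current value, attempt a CAS against the exact closure we
    -- just read (casMutVar# compares by pointer equality), and retry on
    -- contention.  Assumed: the Int# result is 0# on success, 1# on failure.
    incrCounter :: Counter -> IO Int
    incrCounter (Counter mv) = IO loop
      where
        loop s0 =
          case readMutVar# mv s0 of
            (# s1, old #) ->
              case casMutVar# mv old (old + 1) s1 of
                (# s2, failed, _seen #)
                  | isTrue# (failed ==# 0#) -> (# s2, old #)  -- swapped in old+1
                  | otherwise               -> loop s2        -- lost the race; retry

A real library would guard the read with an abstract ticket (as the atomic-primops package does) so the optimiser cannot interfere with the pointer identity being compared.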
One example of an alternative design would be Carter's proposal to expose something closer to the full LLVM concurrency ops (http://llvm.org/docs/Atomics.html):
Carter Schonwald wrote:
I'm kinda thinking that we should do the analogue of exposing all the different memory-model-level choices (because it's not that hard to add), and when the person building it has an old version of GCC, have it fall back to the legacy atomic operations? This also gives a nice path to how we'd upgrade to the inline asm approach.
These LLVM ops include many parameterized configurations of loads, stores, cmpxchg, atomicrmw, and barriers. In fact, LLVM implements much more than is natively supported by most hardware, but it provides a uniform abstraction.

My original thought was that any abstraction of that kind would be built and maintained as a Haskell library, and only the most rudimentary operations (those required to get access to processor features) would be exposed as primops. Let's call this the "small" set of concurrency ops.

If we want the "big" set, I think we're doomed to *reproduce* the logic that maps LLVM concurrency abstractions onto machine ops, irrespective of whether those abstractions are implemented as Haskell functions or as primops:

- If the former, then the Haskell library must map the full set of ops onto the reduced small set (just as LLVM does internally).
- If we instead have a large set of LLVM-isomorphic primops, then supporting the same primops *in the native code backend* will, again, require reimplementing all configurations of all operations.

Unless... we want to make concurrency ops something that requires the LLVM backend?

Right now there is no *performance* disadvantage to supporting a smaller rather than a larger set of concurrency ops (LLVM has to emulate these things anyway, or "round up" to more expensive ops). The scenario where it would be good to target ALL of LLVM's interface is if processors and LLVM improve in the future and we automatically get the benefit of better hardware support for some op on some arch.

I'm a bit skeptical of that proposition itself, however. I personally don't really like a world where we program with "virtual operations" that don't really exist (and thus can't properly be *tested* against). Absent formal verification, it seems hard to get this code right anyway; errors would be undetectable on existing architectures.

-Ryan
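To make the first bullet above concrete, here is a minimal sketch of the kind of Haskell-level shim that would map a "big", ordering-parameterized API down onto a "small" sequentially consistent one. The MemoryOrder type and the fetchAdd/fetchAddSeqCst names are invented for this sketch; only Data.IORef's atomicModifyIORef' is a real API, standing in for a full-barrier primop:

    -- Hypothetical library layer; none of these names are GHC primops.
    module Data.Atomics.Sketch (MemoryOrder (..), fetchAdd) where

    import Data.IORef (IORef, atomicModifyIORef')

    -- The LLVM-style menu of orderings a "big set" would have to expose.
    data MemoryOrder
      = Unordered | Monotonic | Acquire | Release | AcqRel | SeqCst
      deriving (Eq, Show)

    -- A "small set" operation: a full-barrier fetch-and-add, standing in
    -- for a single sequentially consistent primop.
    fetchAddSeqCst :: IORef Int -> Int -> IO Int
    fetchAddSeqCst ref n = atomicModifyIORef' ref (\old -> (old + n, old))

    -- The "big set" entry point: every ordering is rounded up to the
    -- strongest operation actually available, so the weaker orderings buy
    -- nothing until some backend can exploit them.
    fetchAdd :: MemoryOrder -> IORef Int -> Int -> IO Int
    fetchAdd _order ref n = fetchAddSeqCst ref n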

Hey Ryan,

You raise some very good points. The most important point you raise (I think) is this: it would be very, very nice to (where feasible) add analogous machinery to the native code gen, so that it's not falling behind the LLVM one quite as much.

At least for these atomic operations (unlike the SIMD ones), it may be worth investigating what's needed to add those to the native code gen as well. (Adding SIMD support to the native codegen would be nice too, but probably *substantially* more work.)

Well, what's the long term plan? Is the LLVM backend going to become the only backend at some point?

On 23/08/2013, at 3:52 AM, Ryan Newton wrote:
Well, what's the long term plan? Is the LLVM backend going to become the only backend at some point?
I wouldn't argue against ditching the NCG entirely. It's hard to justify fixing NCG performance problems when fixing them won't make the NCG faster than LLVM, and everyone uses LLVM anyway.

We're going to need more and more SIMD support when processors supporting the Larrabee New Instructions (LRBni) appear on people's desks. At that time there still won't be a good enough reason to implement those instructions in the NCG.

Ben.

2013/8/26 Ben Lippmeier wrote:
We're going to need more and more SIMD support when processors supporting the Larrabee New Instructions (LRBni) appear on people's desks. At that time there still won't be a good enough reason to implement those instructions in the NCG.
I hope to implement SIMD support for the native code gen soon. It's not a huge task and having feature parity between LLVM and NCG would be good.

Niklas

Niklas Larsson wrote:
I hope to implement SIMD support for the native code gen soon. It's not a huge task and having feature parity between LLVM and NCG would be good.
Will you also update the SIMD support, register allocators, and calling conventions in 2015 when AVX-512 lands on the desktop? On all supported platforms? What about support for the x86 vcompress and vexpand instructions with mask registers? What about when someone finally asks for packed conversions between 16xWord8s and 16xFloat32s, where you need to split the result into four separate registers? LLVM does that automatically.

I've been down this path before. In 2007 I implemented a separate graph-colouring register allocator in the NCG, supposedly to improve GHC's numeric performance, but the LLVM backend subsumed that work, and now having two separate register allocators is more of a maintenance burden than a help to anyone. At the time, LLVM was just becoming well known, so it wasn't obvious that implementing a new register allocator was largely a redundant piece of work -- but I think it's clear now. I was happy to work on the project at the time, and I learned a lot from it, but when starting new projects now I also try to imagine the system that will replace the one I'm dreaming of.

Of course, you should do what interests you -- I'm just pointing out a strategic consideration.

Ben

On 26/08/13 08:17, Ben Lippmeier wrote:
Of course, you should do what interests you -- I'm just pointing out a strategic consideration.

The existence of LLVM is definitely an argument not to put any more effort into backend optimisation in GHC, at least for those optimisations that LLVM can already do. But as for whether the NCG is needed at all -- there are a few ways that the LLVM backend needs to be improved before it can be considered to be a complete replacement for the NCG:

1. Compilation speed. LLVM approximately doubles compilation time. Avoiding going via the textual intermediate syntax would probably help here.

2. Shared library support (#4210, #5786). It works (or worked?) on a couple of platforms. But even on those platforms it generated worse code than the NCG, due to using dynamic references for *all* symbols, whereas the NCG knows which symbols live in a separate package and need to use dynamic references.

3. Some low-level optimisation problems (#4308, #5567). The LLVM backend generates bad code for certain critical bits of the runtime, perhaps due to lack of good aliasing information. This hasn't been revisited in the light of the new codegen, so perhaps it's better now.

Someone should benchmark the LLVM backend against the NCG with the new codegen in GHC 7.8. It's possible that the new codegen is getting a slight boost because it doesn't have to split up proc points, so it can do better code generation for let-no-escapes. (It's also possible that LLVM is being penalised a bit for the same reason -- I spent more time peering at NCG-generated code than LLVM-generated code.)

These are some good places to start if you want to see GHC drop the NCG.

Cheers,
Simon
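As a concrete starting point for that benchmarking, a small numeric kernel such as the one below (a made-up example, not taken from this thread) can be compiled once per backend with the standard -fasm and -fllvm flags and compared both on run time and on the -ddump-asm output:

    -- SumLoop.hs: a tight arithmetic loop of the kind where backend code
    -- quality is visible.  Compile and time with each backend, e.g.:
    --   ghc -O2 -fasm  SumLoop.hs -o sum-ncg   && time ./sum-ncg
    --   ghc -O2 -fllvm SumLoop.hs -o sum-llvm  && time ./sum-llvm
    module Main (main) where

    import Data.List (foldl')

    -- Strict left fold over an enumeration; with -O2 this typically fuses
    -- into an unboxed counting loop, so the remaining differences come
    -- from instruction selection and register allocation in the backend.
    sumSquares :: Int -> Int
    sumSquares n = foldl' (\acc i -> acc + i * i) 0 [1 .. n]

    main :: IO ()
    main = print (sumSquares 100000000)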

To do this, IMO we'd also really have to start shipping our own copy of LLVM. The current situation (use whatever is configured or in $PATH) won't really be feasible later on.

On platforms like ARM, where there is no NCG, the mismatches can become super painful, and they make depending on certain features of the IR or compiler toolchain (like an advanced, ISA-aware vectorizer in LLVM 3.3+) way more difficult, aside from being a management nightmare.

Fixing that does require taking a hit on things like build times, though. Or we could use binary releases, but we may occasionally want to tweak and/or fix things. If we ship our own LLVM, for example, it's reasonable to assume that sometime in the future we'll want to change the ABI during a release.

This does bring other benefits. Max Bolingbroke had an old alias analysis plugin for LLVM that made a noticeable improvement on certain kinds of programs, but shipping it against an arbitrary LLVM is infeasible. Stuff like this could now be possible too.

In a way, I think there's some merit to having a simple, integrated code generator that does the correct thing, with a high-performance option as we have now. LLVM is a huge project, and there's definitely some part of me that thinks this may not lower our complexity budget as much as we think, only shift parts of it around ('second rate' platforms like PPC/ARM expose way more bugs in my experience, and tracking them across such a massive surface area can be quite difficult). It's very stable and well tested, but an unequivocal dependency on hundreds of thousands of lines of deeply complex code is a big question no matter what.

But the current NCG isn't that 'simple correct thing' either. I think it's easily one of the least understood parts of the compiler, with a long history; it's rarely refactored or modified (very unlike other parts), and it's maintained only as necessary. That doesn't bode well for its future in any case.
-- Regards, Austin - PGP: 4096R/0x91384671

I've collected the main points of this discussion on the wiki:
http://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/Backends/LLVM/Repla...

Ben.

On Sun, Aug 25, 2013 at 11:01 PM, Ben Lippmeier wrote:
> I wouldn't argue against ditching the NCG entirely. It's hard to justify fixing NCG performance problems when fixing them won't make the NCG faster than LLVM, and everyone uses LLVM anyway.
This is not true. I don't believe I've ever seen the LLVM backend compile more quickly than the NCG; it usually takes significantly longer, and for at least some (most?) projects it produces worse output. I don't have anything against the LLVM backend in principle [1], but at present it's not as good as the NCG for us.

> We're going to need more and more SIMD support when processors supporting the Larrabee New Instructions (LRBni) appear on people's desks. At that time there still won't be a good enough reason to implement those instructions in the NCG.
How about that the NCG is better than LLVM? ;)

In all seriousness, I'm quite sympathetic to the desire to support only one backend, and LLVM can offer a lot (SIMD fallbacks, target architectures, etc.). But at present, in my experience the LLVM backend doesn't really live up to what I've seen claimed for it. Given that, I think it's a bit premature to talk of dropping the NCG.

My $0.02,
John

[1] OK, I do have one issue with LLVM. It's always struck me as very brittle, with a lot of breakages between versions. Given that I just tried ghc -fllvm with LLVM 3.3 and the compiler bailed out due to a bad object file, my impression of brittleness doesn't seem likely to change any time soon. Given that LLVM releases major versions predictably often, I don't know that I want GHC devs spending time chasing after them. But in principle it seems the right thing to do.

participants (7):
- Austin Seipp
- Ben Lippmeier
- Carter Schonwald
- John Lato
- Niklas Larsson
- Ryan Newton
- Simon Marlow