simd branch ready for review

Hi Simon,

I've pushed my simd branch to darcs.haskell.org. Everything has been rebased against HEAD. Simon PJ and I looked over the changes together already, but I wanted to give you (and everyone on ghc-devs) the opportunity to look things over before I merge to HEAD. Simon PJ and I came up with a few questions/notes for you, but hopefully nothing that should delay a merge.

* Win32 issues

Modern 32-bit x86 *NIX systems align the stack to 16 bytes, but Win32 aligns only to 4 bytes. LLVM does not assume 16-byte stack alignment. Instead, on platforms where 16-byte stack alignment is not guaranteed, it 1) always outputs a function prologue that 2) aligns the stack to a 16-byte boundary with an "and" instruction, and it also 3) disables tail calls. Because LLVM aligns the stack for a function that has SSE register spills, it also generates movaps instructions (aligned SSE moves) for the spills.

This makes SSE support on Win32 difficult, and in my opinion not worth worrying about.

The alternative is to 1) patch LLVM to disable the stack-alignment code so that we recover the ability to use tail calls and so that ebp isn't scribbled over by the prologue, and 2) patch the mangler to rewrite LLVM's movaps (move aligned) instructions to movups (move unaligned) instructions. I have these patches, but they are not included in the simd branch.

* How hard would it be to dump ArgRep for PrimRep? It looks straightforward. Is it worth doing?

* How hard would it be to track bit width in PrimRep? I recall chatting with you once about adding explicit support for, e.g., 8- and 16-bit Word/Int primops instead of relying on narrowing. Since SIMD vectors need to know the exact bit-width of their elements, I've had to create a PrimElemRep data type in compiler/types/TyCon.lhs, but I'd really like to be able to re-use PrimRep instead.

* If we replaced all old-style C-- code, could we get rid of the explicit STG registers completely? Simon PJ suggested that we use real machine registers directly, so, for example, GlobalReg's constructors would have FastString fields instead of Int fields.

* Could we add a CmmType field to GlobalReg's constructors? You'll see that I added a new XmmReg constructor to GlobalReg, but because I don't know the type of an XmmReg, I have to bitcast everywhere in the generated LLVM code because LLVM wants to know not just that a value is a 16-byte vector, but that it is, e.g., a 16-byte vector containing 2 64-bit doubles. Having a CmmType attached to a GlobalReg---or pairing a GlobalReg with a CmmType when assigning registers---would let me avoid all these casts.

Thanks!
Geoff

On 31/01/13 11:38, Geoffrey Mainland wrote:
I've pushed my simd branch to darcs.haskell.org. Everything has been rebased against HEAD. Simon PJ and I looked over the changes together already, but I wanted to give you (and everyone on ghc-devs) the opportunity to look things over before I merge to HEAD. Simon PJ and I came up with a few questions/notes for you, but hopefully nothing that should delay a merge.
I'm happy for these to go in - we've already discussed the design a few times, and you've incorporated changes we agreed before, so as far as I'm concerned it's all good. Go for it!
* Win32 issues
Modern 32-bit x86 *NIX systems align the stack to 16 bytes, but Win32 aligns only to 4 bytes. LLVM does not assume 16-byte stack alignment. Instead, on platforms where 16-byte stack alignment is not guaranteed, it 1) always outputs a function prologue that 2) aligns the stack to a 16-byte boundary with an "and" instruction, and it also 3) disables tail calls. Because LLVM aligns the stack for a function that has SSE register spills, it also generates movaps instructions (aligned SSE moves) for the spills.
I must be misunderstanding your use of "always" above, because that would imply that the LLVM backend doesn't work on Win32 at all. Maybe LLVM only aligns the stack when it needs to store SSE values?
This makes SSE support on Win32 difficult, and in my opinion not worth worrying about.
The alternative is to 1) patch LLVM to disable the stack-alignment code so that we recover the ability to use tail calls and so that ebp isn't scribbled over by the prologue, and 2) patch the mangler to rewrite LLVM's movaps (move aligned) instructions to movups (move unaligned) instructions. I have these patches, but they are not included in the simd branch.
I don't have an opinion here - maybe ask David T what he'd prefer.
* How hard would it be to dump ArgRep for PrimRep? It looks straightforward. Is it worth doing?
ArgRep makes fewer distinctions than PrimRep, in particular it collapses IntRep/WordRep/AddrRep into N and Int64Rep/Word64Rep into L. I doubt it would improve things to get rid of ArgRep. It's only used in a very few places, but those places would get more complicated if they had to use PrimRep instead, because instead of pattern-matching on N you would need a guard. I think ArgRep is ok, because it matches the different ways we pass arguments to functions.
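For reference, the collapse described here looks roughly like the following sketch (illustrative constructor sets only; the real definitions in GHC have more cases):

data PrimRep = VoidRep | PtrRep | IntRep | WordRep | AddrRep
             | Int64Rep | Word64Rep | FloatRep | DoubleRep

data ArgRep = P | N | L | F | D | V

toArgRep :: PrimRep -> ArgRep
toArgRep VoidRep   = V
toArgRep PtrRep    = P
toArgRep IntRep    = N
toArgRep WordRep   = N
toArgRep AddrRep   = N   -- Int/Word/Addr all collapse to N
toArgRep Int64Rep  = L
toArgRep Word64Rep = L   -- the 64-bit reps collapse to L
toArgRep FloatRep  = F
toArgRep DoubleRep = D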
* How hard would it be to track bit width in PrimRep? I recall chatting with you once about adding explicit support for, e.g., 8- and 16-bit Word/Int primops instead of relying on narrowing. Since SIMD vectors need to know the exact bit-width of their elements, I've had to create a PrimElemRep data type in compiler/types/TyCon.lhs, but I'd really like to be able to re-use PrimRep instead.
This is something we really should do, but it's a big job. Feel free to have a go in your spare time!
* If we replaced all old-style C-- code, could we get rid of the explicit STG registers completely? Simon PJ suggested that we use real machine registers directly, so, for example, GlobalReg's constructors would have FastString fields instead of Int fields.
It's difficult to get rid of *all* the old-style C--. In some of the places I kept explicit-stack code because I was being lazy, but some of it is really hard to write in new C-- (or at least is hard to write in new C-- that compiles to good code).

I don't think that renaming R1 to %rbx (etc.) achieves a lot. It would make things a bit more difficult for the LLVM backend, which has to reverse the mapping. You do need a platform-independent name for R1 in some places, like codeGen for example.

I have thought about whether you could remove R1 and co altogether (not just rename them to machine registers) by extending Cmm to include information about incoming parameters, e.g. for a function f(x,y,z), we generate

  f:
    x = R1  -- %rbx
    y = R2  -- %r14
    z = R3  -- %rsi
    ... body of f ...

and here's the tricky bit: we almost never want to move those assignments, because they generate no code. The register allocator remembers that x is in %rbx and moves on; x can be spilled and %rbx can be reused at any time. But if you had a misguided optimisation pass that sinks these assignments down into the code somewhere, *then* the code is worse, because now %rbx is live from the beginning of the function until its use, and we have fewer registers to play with.

With this in mind, it seems natural to represent the code as

  f(x = %rbx, y = %r14, z = %rsi):
    ... body of f ...

i.e. statically prevent the motion of the copy-in assignments by explicitly including them in the representation of a function. This is a cool idea, but then we have to do the same for return points. Instead of just plain labels, we have labels with register assignments. Furthermore, we sometimes like to jump directly to a return point from another code path (a join point), and load up the registers explicitly. This is where being able to write an assignment to R1 comes in handy. So I decided not to pursue this. It might still be a good idea, I'm not sure.
* Could we add a CmmType field to GlobalReg's constructors? You'll see that I added a new XmmReg constructor to GlobalReg, but because I don't know the type of an XmmReg, I have to bitcast everywhere in the generated LLVM code because LLVM wants to know not just that a value is a 16-byte vector, but that it is, e.g., a 16-byte vector containing 2 64-bit doubles. Having a CmmType attached to a GlobalReg---or pairing a GlobalReg with a CmmType when assigning registers---would let me avoid all these casts.
We already have a function

  globalRegType :: DynFlags -> GlobalReg -> CmmType

so I see that you're guessing in the case of XmmReg. Why not just add the necessary information to XmmReg so that you don't have to guess in globalRegType?

Cheers,
Simon
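A rough sketch of that suggestion (simplified stand-in types, with the DynFlags argument dropped for brevity; these are not GHC's actual definitions):

data Width = W32 | W64
data CmmType = CmmFloat Width | CmmVec Int CmmType

data GlobalReg
  = FloatReg Int
  | DoubleReg Int
  | XmmReg Int CmmType        -- hypothetical extra field carrying the vector type

globalRegType :: GlobalReg -> CmmType
globalRegType (FloatReg _)  = CmmFloat W32
globalRegType (DoubleReg _) = CmmFloat W64
globalRegType (XmmReg _ ty) = ty          -- no guessing needed

With a CmmType on the constructor, the LLVM backend can read off that a given XmmReg holds, e.g., 2 x 64-bit doubles instead of bitcasting at every use.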

On 01/31/2013 12:56 PM, Simon Marlow wrote:
On 31/01/13 11:38, Geoffrey Mainland wrote:
I've pushed my simd branch to darcs.haskell.org. Everything has been rebased against HEAD. Simon PJ and I looked over the changes together already, but I wanted to give you (and everyone on ghc-devs) the opportunity to look things over before I merge to HEAD. Simon PJ and I came up with a few questions/notes for you, but hopefully nothing that should delay a merge.
I'm happy for these to go in - we've already discussed the design a few times, and you've incorporated changes we agreed before, so as far as I'm concerned it's all good. Go for it!
Cool.
* Win32 issues
Modern 32-bit x86 *NIX systems align the stack to 16 bytes, but Win32 aligns only to 4 bytes. LLVM does not assume 16-byte stack alignment. Instead, on platforms where 16-byte stack alignment is not guaranteed, it 1) always outputs a function prologue that 2) aligns the stack to a 16-byte boundary with an "and" instruction, and it also 3) disables tail calls. Because LLVM aligns the stack for a function that has SSE register spills, it also generates movaps instructions (aligned SSE moves) for the spills.
I must be misunderstanding your use of "always" above, because that would imply that the LLVM backend doesn't work on Win32 at all. Maybe LLVM only aligns the stack when it needs to store SSE values?
You are correct---the stack-aligning prologue is only added by LLVM when SSE values are written to the stack, so this wasn't a problem before we had SSE support.
This makes SSE support on Win32 difficult, and in my opinion not worth worrying about.
The alternative is to 1) patch LLVM to disable the stack-alignment code so that we recover the ability to use tail calls and so that ebp isn't scribbled over by the prologue, and 2) patch the mangler to rewrite LLVM's movaps (move aligned) instructions to movups (move unaligned) instructions. I have these patches, but they are not included in the simd branch.
I don't have an opinion here - maybe ask David T what he'd prefer.
Requiring an LLVM hack seems pretty bad, and David yelled when I changed the mangler since he wants to get rid of it eventually. My patches are still around, so if we decide Win32 support is important, I can always add the changes.
* Could we add a CmmType field to GlobalReg's constructors? You'll see that I added a new XmmReg constructor to GlobalReg, but because I don't know the type of an XmmReg, I have to bitcast everywhere in the generated LLVM code because LLVM wants to know not just that a value is a 16-byte vector, but that it is, e.g., a 16-byte vector containing 2 64-bit doubles. Having a CmmType attached to a GlobalReg---or pairing a GlobalReg with a CmmType when assigning registers---would let me avoid all these casts.
We already have a function
globalRegType :: DynFlags -> GlobalReg -> CmmType
so I see that you're guessing in the case of XmmReg. Why not just add the necessary information to XmmReg so that you don't have to guess in globalRegType?
There doesn't seem to be a clear best choice for this extra info. A CmmType seems reasonable, and if I'm adding a CmmType to XmmReg, why not add it everywhere and simplify globalRegType? I'll go ahead and stick with what I have now. Thanks for all your answers. Geoff

On 31 January 2013 09:52, Geoffrey Mainland
On 01/31/2013 12:56 PM, Simon Marlow wrote:
On 31/01/13 11:38, Geoffrey Mainland wrote:
I've pushed my simd branch to darcs.haskell.org. Everything has been rebased against HEAD. Simon PJ and I looked over the changes together already, but I wanted to give you (and everyone on ghc-devs) the opportunity to look things over before I merge to HEAD. Simon PJ and I came up with a few questions/notes for you, but hopefully nothing that should delay a merge.
I'm happy for these to go in - we've already discussed the design a few times, and you've incorporated changes we agreed before, so as far as I'm concerned it's all good. Go for it!
Cool.
* Win32 issues
Modern 32-bit x86 *NIX systems align the stack to 16 bytes, but Win32 aligns only to 4 bytes. LLVM does not assume 16-byte stack alignment. Instead, on platforms where 16-byte stack alignment is not guaranteed, it 1) always outputs a function prologue that 2) aligns the stack to a 16-byte boundary with an "and" instruction, and it also 3) disables tail calls. Because LLVM aligns the stack for a function that has SSE register spills, it also generates movaps instructions (aligned SSE moves) for the spills.
I must be misunderstanding your use of "always" above, because that would imply that the LLVM backend doesn't work on Win32 at all. Maybe LLVM only aligns the stack when it needs to store SSE values?
You are correct---the stack-aligning prologue is only added by LLVM when SSE values are written to the stack, so this wasn't a problem before we had SSE support.
This makes SSE support on Win32 difficult, and in my opinion not worth worrying about.
The alternative is to 1) patch LLVM to disable the stack-alignment code so that we recover the ability to use tail calls and so that ebp isn't scribbled over by the prologue, and 2) patch the mangler to rewrite LLVM's movaps (move aligned) instructions to movups (move unaligned) instructions. I have these patches, but they are not included in the simd branch.
I don't have an opinion here - maybe ask David T what he'd prefer.
Requiring an LLVM hack seems pretty bad, and David yelled when I changed the mangler since he wants to get rid of it eventually. My patches are still around, so if we decide Win32 support is important, I can always add the changes.
Not supporting Win32 sucks, but yes, I want to move to just requiring an un-patched LLVM and no mangler. How ugly are the patches for LLVM? I'd be supportive of them if the plan is to get them merged upstream. Otherwise, I don't think it is worth the effort of having to carry around our own patched LLVM for installation on Windows.

Cheers,
David
* Could we add a CmmType field to GlobalReg's constructors? You'll see that I added a new XmmReg constructor to GlobalReg, but because I don't know the type of an XmmReg, I have to bitcast everywhere in the generated LLVM code because LLVM wants to know not just that a value is a 16-byte vector, but that it is, e.g., a 16-byte vector containing 2 64-bit doubles. Having a CmmType attached to a GlobalReg---or pairing a GlobalReg with a CmmType when assigning registers---would let me avoid all these casts.
We already have a function
globalRegType :: DynFlags -> GlobalReg -> CmmType
so I see that you're guessing in the case of XmmReg. Why not just add the necessary information to XmmReg so that you don't have to guess in globalRegType?
There doesn't seem to be a clear best choice for this extra info. A CmmType seems reasonable, and if I'm adding a CmmType to XmmReg, why not add it everywhere and simplify globalRegType? I'll go ahead and stick with what I have now.
Thanks for all your answers.
Geoff

On 01/31/2013 07:10 PM, David Terei wrote:
On 31 January 2013 09:52, Geoffrey Mainland wrote:
On 01/31/2013 12:56 PM, Simon Marlow wrote:
On 31/01/13 11:38, Geoffrey Mainland wrote:
* Win32 issues
Modern 32-bit x86 *NIX systems align the stack to 16 bytes, but Win32 aligns only to 4 bytes. LLVM does not assume 16-byte stack alignment. Instead, on platforms where 16-byte stack alignment is not guaranteed, it 1) always outputs a function prologue that 2) aligns the stack to a 16-byte boundary with an "and" instruction, and it also 3) disables tail calls. Because LLVM aligns the stack for a function that has SSE register spills, it also generates movaps instructions (aligned SSE moves) for the spills.
I must be misunderstanding your use of "always" above, because that would imply that the LLVM backend doesn't work on Win32 at all. Maybe LLVM only aligns the stack when it needs to store SSE values?
You are correct---the stack-aligning prologue is only added by LLVM when SSE values are written to the stack, so this wasn't a problem before we had SSE support.
This makes SSE support on Win32 difficult, and in my opinion not worth worrying about.
The alternative is to 1) patch LLVM to disable the stack-alignment code so that we recover the ability to use tail calls and so that ebp isn't scribbled over by the prologue, and 2) patch the mangler to rewrite LLVM's movaps (move aligned) instructions to movups (move unaligned) instructions. I have these patches, but they are not included in the simd branch.
I don't have an opinion here - maybe ask David T what he'd prefer.
Requiring an LLVM hack seems pretty bad, and David yelled when I changed the mangler since he wants to get rid of it eventually. My patches are still around, so if we decide Win32 support is important, I can always add the changes.
Not supporting Win32 sucks, but yes, I want to move to just requiring an un-patched LLVM and no mangler. How ugly are the patches for LLVM? I'd be supportive of them if the plan is to get them merged upstream. Otherwise, I don't think it is worth the effort of having to carry around our own patched LLVM for installation on Windows.
The patch against LLVM 3.0 is here:

https://github.com/mainland/ghc-simd-tests/blob/master/patches/llvm-3.0.patc...

If you were to look, you'd see that it's not appropriate for upstream integration. Please don't look :)

Since we have support for Win64 as of GHC 7.6, I vote that we forget about Win32 support for SSE.

Simon, this reminds me of two other issues...

1) SSE vector values are only passed in registers on x86-64 anyway right now. MAX_REAL_FLOAT_REG and MAX_REAL_DOUBLE_REG are both #defined to 0 on x86-32 in includes/stg/MachRegs.h. Are floats and doubles not passed in registers on x86-32? I'm confused as to how this works. The GHC calling convention in LLVM certainly says they are passed in registers.

2) SSE support is processor and platform dependent. What is the proper way for the programmer to know what SSE primitives are available? A CPP define? If so, what should it be called? Right now one can look at the TARGET_* and __GLASGOW_HASKELL_LLVM__ CPP macros and make a decision as to whether or not SSE primitives are available, but that's not a great solution. Also, what happens when we want to add AVX support? How do we control the inclusion of AVX support when building GHC, and how do we let the programmer know that the AVX primops/primtypes are available for use?

Geoff

On Thu, Jan 31, 2013 at 12:30 PM, Geoffrey Mainland
2) SSE support is processor and platform dependent. What is the proper way for the programmer to know what SSE primitives are available? A CPP define? If so, what should it be called?
This needs a combination of compile-time and run-time information. The compiler can tell you what instructions it's willing to use, but you also have to ask the CPU at runtime what it supports, otherwise you'll end up with crashes when people move code from a machine that has SSE4.2 to a machine that has SSE2. Johan added some support for the compile-time bit recently.

On 01/31/2013 08:40 PM, Bryan O'Sullivan wrote:
On Thu, Jan 31, 2013 at 12:30 PM, Geoffrey Mainland wrote:
2) SSE support is processor and platform dependent. What is the proper way for the programmer to know what SSE primitives are available? A CPP define? If so, what should it be called?
This needs a combination of compile-time and run-time information. The compiler can tell you what instructions it's willing to use, but you also have to ask the CPU at runtime what it supports, otherwise you'll end up with crashes when people move code from a machine that has SSE4.2 to a machine that has SSE2.
Johan added some support for the compile-time bit recently.
Yes, I saw his patches to set CPP defines when -msse and -msse2 are passed to GHC. Some sort of decision about, e.g., AVX support will have to be made when GHC is built due to the existence of GHC.PrimopWrappers.

I think you are suggesting that binaries built on a machine that supports SSE4.2 should not *require* SSE4.2 to run just because they were built on a machine with SSE4.2 support? I agree. However, if the user makes a decision at GHC build time to make use of AVX instructions, then we can't expect the resulting binaries to run on a machine that doesn't support AVX.

How does the user specify that GHC should be built with support for AVX primops? And how do we then tell the programmer which set(s) of primops are available?

Geoff

On 31 January 2013 12:30, Geoffrey Mainland
On 01/31/2013 07:10 PM, David Terei wrote:
On 31 January 2013 09:52, Geoffrey Mainland wrote:
On 01/31/2013 12:56 PM, Simon Marlow wrote:
On 31/01/13 11:38, Geoffrey Mainland wrote:
* Win32 issues
Modern 32-bit x86 *NIX systems align the stack to 16 bytes, but Win32 aligns only to 4 bytes. LLVM does not assume 16-byte stack alignment. Instead, on platforms where 16-byte stack alignment is not guaranteed, it 1) always outputs a function prologue that 2) aligns the stack to a 16-byte boundary with an "and" instruction, and it also 3) disables tail calls. Because LLVM aligns the stack for a function that has SSE register spills, it also generates movaps instructions (aligned SSE moves) for the spills.
I must be misunderstanding your use of "always" above, because that would imply that the LLVM backend doesn't work on Win32 at all. Maybe LLVM only aligns the stack when it needs to store SSE values?
You are correct---the stack-aligning prologue is only added by LLVM when SSE values are written to the stack, so this wasn't a problem before we had SSE support.
This makes SSE support on Win32 difficult, and in my opinion not worth worrying about.
The alternative is to 1) patch LLVM to disable the stack-alignment code so that we recover the ability to use tail calls and so that ebp isn't scribbled over by the prologue, and 2) patch the mangler to rewrite LLVM's movaps (move aligned) instructions to movups (move unaligned) instructions. I have these patches, but they are not included in the simd branch.
I don't have an opinion here - maybe ask David T what he'd prefer.
Requiring an LLVM hack seems pretty bad, and David yelled when I changed the mangler since he wants to get rid of it eventually. My patches are still around, so if we decide Win32 support is important, I can always add the changes.
Not supporting Win32 sucks but yes, I want to move to just requiring LLVM un-patched and no mangler. How ugly are the patches for LLVM? I'd be supportive of it if the plan is to get them merged upstream. Otherwise, I don't think it is worth the effort of having to carry around our own patched LLVM for installation on windows.
The patch against LLVM 3.0 is here:
https://github.com/mainland/ghc-simd-tests/blob/master/patches/llvm-3.0.patc...
If you were to look, you'd see that it's not appropriate for upstream integration. Please don't look :)
Done :).
Since we have support for Win64 as of GHC 7.6, I vote that we forget about Win32 support for SSE.
Yes, I meant to ask about Win64. Strongly agreed.
Simon, this reminds me of two other issues...
1) SSE vector values are only passed in registers on x86-64 anyway right now. MAX_REAL_FLOAT_REG and MAX_REAL_DOUBLE_REG are both #defined to 0 on x86-32 in includes/stg/MachRegs.h. Are floats and doubles not passed in registers on x86-32? I'm confused as to how this works. The GHC calling convention in LLVM certainly says they are passed in registers.
Not on x86-32. From the LLVM userguide on the GHC calling convention: "On X86-32 only supports up to 4 bit type parameters. No floating point types are supported. On X86-64 only supports up to 10 bit type parameters and 6 floating point parameters."
2) SSE support is processor and platform dependent. What is the proper way for the programmer to know what SSE primitives are available? A CPP define? If so, what should it be called?
Right now one can look at the TARGET_* and __GLASGOW_HASKELL_LLVM__ CPP macros and make a decision as to whether or not SSE primitives are available, but that's not a great solution. Also, what happens when we want to add AVX support? How do we control the inclusion of AVX support when building GHC, and how do we let the programmer know that the AVX primops/primtypes are available for use?
Geoff

On 31/01/13 20:30, Geoffrey Mainland wrote:
Simon, this reminds me of two other issues...
1) SSE vector values are only passed in registers on x86-64 anyway right now. MAX_REAL_FLOAT_REG and MAX_REAL_DOUBLE_REG are both #defined to 0 on x86-32 in includes/stg/MachRegs.h. Are floats and doubles not passed in registers on x86-32? I'm confused as to how this works. The GHC calling convention in LLVM certainly says they are passed in registers.
No, floats are not passed in registers on x86-32. I don't know about LLVM.
2) SSE support is processor and platform dependent. What is the proper way for the programmer to know what SSE primitives are available? A CPP define? If so, what should it be called?
Right now one can look at the TARGET_* and __GLASGOW_HASKELL_LLVM__ CPP macros and make a decision as to whether or not SSE primitives are available, but that's not a great solution. Also, what happens when we want to add AVX support? How do we control the inclusion of AVX support when building GHC, and how do we let the programmer know that the AVX primops/primtypes are available for use?
We #define __SSE__: http://hackage.haskell.org/trac/ghc/ticket/7554

Similar things would need to happen for AVX.

Cheers,
Simon
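As a minimal sketch of the compile-time side only (assuming the __SSE2__ define added alongside __SSE__; the module and function names are purely illustrative, and the bodies are boxed placeholders rather than real SSE code):

{-# LANGUAGE CPP, BangPatterns #-}
module SumSquares (sumSquares) where

sumSquares :: [Double] -> Double
#if defined(__SSE2__)
-- placeholder standing in for a two-wide, SSE-backed loop
sumSquares = go 0
  where
    go !acc (x:y:rest) = go (acc + x*x + y*y) rest
    go !acc xs         = acc + sum (map (\v -> v*v) xs)
#else
-- scalar fallback
sumSquares = sum . map (\v -> v*v)
#endif

As Bryan notes, this only covers what the compiler was told; a run-time CPU check would still be needed before committing to an instruction-set-specific code path.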

On 02/01/2013 08:03 AM, Simon Marlow wrote:
On 31/01/13 20:30, Geoffrey Mainland wrote:
2) SSE support is processor and platform dependent. What is the proper way for the programmer to know what SSE primitives are available? A CPP define? If so, what should it be called?
Right now one can look at the TARGET_* and __GLASGOW_HASKELL_LLVM__ CPP macros and make a decision as to whether or not SSE primitives are available, but that's not a great solution. Also, what happens when we want to add AVX support? How do we control the inclusion of AVX support when building GHC, and how do we let the programmer know that the AVX primops/primtypes are available for use?
We #define __SSE__: http://hackage.haskell.org/trac/ghc/ticket/7554
Similar things would need to happen for AVX.
If I invoke ghc with -msse on a Win32 box then __SSE__ will be 1, but that doesn't mean I will be able to use SSE primops due to the stack alignment issue on that platform.

Also, the decision as to whether or not SSE primops will be available at all needs to be made when GHC is itself built. How should we expose that knob?

As an aside, what's the proper way for me to test the ARM cross-compilation support? I'm afraid my patches may break things there.

Thanks,
Geoff

On 01/02/13 08:19, Geoffrey Mainland wrote:
On 02/01/2013 08:03 AM, Simon Marlow wrote:
On 31/01/13 20:30, Geoffrey Mainland wrote:
2) SSE support is processor and platform dependent. What is the proper way for the programmer to know what SSE primitives are available? A CPP define? If so, what should it be called?
Right now one can look at the TARGET_* and __GLASGOW_HASKELL_LLVM__ CPP macros and make a decision as to whether or not SSE primitives are available, but that's not a great solution. Also, what happens when we want to add AVX support? How do we control the inclusion of AVX support when building GHC, and how do we let the programmer know that the AVX primops/primtypes are available for use?
We #define __SSE__: http://hackage.haskell.org/trac/ghc/ticket/7554
Similar things would need to happen for AVX.
If I invoke ghc with -msse on a Win32 box then __SSE__ will be 1, but that doesn't mean I will be able to use SSE primops due to the stack alignment issue on that platform.
I guess we should not allow -msse with -fllvm on Win32. Or perhaps not -msse at all.
Also, the decision as to whether or not SSE primops will be available at all needs to be made when GHC is itself built. How should we expose that knob?
Why does it need to be decided at build time? Isn't it dependent on -msse on x86-32, and always available on x86-64?
As an aside, what's the proper way for me to test the ARM cross-compilation support? I'm afraid my patches may break things there.
You could follow the instructions for building an RPi cross-compiler here:

http://hackage.haskell.org/trac/ghc/wiki/Building/Preparation/RaspberryPi

It should be fairly smooth.

Cheers,
Simon

On 02/01/2013 08:47 AM, Simon Marlow wrote:
On 01/02/13 08:19, Geoffrey Mainland wrote:
On 02/01/2013 08:03 AM, Simon Marlow wrote:
On 31/01/13 20:30, Geoffrey Mainland wrote:
2) SSE support is processor and platform dependent. What is the proper way for the programmer to know what SSE primitives are available? A CPP define? If so, what should it be called?
Right now one can look at the TARGET_* and __GLASGOW_HASKELL_LLVM__ CPP macros and make a decision as to whether or not SSE primitives are available, but that's not a great solution. Also, what happens when we want to add AVX support? How do we control the inclusion of AVX support when building GHC, and how do we let the programmer know that the AVX primops/primtypes are available for use?
We #define __SSE__: http://hackage.haskell.org/trac/ghc/ticket/7554
Similar things would need to happen for AVX.
If I invoke ghc with -msse on a Win32 box then __SSE__ will be 1, but that doesn't mean I will be able to use SSE primops due to the stack alignment issue on that platform.
I guess we should not allow -msse with -fllvm on Win32. Or perhaps not -msse at all.
Hm, OK, but that would also mean external C files could not be compiled with -msse when using the LLVM back-end on Win32, since we pass the flags through to the C compiler, right?
Also, the decision as to whether or not SSE primops will be available at all needs to be made when GHC is itself built. How should we expose that knob?
Why does it need to be decided at build time? Isn't it dependent on -msse on x86-32, and always available on x86-64?
The primops would need to be compiled into GHC for -msse to be able to expose them. In general, how would we make the set of primops and primtypes a function of DynFlags? It's my understanding that this isn't possible right now. Or is it? And what about GHC.PrimopWrappers?
As an aside, what's the proper way for me to test the ARM cross-compilation support? I'm afraid my patches may break things there.
You could follow the instructions for building an RPi cross-compiler here:
http://hackage.haskell.org/trac/ghc/wiki/Building/Preparation/RaspberryPi
it should be fairly smooth.
Thanks! Geoff

On 02/ 1/13 09:19 AM, Geoffrey Mainland wrote:
As an aside, what's the proper way for me to test the ARM cross-compilation support? I'm afraid my patches may break things there.
I've seen you've merged your changes into mainline so I've done a build of GHC HEAD on my arm/linux and it's gone fine so you've not broken anything -- at least from the build perspective. Thanks, Karel

On 02/02/2013 09:37 AM, Karel Gardas wrote:
On 02/ 1/13 09:19 AM, Geoffrey Mainland wrote:
As an aside, what's the proper way for me to test the ARM cross-compilation support? I'm afraid my patches may break things there.
I've seen you've merged your changes into mainline so I've done a build of GHC HEAD on my arm/linux and it's gone fine so you've not broken anything -- at least from the build perspective.
Thanks, Karel
Thanks for the confirmation. I followed the instructions for building the Raspberry Pi cross GHC and tested the simd branch before I merged, but I'm glad to know I didn't break anything obvious for you either! Geoff

I'm really excited to see this merged in! Props on all involved
*question 1*: Will this be included in the upcoming 7.8 release?
*question 2*: I see that some of the useful (albeit specialized) SSE primops aren't included, though it looks like adding them (at least for platforms that support them) would be largely mechanical..... If adding those primops is something GHC HQ would welcome (ignoring the whole supporting-SSE2-vs-full-AVX discussion), I'm more than happy to spend some time turning the crank to add those primops.
thanks
-Carter Schonwald
On Sat, Feb 2, 2013 at 4:46 AM, Geoffrey Mainland
On 02/02/2013 09:37 AM, Karel Gardas wrote:
On 02/ 1/13 09:19 AM, Geoffrey Mainland wrote:
As an aside, what's the proper way for me to test the ARM cross-compilation support? I'm afraid my patches may break things there.
I've seen you've merged your changes into mainline so I've done a build of GHC HEAD on my arm/linux and it's gone fine so you've not broken anything -- at least from the build perspective.
Thanks, Karel
Thanks for the confirmation. I followed the instructions for building the Raspberry Pi cross GHC and tested the simd branch before I merged, but I'm glad to know I didn't break anything obvious for you either!
Geoff

On 02/04/2013 09:34 PM, Carter Schonwald wrote:
I'm really excited to see this merged in! Props on all involved
question 1: Will this be included in the upcoming 7.8 release?
Yes, that's the plan!
question 2: I see that some of the useful (albeit specialized) SSE primops aren't included, though it looks like adding them (at least for platforms that support them) would be largely mechanical..... If adding those primops is something GHC HQ would welcome (ignoring the sorting out the whole supporting SSE2 vs full AVX discussion), I'm more than happy to spend some time turning the crank to add those primops.
I'd like to figure out how to properly support having the set of available primops depend on the dynamic flags before adding too much more. I'll be speaking to Simon PJ about it tomorrow.

Do you have specific needs for any missing primops? If so, I'd like to know---customers are good :)

We talked a while ago about you possibly cooking up some sample programs that needed SSE instructions. Have there been any recent developments?

Thanks,
Geoff
thanks -Carter Schonwald
On Sat, Feb 2, 2013 at 4:46 AM, Geoffrey Mainland wrote:
On 02/02/2013 09:37 AM, Karel Gardas wrote:
On 02/ 1/13 09:19 AM, Geoffrey Mainland wrote:
As an aside, what's the proper way for me to test the ARM cross-compilation support? I'm afraid my patches may break things there.
I've seen you've merged your changes into mainline so I've done a build of GHC HEAD on my arm/linux and it's gone fine so you've not broken anything -- at least from the build perspective.
Thanks, Karel
Thanks for the confirmation. I followed the instructions for building the Raspberry Pi cross GHC and tested the simd branch before I merged, but I'm glad to know I didn't break anything obvious for you either!
Geoff

On Mon, Feb 4, 2013 at 2:09 PM, Geoffrey Mainland
I'd like to figure out how to properly support having the set of available primops depend on the dynamic flags before adding too much more. I'll be speaking to Simon PJ about it tomorrow.
Could we use a fallback, like we did for e.g. popcount? I don't think having conditionally defined primops is a good idea. How would you use them in programs? You'd have to do something like:

#ifdef ???
-- use primops
#else
-- use fallback
#endif

and everyone would write their own fallback. It would be better if GHC fell back to some generic implementation.

-- Johan

On 02/04/2013 10:12 PM, Johan Tibell wrote:
On Mon, Feb 4, 2013 at 2:09 PM, Geoffrey Mainland wrote:
I'd like to figure out how to properly support having the set of available primops depend on the dynamic flags before adding too much more. I'll be speaking to Simon PJ about it tomorrow.
Could we use a fallback, like we did for e.g. popcount? I don't think having conditionally defined primops is a good idea. How would you use them in programs? You'd have to do something like:
#ifdef ???
-- use primops
#else
-- use fallback
#endif
and everyone would write their own fallback. It would be better if GHC fell back to some generic implementation.
-- Johan
What would a sensible fallback be for AVX instructions? What should we fall back on when the LLVM backend is not being used?

Maybe we could desugar AVX instructions to SSE instructions on platforms that support SSE but not AVX, but in practice people would then #ifdef anyway and just use SSE if AVX weren't available.

The current idea is to hide the #ifdefs in a library. Clients of the library would then get the "best" short-vector implementation available for their platform by using this library. Right now this library is a modified version of primitive, and I have modified versions of vector and DPH that use this version of the primitive library to generate SSE code.

I am certainly open to alternative designs.

Geoff

On Mon, Feb 4, 2013 at 3:19 PM, Geoffrey Mainland
What would a sensible fallback be for AVX instructions? What should we fall back on when the LLVM backend is not being used?
Depends on the instruction. A 256-bit multiply could be replaced by N multiplies etc. For popcount we have a little bit of C code in ghc-prim that we use if SSE 4.2 isn't enabled. An alternative is to emit some different assembly in e.g. the x86-64 backend if AVX isn't enabled.
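For instance, the kind of lowering meant here could look like this sketch (boxed stand-ins with illustrative names, not the real vector primops): a 4-wide multiply becomes two 2-wide multiplies when AVX is unavailable.

data DoubleX2 = DoubleX2 !Double !Double
data DoubleX4 = DoubleX4 !DoubleX2 !DoubleX2

timesDoubleX2 :: DoubleX2 -> DoubleX2 -> DoubleX2
timesDoubleX2 (DoubleX2 a b) (DoubleX2 c d) = DoubleX2 (a*c) (b*d)

-- fallback: one 4-wide multiply expressed as two 2-wide multiplies
timesDoubleX4 :: DoubleX4 -> DoubleX4 -> DoubleX4
timesDoubleX4 (DoubleX4 lo1 hi1) (DoubleX4 lo2 hi2) =
  DoubleX4 (timesDoubleX2 lo1 lo2) (timesDoubleX2 hi1 hi2)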
Maybe we could desugar AVX instructions to SSE instructions on platforms that support SSE but not AVX, but in practice people would then #ifdef anyway and just use SSE if AVX weren't available.
I don't follow here. If you conditionally emitted different instructions in the backends depending on which -m flags are passed to GHC, why would people #ifdef?
The current idea is to hide the #ifdefs in a library. Clients of the library would then get the "best" short-vector implementation available for their platform by using this library. Right now this library is a modified version of primitive, and I have modified versions of vector and DPH that use this version of the primitive library to generate SSE code.
You would still end up with an GHC.Exts that exports a different API depending on which flags (e.g. -m<something>) are passed to GHC. Couldn't you use ghc-prim for your fallbacks and have GHC.Exts.yourPrimOp use either those fallbacks or the AVX instructions. -- Johan

On 02/04/2013 11:56 PM, Johan Tibell wrote:
On Mon, Feb 4, 2013 at 3:19 PM, Geoffrey Mainland wrote:
What would a sensible fallback be for AVX instructions? What should we fall back on when the LLVM backend is not being used?
Depends on the instruction. A 256-bit multiply could be replaced by N multiplies etc. For popcount we have a little bit of C code in ghc-prim that we use if SSE 4.2 isn't enabled. An alternative is to emit some different assembly in e.g. the x86-64 backend if AVX isn't enabled.
Maybe we could desugar AVX instructions to SSE instructions on platforms that support SSE but not AVX, but in practice people would then #ifdef anyway and just use SSE if AVX weren't available.
I don't follow here. If you conditionally emitted different instructions in the backends depending on which -m flags are passed to GHC, why would people #ifdef?
I think you are suggesting that the user should always use 256-bit short-vector instructions, and that on platforms where AVX is not available, this would fall back to an implementation that performed multiple SSE instructions for each 256-bit vector instruction---and used multiple XMM registers to hold each 256-bit vector value (or spilled).

Anyone using low-level primops should only do so if they really want low-level control. The most efficient SSE implementation of a function is not going to be whatever implementation falls out of a desugaring of generic 256-bit short-vector primitives. Therefore, I suspect that anyone using low-level vector primops like this will #ifdef and provide two implementations---one for SSE, one for AVX. Anyone who doesn't care about this level of detail should use a higher-level interface---which we have already implemented---and which does not require any ifdefs. People will #ifdef because they can provide better SSE implementations than GHC when AVX instructions are not available.

I am suggesting that we push the "ifdefs" into a library. The vast majority of programmers will never see the ifdefs, because they will use the library.

I think you are suggesting that we push the "ifdefs" into GHC. That way nobody will have a choice---they get whatever desugaring GHC gives them.

I understand your point of view---having primops that don't work everywhere is a real pain and aesthetically unpleasing---but I prefer exposing more low-level details in our primops even if it means a bit of unpleasantness once in a while. This does mean a tiny segment of programmers will have to deal with ifdefs, but I suspect that this tiny segment of programmers would prefer ifdefs to a lack of control.

If a population count operation translates to a few extra instructions, I don't think anyone will care. If a body of code performing short-vector operations desugars to twice as many instructions that require twice as many registers, thereby resulting in a bunch of extra spills, it will matter. Put differently, there is a more-or-less canonical desugaring of population count. For a given function using short-vector instructions of one width, there is not a canonical desugaring into a function using short-vector instructions of a lesser width.
The current idea is to hide the #ifdefs in a library. Clients of the library would then get the "best" short-vector implementation available for their platform by using this library. Right now this library is a modified version of primitive, and I have modified versions of vector and DPH that use this version of the primitive library to generate SSE code.
You would still end up with an GHC.Exts that exports a different API depending on which flags (e.g. -m<something>) are passed to GHC. Couldn't you use ghc-prim for your fallbacks and have GHC.Exts.yourPrimOp use either those fallbacks or the AVX instructions.
This is basically what I've implemented, except there is a Multi type family that "picks" the appropriate short-vector representation for a type, e.g., DoubleX2# for Double on machines with SSE, DoubleX4# for Double on machines with AVX, and an accompanying set of short-vector operations.

We have a concrete design and implementation---take a look at the primitive, vector, and dph packages on my github page (http://github.com/mainland). I would be very happy to discuss any concrete alternative design. We also have a paper with some performance measurements (http://www.eecs.harvard.edu/~mainland/publications/mainland12simd.pdf). I would not be thrilled with a design that resulted in significantly worse benchmarks.

Geoff
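A rough sketch of the shape of that Multi family (boxed stand-ins and a hypothetical __AVX__ define used purely for illustration; the real definitions, built on DoubleX2#/DoubleX4#, live in the modified primitive package):

{-# LANGUAGE CPP, TypeFamilies #-}
module Multi (Multi, DoubleX2(..), DoubleX4(..)) where

data DoubleX2 = DoubleX2 !Double !Double
data DoubleX4 = DoubleX4 !Double !Double !Double !Double

-- the ifdef lives here, in the library, so client code never sees it
type family Multi a
#if defined(__AVX__)
type instance Multi Double = DoubleX4
#else
type instance Multi Double = DoubleX2
#endif

Client code is written against Multi Double and picks up the widest representation the library was built for.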

I'm noticing that the linked paper (very nice results!) mentions prefetch primops that were added to GHC. Is there any documentation, current or pending?

https://github.com/mainland/vector/commit/cfce37d3a9c228fe4bdf627ffb777399f5... seems to have the relevant prim ops mentioned in the paper.
thanks
-Carter
On Mon, Feb 4, 2013 at 7:36 PM, Geoffrey Mainland
On 02/04/2013 11:56 PM, Johan Tibell wrote:
On Mon, Feb 4, 2013 at 3:19 PM, Geoffrey Mainland wrote:
What would a sensible fallback be for AVX instructions? What should we fall back on when the LLVM backend is not being used?
Depends on the instruction. A 256-bit multiply could be replaced by N multiplies etc. For popcount we have a little bit of C code in ghc-prim that we use if SSE 4.2 isn't enabled. An alternative is to emit some different assembly in e.g. the x86-64 backend if AVX isn't enabled.
Maybe we could desugar AVX instructions to SSE instructions on platforms that support SSE but not AVX, but in practice people would then #ifdef anyway and just use SSE if AVX weren't available.
I don't follow here. If you conditionally emitted different instructions in the backends depending on which -m flags are passed to GHC, why would people #ifdef?
I think you are suggesting that the user should always use 256-bit short-vector instructions, and that on platforms where AVX is not available, this would fall back to an implementation that performed multiple SSE instructions for each 256-bit vector instruction---and used multiple XMM registers to hold each 256-bit vector value (or spilled).
Anyone using low-level primops should only do so if they really want low-level control. The most efficient SSE implementation of a function is not going to be whatever implementation falls out of a desugaring of generic 256-bit short-vector primitives. Therefore, I suspect that anyone using low-level vector primops like this will #ifdef and provide two implementations---one for SSE, one for AVX. Anyone who doesn't care about this level of detail should use a higher-level interface---which we have already implemented---and which does not require any ifdefs. People will #ifdef because they can provide better SSE implementations than GHC when AVX instructions are not available.
I am suggesting that we push the "ifdefs" into a library. The vast majority of programmers will never see the ifdefs, because they will use the library.
I think you are suggesting that we push the "ifdefs" into GHC. That way nobody will have a choice---they get whatever desugaring GHC gives them.
I understand your point of view---having primops that don't work everywhere is a real pain and aesthetically unpleasing---but I prefer exposing more low-level details in our primops even if it means a bit of unpleasantness once in a while. This does mean a tiny segment of programmers will have to deal with ifdefs, but I suspect that this tiny segment of programmers would prefer ifdefs to a lack of control.
If a population count operation translates to a few extra instructions, I don't think anyone will care. If a body of code performing short-vector operations desugars to twice as many instructions that require twice as many registers, thereby resulting in a bunch of extra spills, it will matter. Put differently, there is a more-or-less canonical desugaring of population count. For a given function using short-vector instructions of one width, there is not a canonical desugaring into a function using short-vector instructions of a lesser width.
The current idea is to hide the #ifdefs in a library. Clients of the library would then get the "best" short-vector implementation available for their platform by using this library. Right now this library is a modified version of primitive, and I have modified versions of vector and DPH that use this version of the primitive library to generate SSE code.
You would still end up with an GHC.Exts that exports a different API depending on which flags (e.g. -m<something>) are passed to GHC. Couldn't you use ghc-prim for your fallbacks and have GHC.Exts.yourPrimOp use either those fallbacks or the AVX instructions.
This is basically what I've implemented, except there is a Multi type family that "picks" the appropriate short-vector representation for a type, e.g., DoubleX2# for Double on machines with SSE, DoubleX4# for Double on machines with AVX, and an accompanying set of short-vector operations.
We have a concrete design and implementation---take a look at the primitive, vector, and dph packages on my github page (http://github.com/mainland). I would be very happy to discuss any concrete alternative design. We also have a paper with some performance measurements (http://www.eecs.harvard.edu/~mainland/publications/mainland12simd.pdf). I would not be thrilled with a design that resulted in significantly worse benchmarks.
Geoff

On 05/02/13 00:36, Geoffrey Mainland wrote:
On 02/04/2013 11:56 PM, Johan Tibell wrote:
On Mon, Feb 4, 2013 at 3:19 PM, Geoffrey Mainland wrote:
What would a sensible fallback be for AVX instructions? What should we fall back on when the LLVM backend is not being used?
Depends on the instruction. A 256-bit multiply could be replaced by N multiplies etc. For popcount we have a little bit of C code in ghc-prim that we use if SSE 4.2 isn't enabled. An alternative is to emit some different assembly in e.g. the x86-64 backend if AVX isn't enabled.
Maybe we could desugar AVX instructions to SSE instructions on platforms that support SSE but not AVX, but in practice people would then #ifdef anyway and just use SSE if AVX weren't available.
I don't follow here. If you conditionally emitted different instructions in the backends depending on which -m flags are passed to GHC, why would people #ifdef?
I think you are suggesting that the user should always use 256-bit short-vector instructions, and that on platforms where AVX is not available, this would fall back to an implementation that performed multiple SSE instructions for each 256-bit vector instruction---and used multiple XMM registers to hold each 256-bit vector value (or spilled).
Anyone using low-level primops should only do so if they really want low-level control. The most efficient SSE implementation of a function is not going to be whatever implementation falls out of a desugaring of generic 256-bit short-vector primitives. Therefore, I suspect that anyone using low-level vector primops like this will #ifdef and provide two implementations---one for SSE, one for AVX. Anyone who doesn't care about this level of detail should use a higher-level interface---which we have already implemented---and which does not require any ifdefs. People will #ifdef because they can provide better SSE implementations than GHC when AVX instructions are not available.
I am suggesting that we push the "ifdefs" into a library. The vast majority of programmers will never see the ifdefs, because they will use the library.
I think you are suggesting that we push the "ifdefs" into GHC. That way nobody will have a choice---they get whatever desugaring GHC gives them.
I understand your point of view---having primops that don't work everywhere is a real pain and aesthetically unpleasing---but I prefer exposing more low-level details in our primops even if it means a bit of unpleasantness once in a while. This does mean a tiny segment of programmers will have to deal with ifdefs, but I suspect that this tiny segment of programmers would prefer ifdefs to a lack of control.
If a population count operation translates to a few extra instructions, I don't think anyone will care. If a body of code performing short-vector operations desugars to twice as many instructions that require twice as many registers, thereby resulting in a bunch of extra spills, it will matter. Put differently, there is a more-or-less canonical desugaring of population count. For a given function using short-vector instructions of one width, there is not a canonical desugaring into a function using short-vector instructions of a lesser width.
While I agree with Geoff, there's one thing we have to be careful about: inlining. If the primop is exposed via an inline definition, then either we have to check and disable the inlining if the primop is not available in the current compilation, or else prevent the inlining from being visible in the first place. I believe this is what Johan had in mind when he gave popcount a fallback.

Geoff, maybe you've thought about this already - what's the plan for the vector library?

Cheers,
Simon
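One way to do the second of those, sketched here with illustrative names and a hypothetical __AVX__ define (this is not what vector actually does): keep the primop-using path behind a binding whose unfolding is never exported, so modules compiled without the primop available cannot inline it.

{-# LANGUAGE CPP #-}
module VecOps (plusVec) where

plusVec :: [Double] -> [Double] -> [Double]
#if defined(__AVX__)
{-# NOINLINE plusVec #-}   -- hide the unfolding that would mention AVX primops
plusVec = zipWith (+)      -- placeholder for an AVX-backed body
#else
{-# INLINE plusVec #-}     -- the generic path is safe to inline anywhere
plusVec = zipWith (+)
#endif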

On 02/05/2013 09:06 AM, Simon Marlow wrote:
On 05/02/13 00:36, Geoffrey Mainland wrote:
On 02/04/2013 11:56 PM, Johan Tibell wrote:
On Mon, Feb 4, 2013 at 3:19 PM, Geoffrey Mainland wrote:
What would a sensible fallback be for AVX instructions? What should we fall back on when the LLVM backend is not being used?
Depends on the instruction. A 256-bit multiply could be replaced by N multiplies etc. For popcount we have a little bit of C code in ghc-prim that we use if SSE 4.2 isn't enabled. An alternative is to emit some different assembly in e.g. the x86-64 backend if AVX isn't enabled.
Maybe we could desugar AVX instructions to SSE instructions on platforms that support SSE but not AVX, but in practice people would then #ifdef anyway and just use SSE if AVX weren't available.
I don't follow here. If you conditionally emitted different instructions in the backends depending on which -m flags are passed to GHC, why would people #ifdef?
I think you are suggesting that the user should always use 256-bit short-vector instructions, and that on platforms where AVX is not available, this would fall back to an implementation that performed multiple SSE instructions for each 256-bit vector instruction---and used multiple XMM registers to hold each 256-bit vector value (or spilled).
Anyone using low-level primops should only do so if they really want low-level control. The most efficient SSE implementation of a function is not going to be whatever implementation falls out of a desugaring of generic 256-bit short-vector primitives. Therefore, I suspect that anyone using low-level vector primops like this will #ifdef and provide two implementations---one for SSE, one for AVX. Anyone who doesn't care about this level of detail should use a higher-level interface---which we have already implemented---and which does not require any ifdefs. People will #ifdef because they can provide better SSE implementations than GHC when AVX instructions are not available.
I am suggesting that we push the "ifdefs" into a library. The vast majority of programmers will never see the ifdefs, because they will use the library.
I think you are suggesting that we push the "ifdefs" into GHC. That way nobody will have a choice---they get whatever desugaring GHC gives them.
I understand your point of view---having primops that don't work everywhere is a real pain and aesthetically unpleasing---but I prefer exposing more low-level details in our primops even if it means a bit of unpleasantness once in a while. This does mean a tiny segment of programmers will have to deal with ifdefs, but I suspect that this tiny segment of programmers would prefer ifdefs to a lack of control.
If a population count operation translates to a few extra instructions, I don't think anyone will care. If a body of code performing short-vector operations desugars to twice as many instructions that require twice as many registers, thereby resulting in a bunch of extra spills, it will matter. Put differently, there is a more-or-less canonical desugaring of population count. For a given function using short-vector instructions of one width, there is not a canonical desugaring into a function using short-vector instructions of a lesser width.
While I agree with Geoff, there's one thing we have to be careful about: inlining. If the primop is exposed via an inline definition, then either we have to check and disable the inlining if the primop is not available in the current compilation, or else prevent the inlining from being visible in the first place.
I believe this is what Johan had in mind when he gave popcount a fallback. Geoff, maybe you've thought about this already - what's the plan for the vector library?
Cheers, Simon
Right now, the short-vector primops are only visible if you use the -fllvm switch when compiling. If you compile the vector package with -fllvm and then try to use this package with the native back end and an SSE primop gets inlined, the native back end will error out and tell you to use -fllvm. This is not a good solution.

On the one hand, if you use an -msse4.2-compiled C library on a machine without SSE 4.2 support, you should not expect it to work. I would be fine with a world in which compiling the vector library with -mavx would result in a package that the compiler would not allow the programmer to use from a program that wasn't also compiled with -mavx, i.e., a world in which the compiler checked flag compatibility. Having two back-ends makes things more difficult, because we certainly don't want a package compiled with -fllvm to be unusable from the native back end. I don't have a good solution.

I am assuming that we decide that having the set of available primops be a function of DynFlags is OK. Then there are two problems.

1) What mechanism do we add to GHC to make the set of available primops be a function of DynFlags? Right now we have an llvm_only attribute in compiler/prelude/primops.txt.pp so that the SSE primops are only available when using the LLVM back end. This is a stopgap measure and not correct. What's the right way to do it?

How do we then communicate to the user which primops are available? -msse, and thus __SSE__, doesn't mean that the SSE primops are available, because we might be using the native back end. Does the user have to test __SSE__ and __GLASGOW_HASKELL_LLVM__ to know that the SSE primops are available? That's not a good solution, and it certainly doesn't scale. Note that when I say "user" I mean the person who writes the Multi type family instances---even in the current ifdef purgatory situation, most users can use the Multi type family without worrying about ifdefs.

2) What do we do about unfoldings? As a straw-man proposal, we could find a subset of DynFlags that uniquely determines the set of available primops, and then disable unfoldings that come from a module that was compiled with incompatible DynFlags. Once we solve (1), we could (I think) straightforwardly implement this.

Geoff

If I understand the code correctly, you use __GLASGOW_HASKELL_LLVM__ to make sure that AVX instructions are available. But using LLVM is no guarantee of that; it depends on what -m flags are passed to LLVM. Presumably LLVM already does what I suggest and lowers vector instructions to different instructions depending on the actual compilation target. Or does it fail altogether if the target architecture doesn't support vector instructions?

On 05/02/13 10:34, Geoffrey Mainland wrote:
On 02/05/2013 09:06 AM, Simon Marlow wrote:
On 05/02/13 00:36, Geoffrey Mainland wrote:
On 02/04/2013 11:56 PM, Johan Tibell wrote:
On Mon, Feb 4, 2013 at 3:19 PM, Geoffrey Mainland wrote:
What would a sensible fallback be for AVX instructions? What should we fall back on when the LLVM backend is not being used?
Depends on the instruction. A 256-bit multiply could be replaced by N multiplies etc. For popcount we have a little bit of C code in ghc-prim that we use if SSE 4.2 isn't enabled. An alternative is to emit some different assembly in e.g. the x86-64 backend if AVX isn't enabled.
Maybe we could desugar AVX instructions to SSE instructions on platforms that support SSE but not AVX, but in practice people would then #ifdef anyway and just use SSE if AVX weren't available.
I don't follow here. If you conditionally emitted different instructions in the backends depending on which -m flags are passed to GHC, why would people #ifdef?
I think you are suggesting that the user should always use 256-bit short-vector instructions, and that on platforms where AVX is not available, this would fall back to an implementation that performed multiple SSE instructions for each 256-bit vector instruction---and used multiple XMM registers to hold each 256-bit vector value (or spilled).
Anyone using low-level primops should only do so if they really want low-level control. The most efficient SSE implementation of a function is not going to be whatever implementation falls out of a desugaring of generic 256-bit short-vector primitives. Therefore, I suspect that anyone using low-level vector primops like this will #ifdef and provide two implementations---one for SSE, one for AVX. Anyone who doesn't care about this level of detail should use a higher-level interface---which we have already implemented---and which does not require any ifdefs. People will #ifdef because they can provide better SSE implementations than GHC when AVX instructions are not available.
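As a rough illustration of the two-implementation style meant here: one hand-tuned path per instruction set, selected at compile time. dotp and both bodies are placeholders, not code from the branch; real versions would be hand-written AVX and SSE loops over the vector primops, and the example assumes __AVX__ is defined when compiling with -mavx, as discussed in this thread.

    {-# LANGUAGE CPP #-}
    -- Sketch only: the library author provides two hand-tuned paths and
    -- picks one at compile time, instead of letting the compiler desugar
    -- 256-bit operations into pairs of 128-bit ones.
    module Dotp (dotp) where

    #if defined(__AVX__)
    -- AVX build: a 256-bit (4 x Double) loop would go here.
    dotp :: [Double] -> [Double] -> Double
    dotp xs ys = sum (zipWith (*) xs ys)
    #else
    -- SSE-only build: a separately tuned 128-bit (2 x Double) loop, rather
    -- than whatever falls out of desugaring the 256-bit version.
    dotp :: [Double] -> [Double] -> Double
    dotp xs ys = sum (zipWith (*) xs ys)
    #endif
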
I am suggesting that we push the "ifdefs" into a library. The vast majority of programmers will never see the ifdefs, because they will use the library.
I think you are suggesting that we push the "ifdefs" into GHC. That way nobody will have a choice---they get whatever desugaring GHC gives them.
I understand your point of view---having primops that don't work everywhere is a real pain and aesthetically unpleasing---but I prefer exposing more low-level details in our primops even if it means a bit of unpleasantness once in a while. This does mean a tiny segment of programmers will have to deal with ifdefs, but I suspect that this tiny segment of programmers would prefer ifdefs to a lack of control.
If a population count operation translates to a few extra instructions, I don't think anyone will care. If a body of code performing short-vector operations desugars to twice as many instructions that require twice as many registers, thereby resulting in a bunch of extra spills, it will matter. Put differently, there is a more-or-less canonical desugaring of population count. For a given function using short-vector instructions of one width, there is not a canonical desugaring into a function using short-vector instructions of a lesser width.
While I agree with Geoff, there's one thing we have to be careful about: inlining. If the primop is exposed via an inline definition, then either we have to check and disable the inlining if the primop is not available in the current compilation, or else prevent the inlining from being visible in the first place.
I believe this is what Johan had in mind when he gave popcount a fallback. Geoff, maybe you've thought about this already - what's the plan for the vector library?
Cheers, Simon
Right now, the short-vector primops are only visible if you use the -fllvm switch when compiling. If you compile the vector package with -fllvm and then try to use this package with the native back end and an SSE primop gets inlined, the native back end will error out and tell you to use -fllvm. This is not a good solution.
On the one hand, if you use an -msse4.2-compiled C library on a machine without SSE 4.2 support, you should not expect it to work. I would be fine with a world in which compiling the vector library with -mavx would result in a package that the compiler would not allow the programmer to use from a program that wasn't also compiled with -mavx, i.e., a world in which the compiler checked flag compatibility. Having two back-ends makes things more difficult, because we certainly don't want a package compiled with -fllvm to be unusable from the native back end.
I don't have a good solution. I am assuming that we decide that having the set of available primops be a function of DynFlags is OK. Then there are two problems.
I think it will be difficult to make the set of primops vary depending on flags. The reason is that the contents of GHC.Prim is re-exported by various modules: GHC.Base and GHC.Exts for example, and each of those .hi files lists the names of the exported primops. So we can't change the set after these modules have been compiled. (well we can, but odd things will happen).

So I think GHC.Prim should always export the full set of primops.

It is OK for compilation to fail if the source code mentions an unsupported primop.

What about unfoldings? We cannot have compilation failing if an unsupported primop gets inlined into the current module, that is a non-deterministic compilation failure. So then we have two options:

1) disable an unfolding if it contains an unsupported primop
2) implement unsupported primops via fallback C functions

Both options lead to performance problems, so we want the compiler to warn if this happens. But we cannot fail the compilation.

If we do (2), then we don't have to make it an error to use unsupported primops directly, but it should at least be a warning.

Fallbacks are reasonably easy to implement I think: gcc provides generic vector operations that compile on any target (if I'm understanding the docs correctly).

I suppose I don't mind whether we do (1) or (2).

Cheers, Simon
1) What mechanism do we add to GHC to make the set of available primops be a function of DynFlags? Right now we have a llvm_only attribute in compiler/prelude/primops.txt.pp so that the SSE primops are only available when using the LLVM back end. This is a stopgap measure and not correct. What's the right way to do it? How do we then communicate to the user which primops are available? -msse, and thus __SSE__, doesn't mean that the SSE primops are available, because we might be using the native back end. Does the user have to test __SSE__ and __GLASGOW_HASKELL_LLVM__ to know that the SSE primops are available? That's not a good solution, and it certainly doesn't scale. Note that when I say "user" I mean the person who writes the Multi type family instances---even in the current ifdef purgatory situation, most users can use the Multi type family without worrying about ifdefs.
2) What do we do about unfoldings? As a straw-man proposal, we could find a subset of DynFlags that uniquely determines the set of available primops, and then disable unfoldings that come from a module that was compiled with incompatible DynFlags. Once we solve (1), we could (I think) straightforwardly implement this.
Geoff

On 05/02/13 10:34, Geoffrey Mainland wrote:
On 02/05/2013 09:06 AM, Simon Marlow wrote:
On 05/02/13 00:36, Geoffrey Mainland wrote:
On 02/04/2013 11:56 PM, Johan Tibell wrote:
On Mon, Feb 4, 2013 at 3:19 PM, Geoffrey Mainland wrote:
What would a sensible fallback be for AVX instructions? What should we fall back on when the LLVM backend is not being used?
Depends on the instruction. A 256-bit multiply could be replaced by N multiplies etc. For popcount we have a little bit of C code in ghc-prim that we use if SSE 4.2 isn't enabled. An alternative is to emit some different assembly in e.g. the x86-64 backend if AVX isn't enabled.
Maybe we could desugar AVX instructions to SSE instructions on platforms that support SSE but not AVX, but in practice people would then #ifdef anyway and just use SSE if AVX weren't available.
I don't follow here. If you conditionally emitted different instructions in the backends depending on which -m flags are passed to GHC, why would people #ifdef?
I think you are suggesting that the user should always use 256-bit short-vector instructions, and that on platforms where AVX is not available, this would fall back to an implementation that performed multiple SSE instructions for each 256-bit vector instruction---and used multiple XMM registers to hold each 256-bit vector value (or spilled).
Anyone using low-level primops should only do so if they really want low-level control. The most efficient SSE implementation of a function is not going to be whatever implementation falls out of a desugaring of generic 256-bit short-vector primitives. Therefore, I suspect that anyone using low-level vector primops like this will #ifdef and provide two implementations---one for SSE, one for AVX. Anyone who doesn't care about this level of detail should use a higher-level interface---which we have already implemented---and which does not require any ifdefs. People will #ifdef because they can provide better SSE implementations than GHC when AVX instructions are not available.
I am suggesting that we push the "ifdefs" into a library. The vast majority of programmers will never see the ifdefs, because they will use the library.
I think you are suggesting that we push the "ifdefs" into GHC. That way nobody will have a choice---they get whatever desugaring GHC gives them.
I understand your point of view---having primops that don't work everywhere is a real pain and aesthetically unpleasing---but I prefer exposing more low-level details in our primops even if it means a bit of unpleasantness once in a while. This does mean a tiny segment of programmers will have to deal with ifdefs, but I suspect that this tiny segment of programmers would prefer ifdefs to a lack of control.
If a population count operation translates to a few extra instructions, I don't think anyone will care. If a body of code performing short-vector operations desugars to twice as many instructions that require twice as many registers, thereby resulting in a bunch of extra spills, it will matter. Put differently, there is a more-or-less canonical desugaring of population count. For a given function using short-vector instructions of one width, there is not a canonical desugaring into a function using short-vector instructions of a lesser width.
While I agree with Geoff, there's one thing we have to be careful about: inlining. If the primop is exposed via an inline definition, then either we have to check and disable the inlining if the primop is not available in the current compilation, or else prevent the inlining from being visible in the first place.
I believe this is what Johan had in mind when he gave popcount a fallback. Geoff, maybe you've thought about this already - what's the plan for the vector library?
Cheers, Simon
Right now, the short-vector primops are only visible if you use the -fllvm switch when compiling. If you compile the vector package with -fllvm and then try to use this package with the native back end and an SSE primop gets inlined, the native back end will error out and tell you to use -fllvm. This is not a good solution.
On the one hand, if you use an -msse4.2-compiled C library on a machine without SSE 4.2 support, you should not expect it to work. I would be fine with a world in which compiling the vector library with -mavx would result in a package that the compiler would not allow the programmer to use from a program that wasn't also compiled with -mavx, i.e., a world in which the compiler checked flag compatibility. Having two back-ends makes things more difficult, because we certainly don't want a package compiled with -fllvm to be unusable from the native back end.
I don't have a good solution. I am assuming that we decide that having the set of available primops be a function of DynFlags is OK. Then there are two problems.
On 02/06/2013 09:24 AM, Simon Marlow wrote:
I think it will be difficult to make the set of primops vary depending on flags. The reason is that the contents of GHC.Prim is re-exported by various modules: GHC.Base and GHC.Exts for example, and each of those .hi files lists the names of the exported primops. So we can't change the set after these modules have been compiled. (well we can, but odd things will happen).
So I think GHC.Prim should always export the full set of primops.
It is OK for compilation to fail if the source code mentions an unsupported primop.
What about unfoldings?
We cannot have compilation failing if an unsupported primop gets inlined into the current module, that is a non-deterministic compilation failure.
So then we have two options:
1) disable an unfolding if it contains an unsupported primop 2) implement unsupported primops via fallback C functions
Both options lead to performance problems, so we want the compiler to warn if this happens. But we cannot fail the compilation.
If we do (2), then we don't have to make it an error to use unsupported primops directly, but it should at least be a warning.
Fallbacks are reasonably easy to implement I think: gcc provides generic vector operations that compile on any target (if I'm understanding the docs correctly).
I suppose I don't mind whether we do (1) or (2).
Cheers, Simon
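As a reminder of why the export set is baked in: user code reaches the primops through re-exporting modules such as GHC.Exts, and the .hi files for those modules list every exported name. A small self-contained example follows; nothing in it is from the branch.

    {-# LANGUAGE MagicHash #-}
    -- Self-contained example, not from the thread: primops arrive via
    -- re-exports like GHC.Exts, whose interface file records the full set
    -- of exported names once base has been compiled.
    module PrimopReexport (addInt) where

    import GHC.Exts (Int (I#), (+#))

    addInt :: Int -> Int -> Int
    addInt (I# x) (I# y) = I# (x +# y)

If the exports of GHC.Prim varied with DynFlags, imports like this (and the interface files behind them) would no longer be stable across compilations, which is the difficulty described above.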
To answer Johan's question from a separate email, LLVM is supposed to lower vector instructions, but I have had it error out in odd ways on certain platforms. The example I recall was projecting an element from a vector using a non-constant index. So perhaps we can just rely on LLVM to do the lowering, solving the inlining problem. I think users will still want to test the various CPP defines to see whether or not, for example, AVX instructions are really available, and provide alternate implementations of low-level functions in case only SSE is available.

I think the proposal is then that there will be a set of extra primops available when compiling with LLVM, but that's it. We will provide short-vector primops of multiple widths on all platforms, but some may not produce efficient code. The user who cares can test the CPP defines __SSE__, __AVX__, etc.

Currently, for these extra primops to be available, the base libraries must also be compiled with LLVM---in particular ghc-prim, due to the fact that GHC.PrimopWrappers lives there. Is that acceptable?

There are still interoperability problems if we allow LLVM to perform lowering. Turning on AVX code generation will change the calling convention. With -mavx, 256-bit wide vectors will be passed in the ymm* registers. Without -mavx, this obviously won't happen. How should we deal with that?

Also, inlining code from an LLVM-compiled module will cause an error in a native-back-end-compiled module if any LLVM-only primops show up in the unfolding. The error will tell the user to use -fllvm. Is this acceptable?

Geoff
1) What mechanism do we add to GHC to make the set of available primops be a function of DynFlags? Right now we have a llvm_only attribute in compiler/prelude/primops.txt.pp so that the SSE primops are only available when using the LLVM back end. This is a stopgap measure and not correct. What's the right way to do it? How do we then communicate to the user which primops are available? -msse, and thus __SSE__, doesn't mean that the SSE primops are available, because we might be using the native back end. Does the user have to test __SSE__ and __GLASGOW_HASKELL_LLVM__ to know that the SSE primops are available? That's not a good solution, and it certainly doesn't scale. Note that when I say "user" I mean the person who writes the Multi type family instances---even in the current ifdef purgatory situation, most users can use the Multi type family without worrying about ifdefs.
2) What do we do about unfoldings? As a straw-man proposal, we could find a subset of DynFlags that uniquely determines the set of available primops, and then disable unfoldings that come from a module that was compiled with incompatible DynFlags. Once we solve (1), we could (I think) straightforwardly implement this.

On Tue, Feb 5, 2013 at 12:56 AM, Johan Tibell
On Mon, Feb 4, 2013 at 3:19 PM, Geoffrey Mainland wrote:
What would a sensible fallback be for AVX instructions? What should we fall back on when the LLVM backend is not being used?
Depends on the instruction. A 256-bit multiply could be replaced by N multiplies etc. For popcount we have a little bit of C code in ghc-prim that we use if SSE 4.2 isn't enabled. An alternative is to emit some different assembly in e.g. the x86-64 backend if AVX isn't enabled.
The widest registers currently are 512 bits, on the Intel Phi. AVX is designed to handle 1024-bit-wide registers (there's an unused bit in the VEX prefix). Alexander

1) Awesome
2) Got it, yeah that piece is important.
I just took some time to look through what's currently there (on the wiki at http://hackage.haskell.org/trac/ghc/wiki/SIMD/Design), and it looks like that'll cover my immediate needs quite nicely for now!
(I have a reflex to jump towards "ALL the THINGS" in engineering)
I'm still monotonically working on those tools. I haven't had the time to play with the SIMD branch yet (vagaries of time delimited by doing consulting for income). Once 7.8 is out in at least RC status, I'll have the bandwidth to start playing with the SIMD primops properly and try out some comparative benching.
Again: really exciting stuff, and I'm looking forward to using it soon!
-Carter
On Mon, Feb 4, 2013 at 5:09 PM, Geoffrey Mainland
On 02/04/2013 09:34 PM, Carter Schonwald wrote:
I'm really excited to see this merged in! Props to all involved.
question 1: Will this be included in the upcoming 7.8 release?
Yes, that's the plan!
question 2: I see that some of the useful (albeit specialized) SSE primops aren't included, though it looks like adding them (at least for platforms that support them) would be largely mechanical. If adding those primops is something GHC HQ would welcome (setting aside the whole SSE2-versus-full-AVX discussion), I'm more than happy to spend some time turning the crank to add them.
I'd like to figure out how to properly support having the set of available primops depend on the dynamic flags before adding too much more. I'll be speaking to Simon PJ about it tomorrow.
Do you have specific needs for any missing primops? If so, I'd like to know---customers are good :)
We talked a while ago about you possibly cooking up some sample programs that needed SSE instructions. Have there been any recent developments?
Thanks, Geoff
thanks -Carter Schonwald
On Sat, Feb 2, 2013 at 4:46 AM, Geoffrey Mainland wrote:
On 02/02/2013 09:37 AM, Karel Gardas wrote:
On 02/ 1/13 09:19 AM, Geoffrey Mainland wrote:
As an aside, what's the proper way for me to test the ARM cross-compilation support? I'm afraid my patches may break things there.
I saw you've merged your changes into mainline, so I did a build of GHC HEAD on my arm/linux box, and it went fine -- so you've not broken anything, at least from the build perspective.
Thanks, Karel
Thanks for the confirmation. I followed the instructions for building the Raspberry Pi cross GHC and tested the simd branch before I merged, but I'm glad to know I didn't break anything obvious for you either!
Geoff
participants (8)
- Alexander Kjeldaas
- Bryan O'Sullivan
- Carter Schonwald
- David Terei
- Geoffrey Mainland
- Johan Tibell
- Karel Gardas
- Simon Marlow