
On 02/04/2013 11:56 PM, Johan Tibell wrote:
On Mon, Feb 4, 2013 at 3:19 PM, Geoffrey Mainland wrote:
What would a sensible fallback be for AVX instructions? What should we fall back on when the LLVM backend is not being used?
Depends on the instruction. A 256-bit multiply could be replaced by N multiplies etc. For popcount we have a little bit of C code in ghc-prim that we use if SSE 4.2 isn't enabled. An alternative is to emit some different assembly in e.g. the x86-64 backend if AVX isn't enabled.
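To make the popcount case concrete, here is a minimal sketch of what a software fallback can look like, written in Haskell rather than C. It is an illustration only, not the code in ghc-prim: the name popCountSoft and the choice of the usual bit-parallel counting trick are assumptions made for this example.

```haskell
import Data.Bits (popCount, shiftR, (.&.))
import Data.Word (Word64)

-- | Portable population count using the well-known bit-parallel (SWAR) trick.
-- Hypothetical stand-in for a software fallback; NOT the actual ghc-prim code.
popCountSoft :: Word64 -> Int
popCountSoft x0 = fromIntegral ((x3 * 0x0101010101010101) `shiftR` 56)
  where
    -- Count bits in each 2-bit field.
    x1 = x0 - ((x0 `shiftR` 1) .&. 0x5555555555555555)
    -- Sum adjacent 2-bit counts into 4-bit fields.
    x2 = (x1 .&. 0x3333333333333333) + ((x1 `shiftR` 2) .&. 0x3333333333333333)
    -- Sum adjacent 4-bit counts into bytes.
    x3 = (x2 + (x2 `shiftR` 4)) .&. 0x0f0f0f0f0f0f0f0f

-- | Check the fallback against the built-in popCount from Data.Bits.
agreesWithBuiltin :: Word64 -> Bool
agreesWithBuiltin x = popCountSoft x == popCount x
```

The fallback costs a handful of extra arithmetic instructions per call and nothing else, which is why this kind of substitution is relatively uncontroversial.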
Maybe we could desugar AVX instructions to SSE instructions on platforms that support SSE but not AVX, but in practice people would then #ifdef anyway and just use SSE if AVX weren't available.
I don't follow here. If you conditionally emitted different instructions in the backends depending on which -m flags are passed to GHC, why would people #ifdef?
I think you are suggesting that the user should always use 256-bit short-vector instructions, and that on platforms where AVX is not available, this would fall back to an implementation that performed multiple SSE instructions for each 256-bit vector instruction---and used multiple XMM registers to hold each 256-bit vector value (or spilled).

Anyone using low-level primops should only do so if they really want low-level control. The most efficient SSE implementation of a function is not going to be whatever implementation falls out of a desugaring of generic 256-bit short-vector primitives. Therefore, I suspect that anyone using low-level vector primops like this will #ifdef and provide two implementations---one for SSE, one for AVX. Anyone who doesn't care about this level of detail should use a higher-level interface---which we have already implemented---and which does not require any ifdefs.

People will #ifdef because they can provide better SSE implementations than GHC when AVX instructions are not available.

I am suggesting that we push the "ifdefs" into a library. The vast majority of programmers will never see the ifdefs, because they will use the library.

I think you are suggesting that we push the "ifdefs" into GHC. That way nobody will have a choice---they get whatever desugaring GHC gives them.

I understand your point of view---having primops that don't work everywhere is a real pain and aesthetically unpleasing---but I prefer exposing more low-level details in our primops even if it means a bit of unpleasantness once in a while. This does mean a tiny segment of programmers will have to deal with ifdefs, but I suspect that this tiny segment of programmers would prefer ifdefs to a lack of control.

If a population count operation translates to a few extra instructions, I don't think anyone will care. If a body of code performing short-vector operations desugars to twice as many instructions that require twice as many registers, thereby resulting in a bunch of extra spills, it will matter.

Put differently, there is a more-or-less canonical desugaring of population count. For a given function using short-vector instructions of one width, there is not a canonical desugaring into a function using short-vector instructions of a lesser width.
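The cost of a mechanical desugaring can be pictured with a small model. The sketch below uses ordinary boxed data types as hypothetical stand-ins for 128-bit and 256-bit vectors (real primops would work on unboxed types such as DoubleX2# and DoubleX4#); all of the names here are invented for the illustration and none of this is what GHC does.

```haskell
-- Hypothetical model of a 128-bit vector of two doubles (one XMM register).
data DoubleX2 = DoubleX2 !Double !Double
  deriving (Eq, Show)

-- Hypothetical model of a 256-bit vector of four doubles (one YMM register).
data DoubleX4 = DoubleX4 !Double !Double !Double !Double
  deriving (Eq, Show)

-- A 256-bit value after desugaring: it now occupies two 128-bit registers.
data DoubleX4Split = DoubleX4Split !DoubleX2 !DoubleX2
  deriving (Eq, Show)

-- One 128-bit multiply.
timesX2 :: DoubleX2 -> DoubleX2 -> DoubleX2
timesX2 (DoubleX2 a b) (DoubleX2 c d) = DoubleX2 (a * c) (b * d)

-- One 256-bit multiply, as it would be issued with AVX.
timesX4 :: DoubleX4 -> DoubleX4 -> DoubleX4
timesX4 (DoubleX4 a b c d) (DoubleX4 e f g h) =
  DoubleX4 (a * e) (b * f) (c * g) (d * h)

-- The same multiply after desugaring: two 128-bit multiplies on four inputs.
timesX4Split :: DoubleX4Split -> DoubleX4Split -> DoubleX4Split
timesX4Split (DoubleX4Split lo1 hi1) (DoubleX4Split lo2 hi2) =
  DoubleX4Split (timesX2 lo1 lo2) (timesX2 hi1 hi2)

-- Conversions between the two views, used only to compare results.
splitX4 :: DoubleX4 -> DoubleX4Split
splitX4 (DoubleX4 a b c d) = DoubleX4Split (DoubleX2 a b) (DoubleX2 c d)

joinX4 :: DoubleX4Split -> DoubleX4
joinX4 (DoubleX4Split (DoubleX2 a b) (DoubleX2 c d)) = DoubleX4 a b c d
```

Both versions compute the same result, but every live 256-bit value in the split version needs two registers and every operation becomes two instructions, which is where the extra spills in a larger function body would come from.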
The current idea is to hide the #ifdefs in a library. Clients of the library would then get the "best" short-vector implementation available for their platform by using this library. Right now this library is a modified version of primitive, and I have modified versions of vector and DPH that use this version of the primitive library to generate SSE code.
You would still end up with a GHC.Exts that exports a different API depending on which flags (e.g. -m<something>) are passed to GHC. Couldn't you use ghc-prim for your fallbacks and have GHC.Exts.yourPrimOp use either those fallbacks or the AVX instructions?
This is basically what I've implemented, except there is a Multi type family that "picks" the appropriate short-vector representation for a type, e.g., DoubleX2# for Double on machines with SSE, DoubleX4# for Double on machines with AVX, and an accompanying set of short-vector operations.

We have a concrete design and implementation---take a look at the primitive, vector, and dph packages on my github page (http://github.com/mainland). I would be very happy to discuss any concrete alternative design. We also have a paper with some performance measurements (http://www.eecs.harvard.edu/~mainland/publications/mainland12simd.pdf). I would not be thrilled with a design that resulted in significantly worse benchmarks.

Geoff
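As a rough illustration of the Multi idea described above, the sketch below hides a single #if inside a library-style class so that client code never mentions a vector width. It is a simplified model and not the code in the modified primitive package: it uses an associated data family and boxed fields where the real design uses unboxed vector types, and the macro USE_AVX, the class MultiType, and all method names are assumptions made for this example.

```haskell
{-# LANGUAGE CPP, TypeFamilies #-}

-- Hypothetical class: Multi a is the widest short-vector representation
-- of a that the target supports.
class MultiType a where
  data Multi a
  -- Number of lanes in the chosen representation.
  multiplicity   :: Multi a -> Int
  -- Fill every lane with the same value.
  multireplicate :: a -> Multi a
  -- Lane-wise binary operation.
  multizipWith   :: (a -> a -> a) -> Multi a -> Multi a -> Multi a
  -- Reduce the lanes from left to right.
  multifold      :: (b -> a -> b) -> b -> Multi a -> b

-- The only conditional compilation lives here, inside the library.
-- USE_AVX is a made-up macro standing in for a real platform check.
instance MultiType Double where
#if defined(USE_AVX)
  -- Models DoubleX4# (four lanes).
  data Multi Double = MultiDouble !Double !Double !Double !Double
  multiplicity _ = 4
  multireplicate x = MultiDouble x x x x
  multizipWith f (MultiDouble a b c d) (MultiDouble e g h i) =
    MultiDouble (f a e) (f b g) (f c h) (f d i)
  multifold f z (MultiDouble a b c d) = f (f (f (f z a) b) c) d
#else
  -- Models DoubleX2# (two lanes).
  data Multi Double = MultiDouble !Double !Double
  multiplicity _ = 2
  multireplicate x = MultiDouble x x
  multizipWith f (MultiDouble a b) (MultiDouble c d) =
    MultiDouble (f a c) (f b d)
  multifold f z (MultiDouble a b) = f (f z a) b
#endif

-- Client code: no ifdefs and no mention of the vector width.
dotChunk :: Multi Double -> Multi Double -> Double
dotChunk xs ys = multifold (+) 0 (multizipWith (*) xs ys)
```

A caller such as dotChunk gets whichever width the library selected at build time, and anyone who wants hand-tuned SSE or AVX code can still bypass the class and write against the width-specific operations directly.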