Re: SIMD/SSE support & alignment

12 Mar 2013

      On 03/10/2013 09:52 PM, Nicolas Trangez wrote:
...
All,
I've been toying with the SSE code generation in GHC 7.7 and Geoffrey
Mainland's work to integrate this into the 'vector' library in order to
generate SIMD code from high-level Haskell code.
While working with this, I wrote some simple code for testing purposes,
then compiled this into LLVM IR and x86_64 assembly form in order to
figure out how 'good' the resulting code would be.
First and foremost: I'm really impressed. Whilst there's most certainly
room for improvement (one of them touched in this mail, though I also
noticed unnecessary constant memory reads inside a tight loop), the
initial results look very promising, especially taking into account how
high-level the source code is. This is pretty amazing!
As an example, here's 'test.hs':
{-# OPTIONS_GHC -fllvm -O3 -optlo-O3 -optlc-O=3 -funbox-strict-fields
#-}
module Test (sum) where
import Prelude hiding (sum)
import Data.Int (Int32)
import Data.Vector.Unboxed (Vector)
import qualified Data.Vector.Unboxed as U
sum :: Vector Int32 -> Int32
sum v = U.mfold' (+) (+) 0 v
When compiling this into assembly (compiler/library version details at
the end of this message), the 'sum' function yields (among other things)
this code:
.LBB2_3: # %c1C0
# =>This Inner Loop Header:
Depth=1
prefetcht0 (%rsi)
movdqu -1536(%rsi), %xmm1
paddd %xmm1, %xmm0
addq $16, %rsi
addq $4, %rcx
cmpq %rdx, %rcx
jl .LBB2_3
The full LLVM IR and assembler output are attached to this message.
Whilst this is a nice and tight loop, I noticed the use of 'movdqu',
which is used for non-128bit aligned memory access in SSE code. For
aligned memory, 'movdqa' can be used, and this can have a major
performance impact.
Whilst I understand why this code is currently generated as-is (also in
other sample inputs), I wondered whether there are plans/approaches to
tackle this. In some cases (e.g. in 'sum') this could be by using the
scalar calculation at the beginning of the vector up until an aligned
boundary, then use aligned access and handle the tail using scalars
again, but I assume OTOH that's not trivial when multiple 'source'
vectors are used in the calculation.
This might even become more complex when using AVX code, which needs
256bit alignments.
Whilst I can't propose an out-of-the-box solution, I'd like to point at
the 'vector-simd' code [1] I wrote some months ago, which might propose
some ideas. In this package, I created an unboxed vector-like type whose
alignment is tracked at type level, and functions which consume a vector
define the minimal required alignment. As such, vectors can be allocated
at the minimal alignment they're required to be, throughout all code
using them.
As an example, if I'd use this code (OTOH):
sseFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A16 o2)
=> Vector o1 a -> Vector o2 a
sseFoo = undefined
avxFoo :: (Storable a, AlignedToAtLeast A32 o1, AlignedToAtLeast A32 o2,
AlignedToAtLeast A32 o3) => Vector o1 a -> Vector o2 a -> Vector o3 a
avxFoo = undefined
the type of
combinedFoo v = avxFoo sv sv
where
sv = sseFoo v
would automagically be
combinedFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast
A32 o2) => Vector o1 a -> Vector o2 a
and when using this
v1 = combinedFoo (Vector.fromList [1 :: Int32, 2, 3, 4, 5, 6, 7, 8])
the allocated argument vector (result of Vector.fromList) will be
16byte-aligned as expected/required for the SSE function to work with
unaligned loads internally (assuming no unaligned slices are supported,
etc), whilst the intermediate result of 'sseFoo' ('sv') will be 32-byte
aligned as required by 'avxFoo'.
Attached: test.ll and test.s, compilation results of test.hs using
$ ghc-7.7.20130302 -keep-llvm-files
-package-db=cabal-dev/packages-7.7.20130302.conf -fforce-recomp -S
test.hs
GHC from HEAD/master compiled on my Fedora 18 system using system LLVM
(3.1), 'primitive' 8aef578fa5e7fb9fac3eac17336b722cbae2f921 from
git://github.com/mainland/primitive.git and 'vector'
e1a6c403bcca07b4c8121753daf120d30dedb1b0 from
git://github.com/mainland/vector.git
Nicolas
[1] https://github.com/NicolasT/vector-simd
Hi Nicolas,

Have you read our paper about the SIMD work? It's available here:

https://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/haskell-be...

The paper describes the issues involved with integrated SIMD
instructions with the vector fusion framework.

There are two primary issues with alignment: stack alignment and heap
alignment.

We cannot rely on the stack being properly aligned for AVX spills on any
platform, and LLVM's stack fixup code does not play well with GHC, so we
*rewrite* all AVX spill instructions to their unaligned counterparts. On
Win32 we must do the same for SSE.

Unboxed vectors are allocated by GHC, and it does not align memory on
16-byte boundaries, so our first cut at SSE intrinsics simply used
unaligned accesses. Obviously with ForeignPtr's we can control alignment
and potentially use the aligned variants of SSE instructions, but this
will almost double the number of primops. One could imagine extending
our fusion framework to transition to aligned move instructions.

Finally, LLVM 3.2 does not work with GHC. This means we cannot yet take
advantage of its new vectorization optimizations, which is a shame.

So, four projects for you or anyone else who is interested, in rough
dependency order:

1) Get LLVM 3.2 working with GHC's LLVM back end.

2) Fix the stack alignment issue with LLVM. This will likely require a
patch to LLVM.

3) Add support for aligned move primops.

4) Extend the current SIMD fusion framework to handle transitioning to
aligned move instructions. As an alternative, only use aligned move
instructions on memory that we know is aligned.

These are all on my todo list, but my plate is quite full at the moment.

Geoff

Re: SIMD/SSE support & alignment

Geoffrey Mainland