SIMD/SSE support & alignment

10 Mar 2013

All,

I've been toying with the SSE code generation in GHC 7.7 and Geoffrey
Mainland's work to integrate this into the 'vector' library in order to
generate SIMD code from high-level Haskell code.

While working with this, I wrote some simple code for testing purposes,
then compiled this into LLVM IR and x86_64 assembly form in order to
figure out how 'good' the resulting code would be.

First and foremost: I'm really impressed. Whilst there's most certainly
room for improvement (one area is touched upon in this mail, though I
also noticed unnecessary constant memory reads inside a tight loop), the
initial results look very promising, especially taking into account how
high-level the source code is. This is pretty amazing!

As an example, here's 'test.hs':

{-# OPTIONS_GHC -fllvm -O3 -optlo-O3 -optlc-O=3 -funbox-strict-fields #-}
module Test (sum) where

import Prelude hiding (sum)
import Data.Int (Int32)
import Data.Vector.Unboxed (Vector)
import qualified Data.Vector.Unboxed as U

sum :: Vector Int32 -> Int32
sum v = U.mfold' (+) (+) 0 v

When compiling this into assembly (compiler/library version details at
the end of this message), the 'sum' function yields (among other things)
this code:

.LBB2_3:                                # %c1C0
                                        # =>This Inner Loop Header: Depth=1
	prefetcht0	(%rsi)
	movdqu	-1536(%rsi), %xmm1
	paddd	%xmm1, %xmm0
	addq	$16, %rsi
	addq	$4, %rcx
	cmpq	%rdx, %rcx
	jl	.LBB2_3

The full LLVM IR and assembler output are attached to this message.

Whilst this is a nice and tight loop, I noticed the use of 'movdqu',
the SSE instruction for memory accesses which aren't guaranteed to be
128-bit (16-byte) aligned. For aligned memory, 'movdqa' can be used, and
this can have a major performance impact.

Whilst I understand why this code is currently generated as-is (also for
other sample inputs), I wondered whether there are plans/approaches to
tackle this. In some cases (e.g. in 'sum') this could be done by using
scalar calculations at the beginning of the vector up to an aligned
boundary, then using aligned accesses for the bulk and handling the tail
using scalars again. OTOH, I assume that's not trivial when multiple
'source' vectors are used in the calculation, as these need not share
the same misalignment.
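To make the head/body/tail idea a bit more concrete, here's a minimal
sketch in plain Haskell. It only shows the index arithmetic: the "aligned
body" loop is ordinary scalar code standing in for what would be
'movdqa' + 'paddd' in generated code. All names (peelCount, splitRanges,
sumPeeled, sumListFrom) are made up for this illustration, and it assumes
the pointer is at least 4-byte aligned, as an Int32 array normally is.

```haskell
import Data.Int (Int32)
import Foreign.Marshal.Array (advancePtr, withArrayLen)
import Foreign.Ptr (Ptr, alignPtr, minusPtr)
import Foreign.Storable (peekElemOff, sizeOf)

-- | Number of leading elements to handle with scalar code before the
-- pointer reaches the requested byte alignment (capped at the length).
-- Assumes 'p' is already aligned to the element size.
peelCount :: Int -> Ptr Int32 -> Int -> Int
peelCount align p len = min len (bytes `div` sizeOf (0 :: Int32))
  where
    bytes = alignPtr p align `minusPtr` p

-- | Lengths of the scalar head, the aligned body (a whole number of
-- 'lanes'-element blocks) and the scalar tail.
splitRanges :: Int -> Int -> Ptr Int32 -> Int -> (Int, Int, Int)
splitRanges align lanes p len = (peel, body, len - peel - body)
  where
    peel = peelCount align p len
    body = ((len - peel) `div` lanes) * lanes

-- | Sum 'len' elements: scalar head, 4-element blocks starting at a
-- 16-byte boundary (stand-in for aligned SSE loads), scalar tail.
sumPeeled :: Ptr Int32 -> Int -> IO Int32
sumPeeled p len = do
  let (peel, body, _) = splitRanges 16 4 p len
      bodyEnd = peel + body
  h <- scalarSum 0 peel
  b <- blockSum peel bodyEnd
  t <- scalarSum bodyEnd len
  return (h + b + t)
  where
    scalarSum :: Int -> Int -> IO Int32
    scalarSum from to = go from 0
      where
        go i acc
          | i >= to = return acc
          | otherwise = do
              x <- peekElemOff p i
              go (i + 1) $! acc + x

    blockSum :: Int -> Int -> IO Int32
    blockSum from to = go from 0
      where
        go i acc
          | i >= to = return acc
          | otherwise = do
              xs <- mapM (peekElemOff p) [i .. i + 3]
              go (i + 4) $! acc + sum xs

-- | Marshal a list to a temporary array and sum everything from element
-- offset 'k' on, so different offsets exercise different alignments.
sumListFrom :: Int -> [Int32] -> IO Int32
sumListFrom k xs = withArrayLen xs $ \n p ->
  sumPeeled (p `advancePtr` k) (max 0 (n - k))
```

With a single input this split is easy, since there's only one pointer to
bring to a boundary. With two inputs at different offsets, the same
'peel' can't align both, which is the complication mentioned above.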

This might become even more complex when using AVX code, which needs
256-bit (32-byte) alignment.

Whilst I can't propose an out-of-the-box solution, I'd like to point at
the 'vector-simd' code [1] I wrote some months ago, which might offer
some ideas. In this package, I created an unboxed vector-like type whose
alignment is tracked at type level, and functions which consume a vector
declare the minimal alignment they require. As such, every vector can be
allocated at the strictest alignment demanded by any of the code using
it.

As an example, if I'd use this code (off the top of my head):

sseFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A16 o2)
       => Vector o1 a -> Vector o2 a
sseFoo = undefined

avxFoo :: (Storable a, AlignedToAtLeast A32 o1, AlignedToAtLeast A32 o2,
           AlignedToAtLeast A32 o3)
       => Vector o1 a -> Vector o2 a -> Vector o3 a
avxFoo = undefined

the type of

combinedFoo v = avxFoo sv sv
  where
    sv = sseFoo v

would automagically be

combinedFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A32 o2)
            => Vector o1 a -> Vector o2 a

and when using this

v1 = combinedFoo (Vector.fromList [1 :: Int32, 2, 3, 4, 5, 6, 7, 8])

the allocated argument vector (result of Vector.fromList) will be
16-byte aligned, as required for the SSE function to work with aligned
loads internally (assuming no unaligned slices are supported, etc.),
whilst the intermediate result of 'sseFoo' ('sv') will be 32-byte
aligned as required by 'avxFoo'.
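For readers who don't want to dig through the repository, here's a
self-contained sketch of how such type-level alignment tracking could
look. This is NOT the actual vector-simd API: the tag names A16, A32 and
the class name AlignedToAtLeast are taken from the snippets above, but
their definitions here, as well as Alignment, Vec, fromListAligned,
sseKernelOk and avxKernelOk, are assumptions made up for illustration.
The 'kernels' only check that the buffer address has the alignment their
constraint promises.

```haskell
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Data.Int (Int32)
import Data.Proxy (Proxy (..))
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrAlignedBytes,
                           withForeignPtr)
import Foreign.Ptr (alignPtr)
import Foreign.Storable (Storable, pokeElemOff, sizeOf)

-- Hypothetical alignment tags (phantom types, no values).
data A16
data A32

-- | Maps a tag to its alignment in bytes.
class Alignment o where
  alignmentOf :: Proxy o -> Int

instance Alignment A16 where alignmentOf _ = 16
instance Alignment A32 where alignmentOf _ = 32

-- | 'AlignedToAtLeast n o' holds when tag 'o' guarantees at least the
-- alignment of tag 'n'. Note there is no instance for 'A32 A16'.
class Alignment o => AlignedToAtLeast n o
instance AlignedToAtLeast A16 A16
instance AlignedToAtLeast A16 A32
instance AlignedToAtLeast A32 A32

-- | Stand-in vector: length plus a buffer whose alignment is recorded
-- in the phantom parameter 'o'.
data Vec o a = Vec Int (ForeignPtr a)

-- | Allocate at whatever alignment the (inferred) tag 'o' demands.
fromListAligned :: forall o a. (Alignment o, Storable a)
                => [a] -> IO (Vec o a)
fromListAligned xs = do
  let n = length xs
      bytes = n * sizeOf (undefined :: a)
  fp <- mallocForeignPtrAlignedBytes bytes (alignmentOf (Proxy :: Proxy o))
  withForeignPtr fp $ \p -> mapM_ (uncurry (pokeElemOff p)) (zip [0 ..] xs)
  return (Vec n fp)

-- | True when the buffer address is a multiple of 'k' bytes.
alignedTo :: Int -> Vec o a -> IO Bool
alignedTo k (Vec _ fp) = withForeignPtr fp $ \p -> return (alignPtr p k == p)

-- | Stand-ins for SSE/AVX consumers; the constraint is the contract.
sseKernelOk :: AlignedToAtLeast A16 o => Vec o a -> IO Bool
sseKernelOk = alignedTo 16

avxKernelOk :: AlignedToAtLeast A32 o => Vec o a -> IO Bool
avxKernelOk = alignedTo 32

-- | The same list allocated under both tags, checked by each consumer.
example :: IO (Bool, Bool, Bool)
example = do
  v16 <- fromListAligned [1 .. 8] :: IO (Vec A16 Int32)
  v32 <- fromListAligned [1 .. 8] :: IO (Vec A32 Int32)
  (,,) <$> sseKernelOk v16 <*> sseKernelOk v32 <*> avxKernelOk v32
```

In this sketch 'avxKernelOk v16' is rejected by the type checker, since
there's no 'AlignedToAtLeast A32 A16' instance. The tags are written out
explicitly in 'example'; having them chosen automatically from the
constraints, as described above, needs more machinery than shown here.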

Attached: test.ll and test.s, compilation results of test.hs using

$ ghc-7.7.20130302 -keep-llvm-files \
    -package-db=cabal-dev/packages-7.7.20130302.conf -fforce-recomp -S \
    test.hs

GHC from HEAD/master compiled on my Fedora 18 system using system LLVM
(3.1), 'primitive' 8aef578fa5e7fb9fac3eac17336b722cbae2f921 from
git://github.com/mainland/primitive.git and 'vector'
e1a6c403bcca07b4c8121753daf120d30dedb1b0 from
git://github.com/mainland/vector.git

Nicolas

[1] https://github.com/NicolasT/vector-simd

Nicolas Trangez