
On 03/10/2013 09:52 PM, Nicolas Trangez wrote:
All,
I've been toying with the SSE code generation in GHC 7.7 and Geoffrey Mainland's work to integrate this into the 'vector' library in order to generate SIMD code from high-level Haskell code.
While working with this, I wrote some simple code for testing purposes, then compiled this into LLVM IR and x86_64 assembly form in order to figure out how 'good' the resulting code would be.
First and foremost: I'm really impressed. Whilst there's most certainly room for improvement (one of them touched in this mail, though I also noticed unnecessary constant memory reads inside a tight loop), the initial results look very promising, especially taking into account how high-level the source code is. This is pretty amazing!
As an example, here's 'test.hs':
{-# OPTIONS_GHC -fllvm -O3 -optlo-O3 -optlc-O=3 -funbox-strict-fields #-} module Test (sum) where
import Prelude hiding (sum) import Data.Int (Int32) import Data.Vector.Unboxed (Vector) import qualified Data.Vector.Unboxed as U
sum :: Vector Int32 -> Int32 sum v = U.mfold' (+) (+) 0 v
When compiling this into assembly (compiler/library version details at the end of this message), the 'sum' function yields (among other things) this code:
.LBB2_3: # %c1C0 # =>This Inner Loop Header: Depth=1 prefetcht0 (%rsi) movdqu -1536(%rsi), %xmm1 paddd %xmm1, %xmm0 addq $16, %rsi addq $4, %rcx cmpq %rdx, %rcx jl .LBB2_3
The full LLVM IR and assembler output are attached to this message.
Whilst this is a nice and tight loop, I noticed the use of 'movdqu', which is used for non-128bit aligned memory access in SSE code. For aligned memory, 'movdqa' can be used, and this can have a major performance impact.
Whilst I understand why this code is currently generated as-is (also in other sample inputs), I wondered whether there are plans/approaches to tackle this. In some cases (e.g. in 'sum') this could be by using the scalar calculation at the beginning of the vector up until an aligned boundary, then use aligned access and handle the tail using scalars again, but I assume OTOH that's not trivial when multiple 'source' vectors are used in the calculation.
This might even become more complex when using AVX code, which needs 256bit alignments.
Whilst I can't propose an out-of-the-box solution, I'd like to point at the 'vector-simd' code [1] I wrote some months ago, which might propose some ideas. In this package, I created an unboxed vector-like type whose alignment is tracked at type level, and functions which consume a vector define the minimal required alignment. As such, vectors can be allocated at the minimal alignment they're required to be, throughout all code using them.
As an example, if I'd use this code (OTOH):
sseFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A16 o2) => Vector o1 a -> Vector o2 a sseFoo = undefined
avxFoo :: (Storable a, AlignedToAtLeast A32 o1, AlignedToAtLeast A32 o2, AlignedToAtLeast A32 o3) => Vector o1 a -> Vector o2 a -> Vector o3 a avxFoo = undefined
the type of
combinedFoo v = avxFoo sv sv where sv = sseFoo v
would automagically be
combinedFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A32 o2) => Vector o1 a -> Vector o2 a
and when using this
v1 = combinedFoo (Vector.fromList [1 :: Int32, 2, 3, 4, 5, 6, 7, 8])
the allocated argument vector (result of Vector.fromList) will be 16byte-aligned as expected/required for the SSE function to work with unaligned loads internally (assuming no unaligned slices are supported, etc), whilst the intermediate result of 'sseFoo' ('sv') will be 32-byte aligned as required by 'avxFoo'.
Attached: test.ll and test.s, compilation results of test.hs using
$ ghc-7.7.20130302 -keep-llvm-files -package-db=cabal-dev/packages-7.7.20130302.conf -fforce-recomp -S test.hs
GHC from HEAD/master compiled on my Fedora 18 system using system LLVM (3.1), 'primitive' 8aef578fa5e7fb9fac3eac17336b722cbae2f921 from git://github.com/mainland/primitive.git and 'vector' e1a6c403bcca07b4c8121753daf120d30dedb1b0 from git://github.com/mainland/vector.git
Nicolas
Hi Nicolas, Have you read our paper about the SIMD work? It's available here: https://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/haskell-be... The paper describes the issues involved with integrated SIMD instructions with the vector fusion framework. There are two primary issues with alignment: stack alignment and heap alignment. We cannot rely on the stack being properly aligned for AVX spills on any platform, and LLVM's stack fixup code does not play well with GHC, so we *rewrite* all AVX spill instructions to their unaligned counterparts. On Win32 we must do the same for SSE. Unboxed vectors are allocated by GHC, and it does not align memory on 16-byte boundaries, so our first cut at SSE intrinsics simply used unaligned accesses. Obviously with ForeignPtr's we can control alignment and potentially use the aligned variants of SSE instructions, but this will almost double the number of primops. One could imagine extending our fusion framework to transition to aligned move instructions. Finally, LLVM 3.2 does not work with GHC. This means we cannot yet take advantage of its new vectorization optimizations, which is a shame. So, four projects for you or anyone else who is interested, in rough dependency order: 1) Get LLVM 3.2 working with GHC's LLVM back end. 2) Fix the stack alignment issue with LLVM. This will likely require a patch to LLVM. 3) Add support for aligned move primops. 4) Extend the current SIMD fusion framework to handle transitioning to aligned move instructions. As an alternative, only use aligned move instructions on memory that we know is aligned. These are all on my todo list, but my plate is quite full at the moment. Geoff