
All,

I've been toying with the SSE code generation in GHC 7.7 and Geoffrey Mainland's work to integrate this into the 'vector' library in order to generate SIMD code from high-level Haskell code.

While working with this, I wrote some simple code for testing purposes, then compiled it into LLVM IR and x86_64 assembly in order to figure out how 'good' the resulting code would be.

First and foremost: I'm really impressed. Whilst there's most certainly room for improvement (one item is touched upon in this mail, though I also noticed unnecessary constant memory reads inside a tight loop), the initial results look very promising, especially taking into account how high-level the source code is. This is pretty amazing!

As an example, here's 'test.hs':

    {-# OPTIONS_GHC -fllvm -O3 -optlo-O3 -optlc-O=3 -funbox-strict-fields #-}
    module Test (sum) where

    import Prelude hiding (sum)

    import Data.Int (Int32)

    import Data.Vector.Unboxed (Vector)
    import qualified Data.Vector.Unboxed as U

    sum :: Vector Int32 -> Int32
    sum v = U.mfold' (+) (+) 0 v

When compiling this into assembly (compiler/library version details at the end of this message), the 'sum' function yields (among other things) this code:

    .LBB2_3:                                # %c1C0
                                            # =>This Inner Loop Header: Depth=1
        prefetcht0 (%rsi)
        movdqu  -1536(%rsi), %xmm1
        paddd   %xmm1, %xmm0
        addq    $16, %rsi
        addq    $4, %rcx
        cmpq    %rdx, %rcx
        jl      .LBB2_3

The full LLVM IR and assembler output are attached to this message.

Whilst this is a nice and tight loop, I noticed the use of 'movdqu', which is the SSE instruction for memory accesses that are not 128-bit aligned. For aligned memory, 'movdqa' can be used instead, and this can have a major performance impact.

Whilst I understand why this code is currently generated as-is (also for other sample inputs), I wondered whether there are plans/approaches to tackle this. In some cases (e.g. in 'sum') this could be done by using scalar calculations from the beginning of the vector up to an aligned boundary, then using aligned accesses for the bulk, and handling the tail using scalars again, but I assume OTOH that's not trivial when multiple 'source' vectors are used in the calculation.

This might become even more complex when using AVX code, which needs 256-bit alignment.

Whilst I can't propose an out-of-the-box solution, I'd like to point at the 'vector-simd' code [1] I wrote some months ago, which might suggest some ideas. In this package, I created an unboxed vector-like type whose alignment is tracked at the type level, and functions which consume a vector declare the minimal alignment they require. As such, vectors can be allocated at the minimal alignment they're required to have, throughout all code using them.

As an example, if I'd use this code (off the top of my head):

    sseFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A16 o2)
           => Vector o1 a -> Vector o2 a
    sseFoo = undefined

    avxFoo :: (Storable a, AlignedToAtLeast A32 o1, AlignedToAtLeast A32 o2, AlignedToAtLeast A32 o3)
           => Vector o1 a -> Vector o2 a -> Vector o3 a
    avxFoo = undefined

the type of

    combinedFoo v = avxFoo sv sv
      where
        sv = sseFoo v

would automagically be

    combinedFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A32 o2)
                => Vector o1 a -> Vector o2 a

and when using this

    v1 = combinedFoo (Vector.fromList [1 :: Int32, 2, 3, 4, 5, 6, 7, 8])

the allocated argument vector (the result of Vector.fromList) will be 16-byte aligned, as expected/required for the SSE function to work with aligned loads internally (assuming no unaligned slices are supported, etc.), whilst the intermediate result of 'sseFoo' ('sv') will be 32-byte aligned, as required by 'avxFoo'.

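To make the scalar-prefix / aligned-bulk / scalar-tail idea for 'sum' mentioned earlier a bit more concrete, here's a rough model of the index arithmetic in plain Haskell. This is only a sketch: 'peel' and 'peeledSum' are names I made up for illustration and are not part of GHC, 'vector' or 'vector-simd'; a list stands in for the memory at a hypothetical byte address, and I assume the address is a multiple of the element size (otherwise an aligned boundary is never reached).

```haskell
import Data.Int (Int32)
import Data.List (foldl')

-- | Split n elements of elemSize bytes each, starting at byte address addr,
-- into (scalar prefix, aligned bulk, scalar tail) element counts for a SIMD
-- register of align bytes.
-- Assumption: addr is a multiple of elemSize, and elemSize divides align.
peel :: Int -> Int -> Int -> Int -> (Int, Int, Int)
peel align elemSize addr n = (h, b, t)
  where
    lanes    = align `div` elemSize                     -- elements per register
    padBytes = (align - addr `mod` align) `mod` align   -- bytes up to the next boundary
    h        = min n (padBytes `div` elemSize)          -- elements handled as scalars first
    b        = ((n - h) `div` lanes) * lanes            -- elements in whole, aligned registers
    t        = n - h - b                                -- leftover scalar elements

-- | Model of the peeled SSE loop for Int32 (16-byte registers, 4 lanes).
-- The list plays the role of the memory starting at the hypothetical addr.
peeledSum :: Int -> [Int32] -> Int32
peeledSum addr xs = foldl' (+) 0 prefix + sum4 (go (0, 0, 0, 0) bulk) + foldl' (+) 0 rest
  where
    (h, b, _)       = peel 16 4 addr (length xs)
    (prefix, more)  = splitAt h xs
    (bulk, rest)    = splitAt b more
    -- each step models one aligned 128-bit load followed by a 'paddd'
    go acc (x0 : x1 : x2 : x3 : ys) = go (add4 acc (x0, x1, x2, x3)) ys
    go acc _                        = acc
    add4 (a0, a1, a2, a3) (c0, c1, c2, c3) = (a0 + c0, a1 + c1, a2 + c2, a3 + c3)
    sum4 (a0, a1, a2, a3) = a0 + a1 + a2 + a3
```

Since Int32 addition wraps and is associative and commutative, the peeled result equals the plain sum whatever the starting address is. With two or more source vectors the prefixes generally differ per vector, which is exactly where this simple scheme stops being sufficient.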
Attached: test.ll and test.s, the compilation results of test.hs using

    $ ghc-7.7.20130302 -keep-llvm-files -package-db=cabal-dev/packages-7.7.20130302.conf -fforce-recomp -S test.hs

GHC from HEAD/master, compiled on my Fedora 18 system using the system LLVM (3.1), 'primitive' 8aef578fa5e7fb9fac3eac17336b722cbae2f921 from git://github.com/mainland/primitive.git and 'vector' e1a6c403bcca07b4c8121753daf120d30dedb1b0 from git://github.com/mainland/vector.git

Nicolas

[1] https://github.com/NicolasT/vector-simd