
All,

I've been toying with the SSE code generation in GHC 7.7 and Geoffrey Mainland's work to integrate this into the 'vector' library, in order to generate SIMD code from high-level Haskell code.

While working with this, I wrote some simple code for testing purposes, then compiled it into LLVM IR and x86_64 assembly in order to figure out how 'good' the resulting code would be.

First and foremost: I'm really impressed. Whilst there's most certainly room for improvement (one item is touched upon in this mail, though I also noticed unnecessary constant memory reads inside a tight loop), the initial results look very promising, especially taking into account how high-level the source code is. This is pretty amazing!

As an example, here's 'test.hs':

    {-# OPTIONS_GHC -fllvm -O3 -optlo-O3 -optlc-O=3 -funbox-strict-fields #-}
    module Test (sum) where

    import Prelude hiding (sum)
    import Data.Int (Int32)
    import Data.Vector.Unboxed (Vector)
    import qualified Data.Vector.Unboxed as U

    sum :: Vector Int32 -> Int32
    sum v = U.mfold' (+) (+) 0 v

When compiling this into assembly (compiler/library version details at the end of this message), the 'sum' function yields (among other things) this code:

    .LBB2_3:        # %c1C0
                    # =>This Inner Loop Header: Depth=1
        prefetcht0  (%rsi)
        movdqu      -1536(%rsi), %xmm1
        paddd       %xmm1, %xmm0
        addq        $16, %rsi
        addq        $4, %rcx
        cmpq        %rdx, %rcx
        jl          .LBB2_3

The full LLVM IR and assembler output are attached to this message.

Whilst this is a nice and tight loop, I noticed the use of 'movdqu', the SSE instruction for memory accesses that are not 128-bit aligned. For aligned memory, 'movdqa' can be used instead, and this can have a major performance impact.

Whilst I understand why this code is currently generated as-is (also for other sample inputs), I wondered whether there are plans/approaches to tackle this. In some cases (e.g. in 'sum') this could be done by performing the calculation on scalars at the beginning of the vector up until an aligned boundary, then using aligned accesses, and finally handling the tail with scalars again, but I assume OTOH that's not trivial when multiple 'source' vectors are used in the calculation. This might become more complex still with AVX code, which needs 256-bit alignment.
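To make that head/body/tail split concrete, here is a minimal sketch in plain Haskell. No SIMD is used: 'k' stands in for the element count up to the first 16-byte boundary (real code would compute it from the vector's base address), and the middle U.sum is where an aligned SIMD kernel would run.

    module PeelSketch where

    import Data.Int (Int32)
    import qualified Data.Vector.Unboxed as U

    -- Split the sum into a scalar prologue (up to the first aligned
    -- boundary), a body of whole 4-lane chunks, and a scalar epilogue.
    sumPeeled :: Int -> U.Vector Int32 -> Int32
    sumPeeled k v = headSum + bodySum + tailSum
      where
        n       = U.length v
        k'      = min k n
        body    = ((n - k') `quot` 4) * 4      -- whole 4 x Int32 chunks
        headSum = U.sum (U.take k' v)          -- scalar prologue
        bodySum = U.sum (U.slice k' body v)    -- movdqa territory
        tailSum = U.sum (U.drop (k' + body) v) -- scalar epilogue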
Whilst I can't propose an out-of-the-box solution, I'd like to point at the 'vector-simd' code [1] I wrote some months ago, which might provide some ideas. In this package, I created an unboxed vector-like type whose alignment is tracked at the type level, and functions which consume a vector declare the minimal alignment they require. As such, vectors can be allocated at the minimal alignment they're required to have, throughout all code using them.

As an example, if I'd use this code (OTOH):

    sseFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A16 o2)
           => Vector o1 a -> Vector o2 a
    sseFoo = undefined

    avxFoo :: (Storable a, AlignedToAtLeast A32 o1, AlignedToAtLeast A32 o2,
               AlignedToAtLeast A32 o3)
           => Vector o1 a -> Vector o2 a -> Vector o3 a
    avxFoo = undefined

the type of

    combinedFoo v = avxFoo sv sv
      where
        sv = sseFoo v

would automagically be

    combinedFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A32 o2)
                => Vector o1 a -> Vector o2 a

and when using this

    v1 = combinedFoo (Vector.fromList [1 :: Int32, 2, 3, 4, 5, 6, 7, 8])

the allocated argument vector (the result of Vector.fromList) will be 16-byte aligned, as expected/required for the SSE function to work with aligned loads internally (assuming no unaligned slices are supported, etc), whilst the intermediate result of 'sseFoo' ('sv') will be 32-byte aligned as required by 'avxFoo'.

Attached: test.ll and test.s, compilation results of test.hs using

    $ ghc-7.7.20130302 -keep-llvm-files -package-db=cabal-dev/packages-7.7.20130302.conf -fforce-recomp -S test.hs

GHC from HEAD/master compiled on my Fedora 18 system using system LLVM (3.1), 'primitive' 8aef578fa5e7fb9fac3eac17336b722cbae2f921 from git://github.com/mainland/primitive.git and 'vector' e1a6c403bcca07b4c8121753daf120d30dedb1b0 from git://github.com/mainland/vector.git

Nicolas

[1] https://github.com/NicolasT/vector-simd

On 03/10/2013 09:52 PM, Nicolas Trangez wrote:
> ... snip ...
Hi Nicolas,

Have you read our paper about the SIMD work? It's available here:

    https://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/haskell-be...

The paper describes the issues involved with integrating SIMD instructions into the vector fusion framework.

There are two primary issues with alignment: stack alignment and heap alignment.

We cannot rely on the stack being properly aligned for AVX spills on any platform, and LLVM's stack fixup code does not play well with GHC, so we *rewrite* all AVX spill instructions to their unaligned counterparts. On Win32 we must do the same for SSE.

Unboxed vectors are allocated by GHC, and it does not align memory on 16-byte boundaries, so our first cut at SSE intrinsics simply used unaligned accesses. Obviously with ForeignPtrs we can control alignment and potentially use the aligned variants of SSE instructions, but this will almost double the number of primops. One could imagine extending our fusion framework to transition to aligned move instructions.

Finally, LLVM 3.2 does not work with GHC. This means we cannot yet take advantage of its new vectorization optimizations, which is a shame.

So, four projects for you or anyone else who is interested, in rough dependency order:

1) Get LLVM 3.2 working with GHC's LLVM back end.

2) Fix the stack alignment issue with LLVM. This will likely require a patch to LLVM.

3) Add support for aligned move primops.

4) Extend the current SIMD fusion framework to handle transitioning to aligned move instructions. As an alternative, only use aligned move instructions on memory that we know is aligned.

These are all on my todo list, but my plate is quite full at the moment.

Geoff

On Tue, Mar 12, 2013 at 9:09 AM, Geoffrey Mainland wrote:
> On 03/10/2013 09:52 PM, Nicolas Trangez wrote:
>> ... snip ...
> ...
> 1) Get LLVM 3.2 working with GHC's LLVM back end.

This should work fine in HEAD on OS X and Linux at least. David sorted it out after I filed a lot of bugs and a few patches to fix it.

However, I believe at one point or another (before SIMD landed) I tried the new loop and basic-block vectorizers in 3.2 with HEAD. I used a simple loop like the one Nicolas posted - with the new code generator, the resulting Cmm is very compact, so I figured it ideal for vectorization. But I believe it made things slower. I don't remember the details, but I'll have a copy of HEAD worth testing soon...

> 2) Fix the stack alignment issue with LLVM. This will likely require a
> patch to LLVM.
>
> 3) Add support for aligned move primops.
>
> 4) Extend the current SIMD fusion framework to handle transitioning to
> aligned move instructions. As an alternative, only use aligned move
> instructions on memory that we know is aligned.
>
> These are all on my todo list, but my plate is quite full at the moment.
>
> Geoff

--
Regards,
Austin

On 03/12/2013 02:52 PM, Austin Seipp wrote:
> On Tue, Mar 12, 2013 at 9:09 AM, Geoffrey Mainland wrote:
>> ... snip ...
>
> This should work fine in HEAD on OS X and Linux at least. David sorted
> it out after I filed a lot of bugs and a few patches to fix it.
>
> However, I believe at one point or another (before SIMD landed) I
> tried the new loop and basic-block vectorizers in 3.2 with HEAD. I
> used a simple loop like the one Nicolas posted - with the new code
> generator, the resulting Cmm is very compact, so I figured it ideal
> for vectorization. But I believe it made things slower. I don't
> remember the details, but I'll have a copy of HEAD worth testing
> soon...

My stage2 compiler fails with LLVM 3.2 on Ubuntu 12.04 64-bit. If you know how to fix that, please tell me how.

> ... snip ...

On Tue, Mar 12, 2013 at 7:09 AM, Geoffrey Mainland wrote:
> Unboxed vectors are allocated by GHC, and it does not align memory on
> 16-byte boundaries, so our first cut at SSE intrinsics simply used
> unaligned accesses. Obviously with ForeignPtrs we can control alignment
> and potentially use the aligned variants of SSE instructions, but this
> will almost double the number of primops. One could imagine extending
> our fusion framework to transition to aligned move instructions.
When I implemented the memcpy primops I added an optional alignment parameter to each primop, rather than adding new primops. LLVM uses the same setup. Perhaps it could work for you?

-- Johan

On 03/12/2013 03:01 PM, Johan Tibell wrote:
> On Tue, Mar 12, 2013 at 7:09 AM, Geoffrey Mainland wrote:
>> ... snip ...
>
> When I implemented the memcpy primops I added an optional alignment
> parameter to each primop, rather than adding new primops. LLVM uses
> the same setup. Perhaps it could work for you?
>
> -- Johan
LLVM needs to know statically whether or not an SSE move is aligned---it can't be computed at runtime. I don't think passing an extra Int# argument (or whatever) to a primop is going to work.

Geoff

On Tue, Mar 12, 2013 at 8:08 AM, Geoffrey Mainland wrote:
> LLVM needs to know statically whether or not an SSE move is
> aligned---it can't be computed at runtime. I don't think passing an
> extra Int# argument (or whatever) to a primop is going to work.
Ah, but sometimes optimization exposes that information statically (i.e. by inlining). In my case I know Array# is word-aligned, and if someone uses memcpy with a 0 offset the compiler might spot this and use the right alignment (this is what happens today). Now, if e.g. a ByteArray# can never be allocated 16-byte aligned, this doesn't help us much.

On 03/12/2013 03:12 PM, Johan Tibell wrote:
> On Tue, Mar 12, 2013 at 8:08 AM, Geoffrey Mainland wrote:
>> ... snip ...
>
> Ah, but sometimes optimization exposes that information statically
> (i.e. by inlining). In my case I know Array# is word-aligned, and if
> someone uses memcpy with a 0 offset the compiler might spot this and
> use the right alignment (this is what happens today). Now, if e.g. a
> ByteArray# can never be allocated 16-byte aligned, this doesn't help
> us much.
So you are suggesting that if the inliner doesn't happen to expose the argument statically, then we should fall back to unaligned SSE move instructions?

On Tue, Mar 12, 2013 at 8:14 AM, Geoffrey Mainland wrote:
> So you are suggesting that if the inliner doesn't happen to expose the
> argument statically, then we should fall back to unaligned SSE move
> instructions?
I'm not necessarily suggesting anything. :) I'm just saying that there's a way to support both aligned and unaligned versions using one primop. We might still have to expose two versions to users (wrapping the primop) in case we don't believe the compiler will recover the alignment.
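For what it's worth, the shape of that idea in ordinary Haskell might look like the sketch below: one worker taking the alignment as an argument, plus two INLINE wrappers that pin it to a literal so it is statically visible after inlining. The names and the S.sum stand-ins are hypothetical; none of this is a real primop.

    module AlignParam where

    import Data.Int (Int32)
    import qualified Data.Vector.Storable as S

    -- One worker, parameterised by alignment; the two branches stand in
    -- for an aligned (movdqa) and an unaligned (movdqu) kernel.
    sumWith :: Int -> S.Vector Int32 -> Int32
    sumWith align v
      | align >= 16 = S.sum v  -- would use aligned loads
      | otherwise   = S.sum v  -- would use unaligned loads

    {-# INLINE sumAligned16 #-}
    sumAligned16 :: S.Vector Int32 -> Int32
    sumAligned16 = sumWith 16   -- alignment is a static constant here

    {-# INLINE sumUnaligned #-}
    sumUnaligned :: S.Vector Int32 -> Int32
    sumUnaligned = sumWith 1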

Hey,

On Tue, 2013-03-12 at 14:09 +0000, Geoffrey Mainland wrote:
> On 03/10/2013 09:52 PM, Nicolas Trangez wrote:
>> ... snip ...
>
> Hi Nicolas,
>
> Have you read our paper about the SIMD work? It's available here:
>
> https://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/haskell-be...

I didn't read that one before (I had read other stream-fusion-related papers), but did now. I got most of it already while reading the vector simd branch commits. Benchmark results look very nice!

I'm afraid I didn't 'get' how the framework would allow for both AVX and SSE instructions to work on streams, since it seems to assume Multis are always a fixed number of bytes wide (in this case 16 for SSE).

> The paper describes the issues involved with integrating SIMD
> instructions into the vector fusion framework.
>
> There are two primary issues with alignment: stack alignment and heap
> alignment.
>
> We cannot rely on the stack being properly aligned for AVX spills on
> any platform, and LLVM's stack fixup code does not play well with GHC,
> so we *rewrite* all AVX spill instructions to their unaligned
> counterparts. On Win32 we must do the same for SSE.

Does this imply stack values are always 16-byte aligned? I haven't worked with AVX yet (my CPU doesn't support it).

> Unboxed vectors are allocated by GHC, and it does not align memory on
> 16-byte boundaries, so our first cut at SSE intrinsics simply used
> unaligned accesses. Obviously with ForeignPtrs we can control alignment
> and potentially use the aligned variants of SSE instructions, but this
> will almost double the number of primops. One could imagine extending
> our fusion framework to transition to aligned move instructions.

Right. I created the patch for #7067 (http://hackage.haskell.org/trac/ghc/ticket/7067) for vector-simd purposes back then (adding mallocForeignPtrAlignedBytes and mallocPlainForeignPtrAlignedBytes).
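With those two functions, getting an aligned Storable vector takes only a few lines; a minimal sketch, assuming vector's unsafeFromForeignPtr0 is available:

    module AlignedAlloc where

    import Data.Int (Int32)
    import qualified Data.Vector.Storable.Mutable as SM
    import Foreign.Storable (sizeOf)
    import GHC.ForeignPtr (mallocForeignPtrAlignedBytes)

    -- Allocate n Int32s whose payload starts on a 16-byte boundary, so
    -- an SSE kernel could use aligned loads/stores over it.
    newAligned16 :: Int -> IO (SM.IOVector Int32)
    newAligned16 n = do
      fp <- mallocForeignPtrAlignedBytes (n * sizeOf (undefined :: Int32)) 16
      return (SM.unsafeFromForeignPtr0 fp n)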
> Finally, LLVM 3.2 does not work with GHC. This means we cannot yet
> take advantage of its new vectorization optimizations, which is a
> shame.
>
> So, four projects for you or anyone else who is interested, in rough
> dependency order:
>
> 1) Get LLVM 3.2 working with GHC's LLVM back end.

According to other mails in this thread this should be fixed. I'll give it a go.

> 2) Fix the stack alignment issue with LLVM. This will likely require a
> patch to LLVM.

I'm afraid that's a bit out of my league for now :-)

> 3) Add support for aligned move primops.

I looked into this before, might give it a stab.

> 4) Extend the current SIMD fusion framework to handle transitioning to
> aligned move instructions. As an alternative, only use aligned move
> instructions on memory that we know is aligned.

This is why I sent my previous mail initially: is there any plan for how to approach the 'memory that we know is aligned' bit? Would it make sense to have a more general 'alignment restriction' framework for arbitrary values, not only unboxed vectors (if there are any other use cases)?

> These are all on my todo list, but my plate is quite full at the moment.

Heh, sounds familiar ;-)

Thanks,

Nicolas

On 03/12/2013 08:08 PM, Nicolas Trangez wrote:
> Hey,
>
> On Tue, 2013-03-12 at 14:09 +0000, Geoffrey Mainland wrote:
>> On 03/10/2013 09:52 PM, Nicolas Trangez wrote:
>>> ... snip ...
>>
>> Hi Nicolas,
>>
>> Have you read our paper about the SIMD work? It's available here:
>>
>> https://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/haskell-be...
>
> I didn't read that one before (I had read other stream-fusion-related
> papers), but did now. I got most of it already while reading the
> vector simd branch commits. Benchmark results look very nice!
>
> I'm afraid I didn't 'get' how the framework would allow for both AVX
> and SSE instructions to work on streams, since it seems to assume
> Multis are always a fixed number of bytes wide (in this case 16 for
> SSE).
The width of a Multi depends on the platform. If the platform supports AVX, the Multi will be 256 bits wide. Otherwise, it will be 128 bits wide.
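For illustration only, one way the lane count could be pinned down at build time is a CPP switch like the one below; the real Multi type in the SIMD branches is more involved, and the __AVX__ macro here is an assumption about what the build system defines.

    {-# LANGUAGE CPP #-}
    module MultiWidth where

    -- Lanes per Multi Int32 on this build; assumes the build defines
    -- __AVX__ when compiling for AVX targets.
    multiWidth :: Int
    #if defined(__AVX__)
    multiWidth = 8   -- 256-bit registers: 8 x Int32
    #else
    multiWidth = 4   -- 128-bit SSE registers: 4 x Int32
    #endif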
>> The paper describes the issues involved with integrating SIMD
>> instructions into the vector fusion framework.
>>
>> There are two primary issues with alignment: stack alignment and heap
>> alignment.
>>
>> We cannot rely on the stack being properly aligned for AVX spills on
>> any platform, and LLVM's stack fixup code does not play well with
>> GHC, so we *rewrite* all AVX spill instructions to their unaligned
>> counterparts. On Win32 we must do the same for SSE.
>
> Does this imply stack values are always 16-byte aligned? I haven't
> worked with AVX yet (my CPU doesn't support it).

On Linux and, I believe, MacOS X, the stack is 16-byte aligned. On Windows it is not.

>> Unboxed vectors are allocated by GHC, and it does not align memory on
>> 16-byte boundaries, so our first cut at SSE intrinsics simply used
>> unaligned accesses. Obviously with ForeignPtrs we can control
>> alignment and potentially use the aligned variants of SSE
>> instructions, but this will almost double the number of primops. One
>> could imagine extending our fusion framework to transition to aligned
>> move instructions.
>
> Right. I created the patch for #7067
> (http://hackage.haskell.org/trac/ghc/ticket/7067) for vector-simd
> purposes back then (adding mallocForeignPtrAlignedBytes and
> mallocPlainForeignPtrAlignedBytes).
>
>> Finally, LLVM 3.2 does not work with GHC. This means we cannot yet
>> take advantage of its new vectorization optimizations, which is a
>> shame.
>>
>> So, four projects for you or anyone else who is interested, in rough
>> dependency order:
>>
>> 1) Get LLVM 3.2 working with GHC's LLVM back end.
>
> According to other mails in this thread this should be fixed. I'll
> give it a go.
GHC doesn't bootstrap with LLVM 3.2 for me, so something seems wrong. Let me know if you get it working.
>> 2) Fix the stack alignment issue with LLVM. This will likely require
>> a patch to LLVM.
>
> I'm afraid that's a bit out of my league for now :-)
>
>> 3) Add support for aligned move primops.
>
> I looked into this before, might give it a stab.
>
>> 4) Extend the current SIMD fusion framework to handle transitioning
>> to aligned move instructions. As an alternative, only use aligned
>> move instructions on memory that we know is aligned.
>
> This is why I sent my previous mail initially: is there any plan for
> how to approach the 'memory that we know is aligned' bit? Would it
> make sense to have a more general 'alignment restriction' framework
> for arbitrary values, not only unboxed vectors (if there are any other
> use cases)?
Not until the LLVM 3.2 and stack alignment issues are resolved. I'm not sure what a general alignment restriction framework would look like. I think a framework for aligned Data.Vector.Unboxed/Storable vectors makes a lot of sense though.
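For reference, the type-level core of the vector-simd approach mentioned earlier in the thread is roughly the following; this is a simplified sketch, and the package's real definitions differ in detail (it derives larger alignments rather than enumerating instances).

    {-# LANGUAGE EmptyDataDecls #-}
    {-# LANGUAGE MultiParamTypeClasses #-}
    {-# LANGUAGE FlexibleInstances #-}
    module AlignSketch where

    -- Alignments as empty types.
    data A8
    data A16
    data A32

    -- 'AlignedToAtLeast n o' says alignment o satisfies requirement n.
    class AlignedToAtLeast n o
    instance AlignedToAtLeast A8  A8
    instance AlignedToAtLeast A8  A16
    instance AlignedToAtLeast A8  A32
    instance AlignedToAtLeast A16 A16
    instance AlignedToAtLeast A16 A32
    instance AlignedToAtLeast A32 A32

    -- Recover the runtime alignment when allocating.
    class Alignment o where
      alignmentOf :: o -> Int
    instance Alignment A8  where alignmentOf _ = 8
    instance Alignment A16 where alignmentOf _ = 16
    instance Alignment A32 where alignmentOf _ = 32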
>> These are all on my todo list, but my plate is quite full at the
>> moment.
>
> Heh, sounds familiar ;-)
>
> Thanks,
>
> Nicolas

I had one more remark about the prefetch instructions included in the compilation result (and, if I understood the paper correctly, they're there on purpose).

On Sun, 2013-03-10 at 22:52 +0100, Nicolas Trangez wrote:
> ... snip ...
> When compiling this into assembly (compiler/library version details at
> the end of this message), the 'sum' function yields (among other
> things) this code:
>
>     .LBB2_3:        # %c1C0
>                     # =>This Inner Loop Header: Depth=1
>         prefetcht0  (%rsi)
>         movdqu      -1536(%rsi), %xmm1
>         paddd       %xmm1, %xmm0
>         addq        $16, %rsi
>         addq        $4, %rcx
>         cmpq        %rdx, %rcx
>         jl          .LBB2_3
If I'm not mistaken, this results in 'prefetcht0 (%rsi)' being executed for every 16-byte block, i.e. in every loop iteration. This seems to be overkill: prefetch* loads a full cache line, which (according to some cursory reading online) is guaranteed to be at least 32 bytes; it seems to be 64 bytes on my CPU. As a result, 4 (potentially unaligned!) prefetch instructions are executed per cache line whilst there's no real use for 3 of them.

Next to this, as written in the paper, automatically generating suitable 'prefetch' instructions can be cool, but alas: in some benchmarks I performed some time ago on linear, well-aligned vectors using SSE instructions (using C and inline assembly), removing the prefetch instructions increased runtime performance (I guess due to reduced opcode dispatch, and the processor's heuristic prefetcher doing a good job when scanning over a linear memory range). There might be some more interesting research in here ;-)

Nicolas
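A sketch of the loop shape that would avoid the redundant prefetches: cover one 64-byte cache line (sixteen Int32s) per iteration, so a single prefetch per iteration suffices. No real prefetch is issued below; the U.sum calls stand in for four 4-lane SIMD adds.

    module PrefetchStride where

    import Data.Int (Int32)
    import qualified Data.Vector.Unboxed as U

    sumByCacheLine :: U.Vector Int32 -> Int32
    sumByCacheLine v = go 0 0
      where
        n = U.length v
        go i acc
          | i + 16 <= n =
              -- one prefetcht0 for a line ahead would be issued here
              go (i + 16) (acc + U.sum (U.slice i 16 v))
          | otherwise = acc + U.sum (U.drop i v)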