Re: SIMD/SSE support & alignment

13 Mar 2013

I had one more remark about the prefetch instructions included in the
compilation result (if I understood the paper correctly, they're there
on purpose).

On Sun, 2013-03-10 at 22:52 +0100, Nicolas Trangez wrote:
...
As an example, here's 'test.hs':
{-# OPTIONS_GHC -fllvm -O3 -optlo-O3 -optlc-O=3 -funbox-strict-fields #-}
module Test (sum) where
import Prelude hiding (sum)
import Data.Int (Int32)
import Data.Vector.Unboxed (Vector)
import qualified Data.Vector.Unboxed as U
sum :: Vector Int32 -> Int32
sum v = U.mfold' (+) (+) 0 v
When compiling this into assembly (compiler/library version details at
the end of this message), the 'sum' function yields (among other
things) this code:
.LBB2_3:                                # %c1C0
                                        # =>This Inner Loop Header: Depth=1
        prefetcht0      (%rsi)
        movdqu  -1536(%rsi), %xmm1
        paddd   %xmm1, %xmm0
        addq    $16, %rsi
        addq    $4, %rcx
        cmpq    %rdx, %rcx
        jl      .LBB2_3
If I'm not mistaken, this results in 'prefetcht0 (%rsi)' being executed
once per 16-byte block, i.e. in every loop iteration.

This seems to be overkill: prefetch* loads a full cache-line, which
(according to some cursory reading online) is guaranteed to be at least
32 bytes. It seems to be 64 bytes on my CPU.

As a result, with 64-byte cache lines, 4 (potentially unaligned!)
prefetch instructions are executed per cache line, whilst there's no
real use for 3 of them.
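
To make the arithmetic explicit, here is a minimal sketch. The function
names are made up for illustration, and the figures are assumptions: a
16-byte step per iteration (one movdqu) and the 64-byte cache line my
CPU appears to use.

```haskell
module Main (main) where

-- | Prefetches issued per cache line when every loop iteration issues
-- one prefetch and advances the pointer by 'stepBytes'.
prefetchesPerLine :: Int -> Int -> Int
prefetchesPerLine lineBytes stepBytes = lineBytes `div` stepBytes

-- | Prefetches per cache line beyond the single one that is useful.
redundantPrefetches :: Int -> Int -> Int
redundantPrefetches lineBytes stepBytes =
  prefetchesPerLine lineBytes stepBytes - 1

main :: IO ()
main = do
  -- Assumed: 64-byte cache line, 16-byte SSE step.
  print (prefetchesPerLine 64 16)    -- prints 4
  print (redundantPrefetches 64 16)  -- prints 3
```

With the guaranteed minimum of 32 bytes per line the same calculation
gives 2 prefetches per line, 1 of them redundant.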

Next to this, as written in the paper, having suitable 'prefetch'
instructions generated automatically can be cool, but alas: in some
benchmarks I performed some time ago on linear, well-aligned vectors
using SSE instructions (in C with inline assembly), removing the
prefetch instructions improved runtime performance (I guess due to
fewer instructions being dispatched, and the processor's heuristic
prefetcher doing a good job when scanning over a linear memory range).

There might be some more interesting research in here ;-)

Nicolas