
I had one more remark about the prefetch instructions included in the compilation result (and if I understood the paper correctly, they're there on purpose).

On Sun, 2013-03-10 at 22:52 +0100, Nicolas Trangez wrote:
As an example, here's 'test.hs':

    {-# OPTIONS_GHC -fllvm -O3 -optlo-O3 -optlc-O=3 -funbox-strict-fields #-}
    module Test (sum) where

    import Prelude hiding (sum)
    import Data.Int (Int32)
    import Data.Vector.Unboxed (Vector)
    import qualified Data.Vector.Unboxed as U

    sum :: Vector Int32 -> Int32
    sum v = U.mfold' (+) (+) 0 v

When compiling this into assembly (compiler/library version details at the end of this message), the 'sum' function yields (among other things) this code:

    .LBB2_3:                        # %c1C0
                                    # =>This Inner Loop Header: Depth=1
        prefetcht0  (%rsi)
        movdqu      -1536(%rsi), %xmm1
        paddd       %xmm1, %xmm0
        addq        $16, %rsi
        addq        $4, %rcx
        cmpq        %rdx, %rcx
        jl          .LBB2_3

If I'm not mistaken, this results in 'prefetcht0 (%rsi)' being executed in every loop iteration, i.e. once per block of 16 bytes. This seems to be overkill: prefetch* loads a full cache line, which (according to some cursory reading online) is guaranteed to be at least 32 bytes, and seems to be 64 bytes on my CPU. As a result, 4 (potentially unaligned!) prefetch instructions are executed per cache line, whilst there's no real use for 3 of them.

Next to this, as written in the paper, having suitable 'prefetch' instructions generated automatically can be cool, but alas: in some benchmarks I performed some time ago on linear, well-aligned vectors using SSE instructions (using C and inline assembly), removing the prefetch instructions improved runtime performance. I guess this is due to reduced opcode dispatch, and the processor's heuristic prefetcher doing a good job when scanning over a linear memory range.

There might be some more interesting research in here ;-)

Nicolas
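
PS: To put rough numbers on the remark about redundant prefetches, here is a small back-of-the-envelope sketch in Haskell. The helper names are made up for illustration. The 16-byte stride and the 1536-byte offset are read off the assembly listing ('addq $16, %rsi' and 'movdqu -1536(%rsi)'); the cache-line size is an assumption passed in as a parameter, and the calculation assumes the stride evenly divides the line size.

```haskell
-- Back-of-the-envelope arithmetic for the inner loop in the listing.
-- Assumption: the cache-line size is a multiple of the loop stride.

-- | Number of consecutive loop iterations whose prefetch address falls
-- into the same cache line.
iterationsPerLine :: Int -> Int -> Int
iterationsPerLine lineBytes strideBytes = lineBytes `div` strideBytes

-- | Of those iterations, how many issue a prefetch for a line that an
-- earlier iteration of the same group already requested.
redundantPerLine :: Int -> Int -> Int
redundantPerLine lineBytes strideBytes =
  iterationsPerLine lineBytes strideBytes - 1

-- | Byte distance between the prefetched address, (%rsi), and the
-- address actually loaded, -1536(%rsi), as seen in the listing.
prefetchDistanceBytes :: Int
prefetchDistanceBytes = 1536

-- | The same distance expressed in cache lines of the given size.
prefetchDistanceLines :: Int -> Int
prefetchDistanceLines lineBytes = prefetchDistanceBytes `div` lineBytes
```

Loaded into GHCi, 'redundantPerLine 64 16' evaluates to 3 and 'redundantPerLine 32 16' to 1, matching the "3 out of 4" estimate for 64-byte lines. 'prefetchDistanceLines 64' evaluates to 24, so under the 64-byte assumption the loop prefetches 24 cache lines ahead of the data it is currently adding.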