
By which I mean having this family of proposed primops. It's not obvious, to me at least, how GHC could intelligently infer or use these implicitly for the end user or library writer.
I have a couple of ideas for how to implement this, but having an explicit set of primops will make using the vector instructions less magical. As for exposing only the primops that are valid for a given arch/CPU target: that would make things much more complicated. LLVM takes care of implementing a vector operation from smaller instructions. For example, an operation on the DoubleX16 primitive type such as

    plusDoubleX16# :: DoubleX16# -> DoubleX16# -> DoubleX16#

gets compiled into something like

    movq %r13, 616(%rsp)
    movq %rbp, 608(%rsp)
    movq %r12, 600(%rsp)
    movq %rbx, 592(%rsp)
    movq %r15, 544(%rsp)
    movq 592(%rsp), %rax
    movq %rax, 344(%rsp)
    movq 608(%rsp), %rax
    vmovups (%rax), %ymm0
    vmovups 32(%rax), %ymm1
    vmovups 64(%rax), %ymm2
    vmovups 96(%rax), %ymm3
    vmovaps %ymm3, 224(%rsp)
    vmovaps %ymm2, 192(%rsp)
    vmovaps %ymm1, 160(%rsp)
    vmovaps %ymm0, 128(%rsp)
    movq 608(%rsp), %rax
    vmovups 128(%rax), %ymm0
    vmovups 160(%rax), %ymm1
    vmovups 192(%rax), %ymm2
    vmovups 224(%rax), %ymm3
    vmovaps %ymm3, 96(%rsp)
    vmovaps %ymm2, 64(%rsp)
    vmovaps %ymm1, 32(%rsp)
    vmovaps %ymm0, (%rsp)
    movq 344(%rsp), %rbx
    movq %rbx, 592(%rsp)
    movq 544(%rsp), %r15
    movq 600(%rsp), %r12
    movq 608(%rsp), %rax
    movq 616(%rsp), %r13
    movq %rax, %rbp
    vzeroupper

(Still, it should be possible to compile this with fewer moves.)
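To illustrate what "implementing a vector operation from smaller instructions" means, here is a rough model in ordinary Haskell. It is only a sketch: `V4`, `V16`, `plusV4`, `plusV16`, `fromList16` and `toList16` are hypothetical names made up for this example, not GHC primitives, and the code says nothing about how GHC or LLVM actually perform the lowering. The idea is that 16 doubles occupy 128 bytes, i.e. four 256-bit pieces, which matches the four %ymm registers each operand passes through in the listing above, so a wide addition can be expressed as four narrow ones.

```haskell
module Main where

-- Hypothetical model type: stands in for one 256-bit register
-- holding four Doubles. Not a GHC primitive.
data V4 = V4 !Double !Double !Double !Double
  deriving (Eq, Show)

-- Hypothetical model of the proposed DoubleX16#: four 256-bit pieces.
data V16 = V16 !V4 !V4 !V4 !V4
  deriving (Eq, Show)

-- One "narrow" lane-wise addition on four Doubles.
plusV4 :: V4 -> V4 -> V4
plusV4 (V4 a0 a1 a2 a3) (V4 b0 b1 b2 b3) =
  V4 (a0 + b0) (a1 + b1) (a2 + b2) (a3 + b3)

-- The "wide" addition, expressed as four independent narrow additions.
plusV16 :: V16 -> V16 -> V16
plusV16 (V16 a0 a1 a2 a3) (V16 b0 b1 b2 b3) =
  V16 (plusV4 a0 b0) (plusV4 a1 b1) (plusV4 a2 b2) (plusV4 a3 b3)

-- Build a V16 from exactly 16 elements; any other length is rejected.
fromList16 :: [Double] -> Maybe V16
fromList16 [a0, a1, a2, a3, b0, b1, b2, b3, c0, c1, c2, c3, d0, d1, d2, d3] =
  Just (V16 (V4 a0 a1 a2 a3) (V4 b0 b1 b2 b3)
            (V4 c0 c1 c2 c3) (V4 d0 d1 d2 d3))
fromList16 _ = Nothing

-- Flatten a V16 back into its 16 lanes, in order.
toList16 :: V16 -> [Double]
toList16 (V16 a b c d) = concatMap toList4 [a, b, c, d]
  where
    toList4 (V4 w x y z) = [w, x, y, z]

main :: IO ()
main =
  case (fromList16 [1 .. 16], fromList16 (replicate 16 0.5)) of
    (Just xs, Just ys) -> print (toList16 (plusV16 xs ys))
    _                  -> putStrLn "expected exactly 16 elements"
```

With an explicit primop like plusDoubleX16#, the user writes only the wide operation and the backend chooses the piece size for the target, which is why restricting the primop set per arch/CPU is not strictly needed.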