On Tue, Mar 12, 2013 at 7:09 AM, Geoffrey Mainland <mainland@apeiron.net> wrote:
> Unboxed vectors are allocated by GHC, and it does not align memory on
> 16-byte boundaries, so our first cut at SSE intrinsics simply used
> unaligned accesses. Obviously, with ForeignPtrs we can control alignment
> and potentially use the aligned variants of SSE instructions, but this
> will almost double the number of primops. One could imagine extending
> our fusion framework to transition to aligned move instructions.

When I implemented the memcpy primops, I added an optional alignment parameter to each primop rather than introducing new primops. LLVM's memcpy intrinsic uses the same setup. Perhaps that could work for you?
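To make the suggestion concrete, here is a minimal C sketch, assuming x86 SSE intrinsics. A single operation takes an alignment argument and chooses between the aligned moves (`_mm_load_ps`/`_mm_store_ps`) and the unaligned ones (`_mm_loadu_ps`/`_mm_storeu_ps`), so there is one entry point per operation instead of two. The function `vec_add_f32` and its `align` parameter are hypothetical illustrations, not GHC's actual primop or Cmm API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical operation: dst[i] = a[i] + b[i] for i < n.
 * `align` is the caller's promise (in bytes) about dst, a and b; it plays
 * the role of the alignment parameter, so one entry point covers both cases. */
void vec_add_f32(float *dst, const float *a, const float *b,
                 size_t n, size_t align)
{
    size_t i = 0;

    /* Debug builds check the promise before aligned moves are used. */
    assert(align < 16 ||
           (((uintptr_t)dst | (uintptr_t)a | (uintptr_t)b) % 16) == 0);

#if defined(__SSE__)
    if (align >= 16) {
        /* Aligned loads/stores (movaps): fault on a misaligned address. */
        for (; i + 4 <= n; i += 4)
            _mm_store_ps(dst + i,
                         _mm_add_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
    } else {
        /* Unaligned loads/stores (movups): valid for any address. */
        for (; i + 4 <= n; i += 4)
            _mm_storeu_ps(dst + i,
                          _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    }
#endif

    /* Scalar tail (and the whole loop when SSE is unavailable). */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Because `align` is a promise rather than something checked at run time (apart from the debug `assert`), a caller such as the fusion framework would pass 16 only when it knows the buffers came from an aligned allocation, and fall back to a smaller value otherwise.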

-- Johan