That, rather tangentially, reminds me: if we do start teaching the code generator to produce these sorts of things from simpler parts, e.g. by enabling something like LLVM's vectorization passes, or some future internal GHC pass that looks for, say, superword-level parallelism in the style of Jaewook Shin, then we need to differentiate between flags for what GHC/LLVM is allowed to produce via optimization and flags for what the end user is allowed to emit explicitly.

For example, in my own code I can safely call AVX2 primitives once I've set up guards that check I'm on a CPU that supports them, but currently I can only emit that code after telling GHC to allow AVX2 instructions. If I build a complicated dispatch mechanism in Haskell that picks the right ISA and emits code for several of them, I'm going to need to tell GHC to let me build with all sorts of instruction sets that the machine the final executable runs on may not fully support. If that same permission also lets the optimizer use those instructions in code outside my guards, the executable could hit an illegal instruction on an older CPU. We should be careful not to conflate these two things.
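To make the shape of that dispatch concrete, here is a minimal sketch. Everything in it is hypothetical: the `ISA` type, `detectISA`, and both kernels are illustrative names, the CPU check is stubbed out rather than querying CPUID, and `sumAVX2` is a plain-Haskell stand-in for a kernel that would really be written with AVX2 primops.

```haskell
module Main where

import Data.List (foldl')

-- Hypothetical ISA levels, ordered from least to most capable.
data ISA = Baseline | SSE42 | AVX2
  deriving (Show, Eq, Ord)

-- Assumption: real code would query CPUID (e.g. through the FFI).
-- This stub always reports the most conservative answer.
detectISA :: IO ISA
detectISA = pure Baseline

-- Portable kernel; must never contain AVX2 instructions.
sumBaseline :: [Int] -> Int
sumBaseline = foldl' (+) 0

-- Stand-in for a kernel written with explicit AVX2 primops.
-- Only this definition needs permission to emit AVX2.
sumAVX2 :: [Int] -> Int
sumAVX2 = foldl' (+) 0

-- The guard: the AVX2 kernel is reachable only when the CPU has AVX2.
selectSum :: ISA -> [Int] -> Int
selectSum isa
  | isa >= AVX2 = sumAVX2
  | otherwise   = sumBaseline

main :: IO ()
main = do
  isa <- detectISA
  print (selectSum isa [1 .. 10])
```

If `-mavx2` on this module both permits the explicit primops in `sumAVX2` and licenses the optimizer to use AVX2 wherever it likes, then `sumBaseline` and `selectSum` are no longer guaranteed safe on a baseline CPU, which is exactly the conflation I'd like us to avoid.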
-Edward