
* and not low-level enough: How do I tell GHC to pack (coerce?) `data Pos` into `Word64`? (It's not https://ghc.gitlab.haskell.org/ghc/doc/users_guide/exts/pragmas.html#unpack-... ?)
And would it help? Or is it even needed? If I have the data spread over several words, it could still be fine - as long as it's kept in registers?
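
To make the packing question concrete, here is a minimal sketch of the two options I can think of. The real `data Pos` is not shown in this thread, so the two `Word32` fields and the names `posX`, `posY`, `PackedPos`, `pack` and `unpack` are hypothetical. The first option uses strict fields with `UNPACK`, which removes the pointer indirection to boxed fields but does not by itself guarantee a single-word layout. The second packs by hand into a `newtype` over `Word64`.

```haskell
module Main where

import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word32, Word64)

-- Hypothetical layout: two 32-bit coordinates (an assumption, not from the thread).
-- Strict fields plus UNPACK store the values inline in the constructor
-- instead of as pointers to separately allocated boxed values.
data Pos = Pos
  { posX :: {-# UNPACK #-} !Word32
  , posY :: {-# UNPACK #-} !Word32
  } deriving (Eq, Show)

-- Manual packing: a newtype has the same runtime representation as the
-- type it wraps, so PackedPos is represented exactly like a Word64.
newtype PackedPos = PackedPos Word64
  deriving (Eq, Show)

-- Put posX in the high 32 bits and posY in the low 32 bits.
pack :: Pos -> PackedPos
pack (Pos x y) = PackedPos ((fromIntegral x `shiftL` 32) .|. fromIntegral y)

-- Reverse of pack; the fromIntegral conversions to Word32 truncate to the low 32 bits.
unpack :: PackedPos -> Pos
unpack (PackedPos w) = Pos (fromIntegral (w `shiftR` 32)) (fromIntegral w)

main :: IO ()
main = do
  let p = Pos 3 7
  print (pack p)
  print (unpack (pack p) == p)
```

Whether either variant is actually faster would have to be measured; the sketch only shows what "packing" could mean here.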
Actually, as long as it is kept in a single CPU cache line. Cf. Ulrich Drepper, "What Every Programmer Should Know About Memory": https://www.akkadia.org/drepper/cpumemory.pdf

The paper tells me that data locality with respect to cache lines (i.e. keeping data that is accessed together in a single cache line) can have an order-of-magnitude effect. (It also talks about multithreading, where the effect can be two orders of magnitude. That is not relevant to vector optimization, though.)

It's quite possible that the speedups from using a CPU's vector operations come mostly from better cache-line locality, since the vector operations enforce data locality, though vector operations probably give you a nice boost on top of that.

Does GHC do memory locality analysis? It would need to find out which data items are going to be accessed at roughly the same time, and make sure they're close together in memory. Deforestation and the like will help with locality as a nice side effect (because you get rid of list spines and such, so the data stretches across fewer cache lines anyway), but is there any analysis on top of that?

Regards,
Jo