faster faster faster but not uglier (how to make nice code AND nice core)?

Dear Cafe,

* How to un-uglify the following program?
* And is it even ugly enough?

The following is an artificial example, but I think it illustrates some general questions for which I don't have answers. Do you?

I wanted a really fast implementation of a vector addition system where the "vector" contains small numbers only, and I want it packed into a machine word (and finally put these into Data.IntSet). I can write this in low-level Haskell (I did), and I could even use some low-level language (I also did), but I was hoping that I don't need to.

https://gitlab.imn.htwk-leipzig.de/waldmann/tf/-/blob/master/TF4.hs#L108

* This is too low-level: it is written for vectors of one fixed size, but I want the code to be more generic and have the compiler do the specialization/unrolling.

* And it is not low-level enough: how do I tell GHC to pack (coerce?) `data Pos` into `Word64`? (It's not https://ghc.gitlab.haskell.org/ghc/doc/users_guide/exts/pragmas.html#unpack-... ?) And would it help? Or is it even needed? If the data is spread over several words, it could still be fine, as long as it's kept in registers. How can I then check that "it's fine"? (One way to do the packing by hand is sketched after this message.)

Yes, I know: just look at the Core. (That seems to be the ultima ratio of https://ghc.gitlab.haskell.org/ghc/doc/users_guide/hints.html#faster-produci...) But is there, or could there be, some higher-level mechanism? Perhaps a (type?) annotation that guarantees or checks that the Core (locally) looks "good" (does not allocate, ...)?

NB: For the concrete application (evaluation of positions in some game):
1. the graph size is exponential, so a linear speed-up won't really help;
2. brute-force gameplay should be replaced by some symbolic evaluation and simplification. That is in fact my goal. But as long as we don't have this, the program may be useful for generating conjectures.

- J.
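[A minimal sketch of packing by hand. The four-lane 16-bit layout and all names here are my assumptions, not taken from TF4.hs: a newtype over Word64 plus shift/mask accessors.]

   import Data.Bits (complement, shiftL, shiftR, (.&.), (.|.))
   import Data.Word (Word16, Word64)

   -- Hypothetical: a position of four 16-bit counters,
   -- packed into one machine word.
   newtype Pos = Pos Word64

   -- Read lane i (0..3).
   lane :: Int -> Pos -> Word16
   lane i (Pos w) = fromIntegral ((w `shiftR` (16 * i)) .&. 0xFFFF)

   -- Overwrite lane i (0..3): clear the old lane, then insert the new value.
   setLane :: Int -> Word16 -> Pos -> Pos
   setLane i v (Pos w) =
     Pos ((w .&. complement (0xFFFF `shiftL` (16 * i)))
          .|. (fromIntegral v `shiftL` (16 * i)))

Since Pos is a newtype, it has the same runtime representation as Word64, so on a 64-bit platform a value can go straight into Data.IntSet via fromIntegral.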

On Wed, 19 May 2021, Johannes Waldmann wrote:
> I wanted a really fast implementation of a vector addition system where the "vector" contains small numbers only.
With integer elements?
> and I want it packed into a machine word (and finally put these into Data.IntSet).
Why not use Integer? With a bit of bit trickery you can implement a vector add via Integer addition:

https://en.wikipedia.org/wiki/SWAR

E.g. like this (a concrete fixed-width version is sketched below):

   vectorAdd x y =
        maskOddElements (maskOddElements x + maskOddElements y)
      + maskEvenElements (maskEvenElements x + maskEvenElements y)

But I suspect that larger Integers are not held in registers. Chances are better for a record of Word8's, Word16's, ... Maybe if they are put in a compact region:

http://hackage.haskell.org/package/ghc-compact

Or you use the CPU vector types that GHC provides:

http://hackage.haskell.org/package/ghc-prim-0.7.0/docs/GHC-Prim.html#g:35

There seem to be some packages for small vectors:

https://hackage.haskell.org/package/compact-word-vectors
https://hackage.haskell.org/package/smallarray
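[To make the scheme concrete, here is a sketch for the fixed case of four 16-bit lanes in a Word64. The lane width and mask constants are my choices for this layout; the Integer version above generalizes to arbitrary widths.]

   import Data.Bits ((.&.), (.|.))
   import Data.Word (Word64)

   evenMask, oddMask :: Word64
   evenMask = 0x0000FFFF0000FFFF   -- lanes 0 and 2
   oddMask  = 0xFFFF0000FFFF0000   -- lanes 1 and 3

   -- Lane-wise addition without cross-lane carries: carries out of
   -- the even lanes land in the (zeroed) odd lanes and are masked
   -- off, and vice versa.  Since the two halves occupy disjoint
   -- bits, (.|.) recombines them (addition would do the same).
   vectorAdd :: Word64 -> Word64 -> Word64
   vectorAdd x y =
         (((x .&. evenMask) + (y .&. evenMask)) .&. evenMask)
     .|. (((x .&. oddMask)  + (y .&. oddMask))  .&. oddMask)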

Henning Thielemann wrote:
> Or you use the CPU vector types that GHC provides: http://hackage.haskell.org/package/ghc-prim-0.7.0/docs/GHC-Prim.html#g:35
More operations would be nice; in particular, I'd love to get xorWord64X4.

-- 
CYa, ⡍⠁⠗⠊⠕

> * And it is not low-level enough: how do I tell GHC to pack (coerce?) `data Pos` into `Word64`? (It's not https://ghc.gitlab.haskell.org/ghc/doc/users_guide/exts/pragmas.html#unpack-... ?)
> And would it help? Or is it even needed? If the data is spread over several words, it could still be fine, as long as it's kept in registers.
Actually, as long as it is kept in a CPU cache line. Cf. Ulrich Drepper, "What Every Programmer Should Know About Memory":

https://www.akkadia.org/drepper/cpumemory.pdf

The paper tells me that data locality wrt. cache lines (i.e. keeping data that is accessed together in a single cache line) can have an order-of-magnitude effect. (It also talks about multithreading, which can have a two-orders-of-magnitude effect, but that's not relevant to vector optimization.)

It's quite possible that the speedup from using a CPU's vector operations comes mostly from better cache-line locality, since the vector operations enforce data locality; the operations themselves probably give you a nice boost on top of that.

Does GHC do memory-locality analysis? It would need to find out which data items are going to be accessed at roughly the same time, and make sure they are placed close together in memory. Deforestation and such will help with locality as a nice side effect (because you get rid of list spines and such, so the data stretches across fewer cache lines anyway), but is there any analysis on top of that? (One knob that is available today is unpacking strict fields; see the sketch below.)

Regards,
Jo
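[Within Haskell today, the closest thing to a locality knob is unpacking strict constructor fields. A minimal sketch; the type and names are mine, not from the thread.]

   import Data.Word (Word64)

   -- Without the pragmas, each field would be a pointer to a
   -- separately boxed Word64 elsewhere on the heap.  With them,
   -- a Pair is a single heap object of three words (header plus
   -- two payload words), so both fields share one cache line.
   data Pair = Pair {-# UNPACK #-} !Word64
                    {-# UNPACK #-} !Word64

Beyond single constructors, unboxed arrays (e.g. Data.Vector.Unboxed) give the contiguous, spine-free layout that deforestation only approximates.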
participants (4):
- Henning Thielemann
- Joachim Durchholz
- Johannes Waldmann
- Mario Lang