
On the other hand, I still think it would be worth actually benchmarking this stuff to see how much difference it makes. Wouldn't surprise me if the CPU designers did some clever trickery with pipelining and superscalar execution to make two adjacent 32-bit instructions execute the same way as a single 64-bit instruction would...
(I've seen various sources claim that running software in 64-bit mode only gives you a 2% speedup. Then again, they presumably aren't testing with chess software which heavily utilises explicit 64-bit operations.)
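Something like the following would be one way to measure it -- a minimal sketch, assuming the criterion package (my choice, not something mentioned above); build it with both a 32-bit and a 64-bit GHC and compare the reported times:

    -- Benchmark a handful of typical bitboard-style 64-bit operations.
    import Criterion.Main
    import Data.Bits
    import Data.List (foldl')
    import Data.Word

    -- Fold some AND/OR/XOR/shift work over a list of Word64 values.
    bitboardWork :: [Word64] -> Word64
    bitboardWork = foldl' step 0
      where
        step acc b = (acc `xor` b) .|. (b `shiftL` 9) .&. (b `shiftR` 7)

    main :: IO ()
    main = defaultMain
      [ bench "word64 bit ops" $ whnf bitboardWork [1 .. 100000] ]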
For a chess engine this is certainly not true. I guess this is one of the very few domains where it really matters! The basic operations on bitboards are ANDing, ORing, XORing, shifting and (for magic bitboards) multiplying 64-bit values. With 32-bit instructions, some of these take more than twice as long to get the same result. A rough sketch of what those primitives look like on Word64 is below (the names and the magic-bitboard constants are made up for the example):
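    import Data.Bits
    import Data.Word

    type Bitboard = Word64

    -- Shift every piece one rank up the board.
    northOne :: Bitboard -> Bitboard
    northOne b = b `shiftL` 8

    -- Combine two occupancy sets.
    occupiedEither :: Bitboard -> Bitboard -> Bitboard
    occupiedEither white black = white .|. black

    -- Magic-bitboard index computation; 'magic' and 'shiftAmount' are
    -- illustrative placeholders, not real constants.
    magicIndex :: Bitboard -> Bitboard -> Int -> Int
    magicIndex occupancy magic shiftAmount =
      fromIntegral ((occupancy * magic) `shiftR` shiftAmount)

On a 32-bit target every one of these compiles down to multiple 32-bit instructions (and the 64-bit multiply is considerably worse than that).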
I'm still left wondering whether using 32-bit instructions to manipulate 64-bit values is actually that much slower. Back in the old days of non-pipelined, single-issue CPUs it certainly would have been, but today's processors are far more complex than that. Things like cache misses tend to incur a way, way bigger performance hit than anything arithmetic-related.
I was wondering whether it would be possible to support 64-bit native code generation without the other stuff (calling conventions, Win64-specific system calls, etc.). Perhaps that could be a first step towards full 64-bit support. But from reading the GHC code I could not work out what this would involve.
I'm wondering if you could write the operations you want as small C stub functions, and FFI to them and do it that way. I don't really know enough about this sort of thing to know whether that would work...
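Roughly what I have in mind -- just a sketch, with made-up names (bitops.c, shift_left64):

    {-# LANGUAGE ForeignFunctionInterface #-}
    -- The C side would be a tiny stub such as:
    --
    --   /* bitops.c */
    --   #include <stdint.h>
    --   uint64_t shift_left64(uint64_t x, int n) { return x << n; }
    --
    -- compiled and linked alongside the Haskell sources.
    import Data.Word (Word64)
    import Foreign.C.Types (CInt)

    foreign import ccall unsafe "shift_left64"
      c_shiftLeft64 :: Word64 -> CInt -> Word64

    main :: IO ()
    main = print (c_shiftLeft64 1 63)

GHC will compile and link the C file if you pass it on the command line with the Haskell sources, though I'd guess the per-call overhead would swamp any gain for single bit operations, so it might defeat the purpose.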