Hey Daniel 👋
That's a really interesting topic, because we have never analyzed the emitted RISC-V assembly statistically.
(We haven't decided yet which RISC-V profile to require; requiring the very latest extensions would frustrate people with older hardware. In any case, it's good to have possible improvements documented in tickets.)
I'm wondering whether you really have to go through QEMU, or whether feeding the assembly code to a parser and doing the math on that would be sufficient. (Of course, tracing the execution is more accurate. However, it's also much more complicated.)
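A minimal sketch of the "parse the assembly and do the math" idea, in Python. It only builds a static histogram of instruction mnemonics. The function name `mnemonic_histogram` is made up for illustration, and the sketch assumes GNU-assembler-style text: `#` starts a comment, directives start with `.`, and labels end with `:`.

```python
import re
from collections import Counter

LABEL = re.compile(r"^[\w.$]+:\s*")  # e.g. "foo_info:" or ".Lloop:"

def mnemonic_histogram(asm_text):
    """Count how often each instruction mnemonic occurs in assembly text."""
    counts = Counter()
    for raw in asm_text.splitlines():
        # Assumption: '#' starts a comment (GNU as syntax for RISC-V).
        line = raw.split("#", 1)[0].strip()
        # Strip any leading labels so "label: insn" still counts the insn.
        while True:
            m = LABEL.match(line)
            if not m:
                break
            line = line[m.end():]
        # Skip empty lines and assembler directives such as ".text".
        if not line or line.startswith("."):
            continue
        counts[line.split()[0]] += 1
    return counts
```

Calling `mnemonic_histogram(open(path).read()).most_common(20)` on a dumped assembly file would give a first impression of which instructions dominate. This is a static count, so it says nothing about how often each instruction actually executes.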
To attribute assembly instructions to Cmm statements, you can use the GHC flags -ddump-cmm and -dppr-debug (and -ddump-to-file to write the dumps to files instead of stdout). This adds comments for most Cmm statements to the dumped assembly code.
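Once the dumped assembly carries such comments, the attribution could be sketched in Python roughly as follows. This is only an illustration under loud assumptions: that each Cmm statement shows up as a comment on its own line, that the comment prefix is `#`, and that labels and directives sit on their own lines. The exact comment format in the dump should be checked first, and the name `instructions_per_comment` is hypothetical.

```python
def instructions_per_comment(asm_text, comment_prefix="#"):
    """Group instructions under the most recent comment line.

    Returns a list of (comment_text, instruction_count) pairs.
    Assumption: every comment line marks the start of one Cmm statement.
    """
    groups = []  # each entry is [comment_text, instruction_count]
    for raw in asm_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(comment_prefix):
            # A new comment opens a new group.
            groups.append([line[len(comment_prefix):].strip(), 0])
            continue
        # Assumption: directives start with '.', labels end with ':'.
        if line.startswith(".") or line.endswith(":"):
            continue
        # Instructions before the first comment are ignored.
        if groups:
            groups[-1][1] += 1
    return [(comment, count) for comment, count in groups]
```

Sorting the result by instruction count would point at the Cmm statements that expand into the longest instruction sequences.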
Skimming the NCG code and watching out for longer or repeated instruction sequences might be a good strategy for making educated guesses.
So, you could raise the question of whether, analogous to compressed instructions, it would make sense to have extended instructions that cover two words, such that the first word is the instruction and the second its immediate(s). (Hardware designers would probably hate that, because it means a bigger change to the instruction decoding unit. However, I got asked as a software developer ;) )
Other than that, I've unfortunately got no great ideas.
Please feel free to keep us in the loop (especially regarding the results of your analyses). And if you've got any questions regarding the RISC-V NCG, please feel free to reach out either here or directly to me. There's also a #GHC "room" on Matrix where you can quickly drop smaller-scoped questions.
I hope this was of some help. Best regards,
Sven