My earlier experiment was on GHC-7.10.3. I repeated this on GHC-8.0.1 and the assembly traced was exactly the same except for a marginal improvement. The 8.0.1 code generator removed the r14/r11 swap but the rest of the register ring shift remains the same. I have updated the github gist with the 8.0.1 trace:

https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447

thanks,
harendra

On 13 June 2016 at 00:08, Harendra Kumar <harendra.kumar@gmail.com> wrote:
Hi,

I am implementing unicode normalization in Haskell. I challenged myself to match the performance with the best C/C++ implementation, the best being the ICU library. I am almost there, beating it in one of the benchmarks and within 30% for others. I am out of all application level tricks that I could think of and now need help from the compiler.

I started with a bare minimum loop and adding functionality incrementally watching where the performance trips. At one point I saw that adding just one 'if' condition reduced the performance by half. I looked at what's going on at the assembly level. Here is a github gist of the assembly instructions executed in the fast path of the loop, corresponding cmm snippets and also the full cmm corresponding to the loop:

https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447

I have annotated the assembly code with labels matching the corresponding CMM.

With the addition of another "if" condition the loop which was pretty simple till now suddenly got bloated with a lot of register reassignments. Here is a snippet of the register movements added:

# _n4se:
# swap r14 <-> r11
=> 0x408d6b: mov    %r11,0x98(%rsp)
=> 0x408d73: mov    %r14,%r11
=> 0x408d76: mov    0x98(%rsp),%r14

# reassignments
# rbx -> r10 -> r9 -> r8 -> rdi -> rsi -> rdx -> rcx -> rbx
=> 0x408d7e: mov    %rbx,0x90(%rsp)
=> 0x408d86: mov    %rcx,%rbx
=> 0x408d89: mov    %rdx,%rcx
=> 0x408d8c: mov    %rsi,%rdx
=> 0x408d8f: mov    %rdi,%rsi
=> 0x408d92: mov    %r8,%rdi
=> 0x408d95: mov    %r9,%r8
=> 0x408d98: mov    %r10,%r9
=> 0x408d9b: mov    0x90(%rsp),%r10
.
.
.
loop logic here which uses only %rax, %r10 and %r9 . 
.
.
.
# _n4s8:
# shuffle back to original assignments
=> 0x4090dc: mov    %r14,%r11
=> 0x4090df: mov    %r9,%r10
=> 0x4090e2: mov    %r8,%r9
=> 0x4090e5: mov    %rdi,%r8
=> 0x4090e8: mov    %rsi,%rdi
=> 0x4090eb: mov    %rdx,%rsi
=> 0x4090ee: mov    %rcx,%rdx
=> 0x4090f1: mov    %rbx,%rcx
=> 0x4090f4: mov    %rax,%rbx
=> 0x4090f7: mov    0x88(%rsp),%rax

=> 0x4090ff: jmpq   0x408d2a


The registers seem to be getting reassigned here, data flowing from one to the next. In this particular path a lot of these register movements seem unnecessary and are only undone at the end without being used. 

Maybe this is because these are reusable blocks and the movement is necessary when used in some other path?

Can this be avoided? Or at least avoided in a certain fast path somehow by hinting the compiler? Any pointers to the GHC code will be appreciated. I am not yet much familiar with the GHC code but can dig deeper pretty quickly. But before that I hope someone knowledgeable in this area can shed some light on this at a conceptual level or if at all it can be improved. I can provide more details and experiment more if needed.

Thanks,
Harendra