Re: CMM-to-ASM: Register allocation wierdness

13 Jun 2016

      My earlier experiment was on GHC-7.10.3. I repeated this on GHC-8.0.1 and
the assembly traced was exactly the same except for a marginal improvement.
The 8.0.1 code generator removed the r14/r11 swap but the rest of the
register ring shift remains the same. I have updated the github gist with
the 8.0.1 trace:

https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447

thanks,
harendra

On 13 June 2016 at 00:08, Harendra Kumar  wrote:
...
Hi,
I am implementing unicode normalization in Haskell. I challenged myself to
match the performance with the best C/C++ implementation, the best being
the ICU library. I am almost there, beating it in one of the benchmarks and
within 30% for others. I am out of all application level tricks that I
could think of and now need help from the compiler.
I started with a bare minimum loop and adding functionality incrementally
watching where the performance trips. At one point I saw that adding just
one 'if' condition reduced the performance by half. I looked at what's
going on at the assembly level. Here is a github gist of the assembly
instructions executed in the fast path of the loop, corresponding cmm
snippets and also the full cmm corresponding to the loop:
https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447
I have annotated the assembly code with labels matching the corresponding
CMM.
With the addition of another "if" condition the loop which was pretty
simple till now suddenly got bloated with a lot of register reassignments.
Here is a snippet of the register movements added:
# _n4se:
# swap r14 <-> r11
=> 0x408d6b: mov    %r11,0x98(%rsp)
=> 0x408d73: mov    %r14,%r11
=> 0x408d76: mov    0x98(%rsp),%r14
# reassignments
# rbx -> r10 -> r9 -> r8 -> rdi -> rsi -> rdx -> rcx -> rbx
=> 0x408d7e: mov    %rbx,0x90(%rsp)
=> 0x408d86: mov    %rcx,%rbx
=> 0x408d89: mov    %rdx,%rcx
=> 0x408d8c: mov    %rsi,%rdx
=> 0x408d8f: mov    %rdi,%rsi
=> 0x408d92: mov    %r8,%rdi
=> 0x408d95: mov    %r9,%r8
=> 0x408d98: mov    %r10,%r9
=> 0x408d9b: mov    0x90(%rsp),%r10
.
.
.
loop logic here which uses only %rax, %r10 and %r9 .
.
.
.
# _n4s8:
# shuffle back to original assignments
=> 0x4090dc: mov    %r14,%r11
=> 0x4090df: mov    %r9,%r10
=> 0x4090e2: mov    %r8,%r9
=> 0x4090e5: mov    %rdi,%r8
=> 0x4090e8: mov    %rsi,%rdi
=> 0x4090eb: mov    %rdx,%rsi
=> 0x4090ee: mov    %rcx,%rdx
=> 0x4090f1: mov    %rbx,%rcx
=> 0x4090f4: mov    %rax,%rbx
=> 0x4090f7: mov    0x88(%rsp),%rax
=> 0x4090ff: jmpq   0x408d2a
The registers seem to be getting reassigned here, data flowing from one to
the next. In this particular path a lot of these register movements seem
unnecessary and are only undone at the end without being used.
Maybe this is because these are reusable blocks and the movement is
necessary when used in some other path?
Can this be avoided? Or at least avoided in a certain fast path somehow by
hinting the compiler? Any pointers to the GHC code will be appreciated. I
am not yet much familiar with the GHC code but can dig deeper pretty
quickly. But before that I hope someone knowledgeable in this area can shed
some light on this at a conceptual level or if at all it can be improved. I
can provide more details and experiment more if needed.
Thanks,
Harendra