
My earlier experiment was on GHC-7.10.3. I repeated this on GHC-8.0.1 and
the assembly traced was exactly the same except for a marginal improvement.
The 8.0.1 code generator removed the r14/r11 swap but the rest of the
register ring shift remains the same. I have updated the github gist with
the 8.0.1 trace:
https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447
thanks,
harendra
On 13 June 2016 at 00:08, Harendra Kumar
Hi,
I am implementing unicode normalization in Haskell. I challenged myself to match the performance with the best C/C++ implementation, the best being the ICU library. I am almost there, beating it in one of the benchmarks and within 30% for others. I am out of all application level tricks that I could think of and now need help from the compiler.
I started with a bare minimum loop and adding functionality incrementally watching where the performance trips. At one point I saw that adding just one 'if' condition reduced the performance by half. I looked at what's going on at the assembly level. Here is a github gist of the assembly instructions executed in the fast path of the loop, corresponding cmm snippets and also the full cmm corresponding to the loop:
https://gist.github.com/harendra-kumar/7d34c6745f604a15a872768e57cd2447
I have annotated the assembly code with labels matching the corresponding CMM.
With the addition of another "if" condition the loop which was pretty simple till now suddenly got bloated with a lot of register reassignments. Here is a snippet of the register movements added:
# _n4se: # swap r14 <-> r11 => 0x408d6b: mov %r11,0x98(%rsp) => 0x408d73: mov %r14,%r11 => 0x408d76: mov 0x98(%rsp),%r14
# reassignments # rbx -> r10 -> r9 -> r8 -> rdi -> rsi -> rdx -> rcx -> rbx => 0x408d7e: mov %rbx,0x90(%rsp) => 0x408d86: mov %rcx,%rbx => 0x408d89: mov %rdx,%rcx => 0x408d8c: mov %rsi,%rdx => 0x408d8f: mov %rdi,%rsi => 0x408d92: mov %r8,%rdi => 0x408d95: mov %r9,%r8 => 0x408d98: mov %r10,%r9 => 0x408d9b: mov 0x90(%rsp),%r10 . . . loop logic here which uses only %rax, %r10 and %r9 . . . . # _n4s8: # shuffle back to original assignments => 0x4090dc: mov %r14,%r11 => 0x4090df: mov %r9,%r10 => 0x4090e2: mov %r8,%r9 => 0x4090e5: mov %rdi,%r8 => 0x4090e8: mov %rsi,%rdi => 0x4090eb: mov %rdx,%rsi => 0x4090ee: mov %rcx,%rdx => 0x4090f1: mov %rbx,%rcx => 0x4090f4: mov %rax,%rbx => 0x4090f7: mov 0x88(%rsp),%rax
=> 0x4090ff: jmpq 0x408d2a
The registers seem to be getting reassigned here, data flowing from one to the next. In this particular path a lot of these register movements seem unnecessary and are only undone at the end without being used.
Maybe this is because these are reusable blocks and the movement is necessary when used in some other path?
Can this be avoided? Or at least avoided in a certain fast path somehow by hinting the compiler? Any pointers to the GHC code will be appreciated. I am not yet much familiar with the GHC code but can dig deeper pretty quickly. But before that I hope someone knowledgeable in this area can shed some light on this at a conceptual level or if at all it can be improved. I can provide more details and experiment more if needed.
Thanks, Harendra