redundant loads and saves in code generated for recursive functions?

Dear All, I am new to Haskell so please forgive me if I am asking about something already well-understood. I was trying to understand the performance of my Haskell program compiled with the LLVM backend. I used -ddump-llvm to dump the LLVM assembly and then ran llc -O3 on the resulting file to look at the native assembly. One of the generated function starts off with s5BH_info: # @s5BH_info # BB#0: subq $208, %rsp movq %r13, 200(%rsp) movq %rbp, 192(%rsp) movq %r12, 184(%rsp) movq %rbx, 176(%rsp) movq %r14, 168(%rsp) movq %rsi, 160(%rsp) movq %rdi, 152(%rsp) movq %r8, 144(%rsp) movq %r9, 136(%rsp) movq %r15, 128(%rsp) movss %xmm1, 124(%rsp) movss %xmm2, 120(%rsp) movss %xmm3, 116(%rsp) movss %xmm4, 112(%rsp) movsd %xmm5, 104(%rsp) movsd %xmm6, 96(%rsp) At some point down the line the function makes a tail call to itself and this is the code generated movq %r14, 168(%rsp) movq 200(%rsp), %r13 movq 192(%rsp), %rbp movq 184(%rsp), %r12 movq 176(%rsp), %rbx movq 128(%rsp), %r15 movsd 104(%rsp), %xmm5 addq $208, %rsp jmp s5BH_info So it looks like some values are being moved from registers to the stack only to be immediately moved from the stack to the register on entry to the function. It should be possible to eliminate both the load and the stores. Is this behaviour due to LLVM or GHC? If it is GHC, it this an optimization a newcomer can attempt to implement or are there deep issues here? Jyotirmoy Bhattacharya

On reading this again I realise that I got the order of loads and stores wrong. The arguments are being stored on entering the function and loaded before the call. But still, is there a chance of eliminating this redundancy? Jyotirmoy On Wed, Jul 30, 2014 at 1:54 PM, Jyotirmoy Bhattacharya < jyotirmoy@jyotirmoy.net> wrote:
Dear All,
I am new to Haskell so please forgive me if I am asking about something already well-understood.
I was trying to understand the performance of my Haskell program compiled with the LLVM backend. I used -ddump-llvm to dump the LLVM assembly and then ran llc -O3 on the resulting file to look at the native assembly.
One of the generated function starts off with s5BH_info: # @s5BH_info # BB#0: subq $208, %rsp movq %r13, 200(%rsp) movq %rbp, 192(%rsp) movq %r12, 184(%rsp) movq %rbx, 176(%rsp) movq %r14, 168(%rsp) movq %rsi, 160(%rsp) movq %rdi, 152(%rsp) movq %r8, 144(%rsp) movq %r9, 136(%rsp) movq %r15, 128(%rsp) movss %xmm1, 124(%rsp) movss %xmm2, 120(%rsp) movss %xmm3, 116(%rsp) movss %xmm4, 112(%rsp) movsd %xmm5, 104(%rsp) movsd %xmm6, 96(%rsp)
At some point down the line the function makes a tail call to itself and this is the code generated movq %r14, 168(%rsp) movq 200(%rsp), %r13 movq 192(%rsp), %rbp movq 184(%rsp), %r12 movq 176(%rsp), %rbx movq 128(%rsp), %r15 movsd 104(%rsp), %xmm5 addq $208, %rsp jmp s5BH_info
So it looks like some values are being moved from registers to the stack only to be immediately moved from the stack to the register on entry to the function. It should be possible to eliminate both the load and the stores.
Is this behaviour due to LLVM or GHC? If it is GHC, it this an optimization a newcomer can attempt to implement or are there deep issues here?
Jyotirmoy Bhattacharya

Hi Jyotirmoy,
I didn't read your assembly carefully, but it sounds similar to
https://ghc.haskell.org/trac/ghc/ticket/8905, which is not fixed yet.
On Wed, Jul 30, 2014 at 12:03 PM, Jyotirmoy Bhattacharya
On reading this again I realise that I got the order of loads and stores wrong. The arguments are being stored on entering the function and loaded before the call. But still, is there a chance of eliminating this redundancy?
Jyotirmoy
On Wed, Jul 30, 2014 at 1:54 PM, Jyotirmoy Bhattacharya
wrote: Dear All,
I am new to Haskell so please forgive me if I am asking about something already well-understood.
I was trying to understand the performance of my Haskell program compiled with the LLVM backend. I used -ddump-llvm to dump the LLVM assembly and then ran llc -O3 on the resulting file to look at the native assembly.
One of the generated function starts off with s5BH_info: # @s5BH_info # BB#0: subq $208, %rsp movq %r13, 200(%rsp) movq %rbp, 192(%rsp) movq %r12, 184(%rsp) movq %rbx, 176(%rsp) movq %r14, 168(%rsp) movq %rsi, 160(%rsp) movq %rdi, 152(%rsp) movq %r8, 144(%rsp) movq %r9, 136(%rsp) movq %r15, 128(%rsp) movss %xmm1, 124(%rsp) movss %xmm2, 120(%rsp) movss %xmm3, 116(%rsp) movss %xmm4, 112(%rsp) movsd %xmm5, 104(%rsp) movsd %xmm6, 96(%rsp)
At some point down the line the function makes a tail call to itself and this is the code generated movq %r14, 168(%rsp) movq 200(%rsp), %r13 movq 192(%rsp), %rbp movq 184(%rsp), %r12 movq 176(%rsp), %rbx movq 128(%rsp), %r15 movsd 104(%rsp), %xmm5 addq $208, %rsp jmp s5BH_info
So it looks like some values are being moved from registers to the stack only to be immediately moved from the stack to the register on entry to the function. It should be possible to eliminate both the load and the stores.
Is this behaviour due to LLVM or GHC? If it is GHC, it this an optimization a newcomer can attempt to implement or are there deep issues here?
Jyotirmoy Bhattacharya
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Hi,
It doesn't seem the same to me.
Unlike the bug you point to, the C-- does not have any extra stores. The
stores and loads appear first in the LLVM. I am attaching the C--, LLVM and
assembly codes for the function.
The real missed opportunity seems to me the absence of a recognition that
we are in fact making a tail call to ourselves. Recognizing that might
allow jumping to some point after the initial stores.
Jyotirmoy Bhattacharya
On Wed, Jul 30, 2014 at 4:14 PM, Johan Tibell
Hi Jyotirmoy,
I didn't read your assembly carefully, but it sounds similar to https://ghc.haskell.org/trac/ghc/ticket/8905, which is not fixed yet.
On reading this again I realise that I got the order of loads and stores wrong. The arguments are being stored on entering the function and loaded before the call. But still, is there a chance of eliminating this redundancy?
Jyotirmoy
On Wed, Jul 30, 2014 at 1:54 PM, Jyotirmoy Bhattacharya
wrote: Dear All,
I am new to Haskell so please forgive me if I am asking about something already well-understood.
I was trying to understand the performance of my Haskell program
compiled
with the LLVM backend. I used -ddump-llvm to dump the LLVM assembly and
ran llc -O3 on the resulting file to look at the native assembly.
One of the generated function starts off with s5BH_info: # @s5BH_info # BB#0: subq $208, %rsp movq %r13, 200(%rsp) movq %rbp, 192(%rsp) movq %r12, 184(%rsp) movq %rbx, 176(%rsp) movq %r14, 168(%rsp) movq %rsi, 160(%rsp) movq %rdi, 152(%rsp) movq %r8, 144(%rsp) movq %r9, 136(%rsp) movq %r15, 128(%rsp) movss %xmm1, 124(%rsp) movss %xmm2, 120(%rsp) movss %xmm3, 116(%rsp) movss %xmm4, 112(%rsp) movsd %xmm5, 104(%rsp) movsd %xmm6, 96(%rsp)
At some point down the line the function makes a tail call to itself and this is the code generated movq %r14, 168(%rsp) movq 200(%rsp), %r13 movq 192(%rsp), %rbp movq 184(%rsp), %r12 movq 176(%rsp), %rbx movq 128(%rsp), %r15 movsd 104(%rsp), %xmm5 addq $208, %rsp jmp s5BH_info
So it looks like some values are being moved from registers to the stack only to be immediately moved from the stack to the register on entry to
On Wed, Jul 30, 2014 at 12:03 PM, Jyotirmoy Bhattacharya
wrote: then the function. It should be possible to eliminate both the load and the stores.
Is this behaviour due to LLVM or GHC? If it is GHC, it this an optimization a newcomer can attempt to implement or are there deep issues here?
Jyotirmoy Bhattacharya
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

I see. I think this is worthwhile to file a bug for. Please include
the cmm, asm, and llvm as attachedment (and perhaps a short Haskell
program that show the issue).
On Wed, Jul 30, 2014 at 2:10 PM, Jyotirmoy Bhattacharya
Hi,
It doesn't seem the same to me.
Unlike the bug you point to, the C-- does not have any extra stores. The stores and loads appear first in the LLVM. I am attaching the C--, LLVM and assembly codes for the function.
The real missed opportunity seems to me the absence of a recognition that we are in fact making a tail call to ourselves. Recognizing that might allow jumping to some point after the initial stores.
Jyotirmoy Bhattacharya
On Wed, Jul 30, 2014 at 4:14 PM, Johan Tibell
wrote: Hi Jyotirmoy,
I didn't read your assembly carefully, but it sounds similar to https://ghc.haskell.org/trac/ghc/ticket/8905, which is not fixed yet.
On Wed, Jul 30, 2014 at 12:03 PM, Jyotirmoy Bhattacharya
wrote: On reading this again I realise that I got the order of loads and stores wrong. The arguments are being stored on entering the function and loaded before the call. But still, is there a chance of eliminating this redundancy?
Jyotirmoy
On Wed, Jul 30, 2014 at 1:54 PM, Jyotirmoy Bhattacharya
wrote: Dear All,
I am new to Haskell so please forgive me if I am asking about something already well-understood.
I was trying to understand the performance of my Haskell program compiled with the LLVM backend. I used -ddump-llvm to dump the LLVM assembly and then ran llc -O3 on the resulting file to look at the native assembly.
One of the generated function starts off with s5BH_info: # @s5BH_info # BB#0: subq $208, %rsp movq %r13, 200(%rsp) movq %rbp, 192(%rsp) movq %r12, 184(%rsp) movq %rbx, 176(%rsp) movq %r14, 168(%rsp) movq %rsi, 160(%rsp) movq %rdi, 152(%rsp) movq %r8, 144(%rsp) movq %r9, 136(%rsp) movq %r15, 128(%rsp) movss %xmm1, 124(%rsp) movss %xmm2, 120(%rsp) movss %xmm3, 116(%rsp) movss %xmm4, 112(%rsp) movsd %xmm5, 104(%rsp) movsd %xmm6, 96(%rsp)
At some point down the line the function makes a tail call to itself and this is the code generated movq %r14, 168(%rsp) movq 200(%rsp), %r13 movq 192(%rsp), %rbp movq 184(%rsp), %r12 movq 176(%rsp), %rbx movq 128(%rsp), %r15 movsd 104(%rsp), %xmm5 addq $208, %rsp jmp s5BH_info
So it looks like some values are being moved from registers to the stack only to be immediately moved from the stack to the register on entry to the function. It should be possible to eliminate both the load and the stores.
Is this behaviour due to LLVM or GHC? If it is GHC, it this an optimization a newcomer can attempt to implement or are there deep issues here?
Jyotirmoy Bhattacharya
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

I have filed a bug
https://ghc.haskell.org/trac/ghc/ticket/9387
On Wed, Jul 30, 2014 at 6:03 PM, Johan Tibell
I see. I think this is worthwhile to file a bug for. Please include the cmm, asm, and llvm as attachedment (and perhaps a short Haskell program that show the issue).
Hi,
It doesn't seem the same to me.
Unlike the bug you point to, the C-- does not have any extra stores. The stores and loads appear first in the LLVM. I am attaching the C--, LLVM and assembly codes for the function.
The real missed opportunity seems to me the absence of a recognition
On Wed, Jul 30, 2014 at 2:10 PM, Jyotirmoy Bhattacharya
wrote: that we are in fact making a tail call to ourselves. Recognizing that might allow jumping to some point after the initial stores.
Jyotirmoy Bhattacharya
On Wed, Jul 30, 2014 at 4:14 PM, Johan Tibell
wrote: Hi Jyotirmoy,
I didn't read your assembly carefully, but it sounds similar to https://ghc.haskell.org/trac/ghc/ticket/8905, which is not fixed yet.
On Wed, Jul 30, 2014 at 12:03 PM, Jyotirmoy Bhattacharya
wrote: On reading this again I realise that I got the order of loads and
stores
wrong. The arguments are being stored on entering the function and loaded before the call. But still, is there a chance of eliminating this redundancy?
Jyotirmoy
On Wed, Jul 30, 2014 at 1:54 PM, Jyotirmoy Bhattacharya
wrote: Dear All,
I am new to Haskell so please forgive me if I am asking about
something
already well-understood.
I was trying to understand the performance of my Haskell program compiled with the LLVM backend. I used -ddump-llvm to dump the LLVM assembly and then ran llc -O3 on the resulting file to look at the native assembly.
One of the generated function starts off with s5BH_info: # @s5BH_info # BB#0: subq $208, %rsp movq %r13, 200(%rsp) movq %rbp, 192(%rsp) movq %r12, 184(%rsp) movq %rbx, 176(%rsp) movq %r14, 168(%rsp) movq %rsi, 160(%rsp) movq %rdi, 152(%rsp) movq %r8, 144(%rsp) movq %r9, 136(%rsp) movq %r15, 128(%rsp) movss %xmm1, 124(%rsp) movss %xmm2, 120(%rsp) movss %xmm3, 116(%rsp) movss %xmm4, 112(%rsp) movsd %xmm5, 104(%rsp) movsd %xmm6, 96(%rsp)
At some point down the line the function makes a tail call to itself and this is the code generated movq %r14, 168(%rsp) movq 200(%rsp), %r13 movq 192(%rsp), %rbp movq 184(%rsp), %r12 movq 176(%rsp), %rbx movq 128(%rsp), %r15 movsd 104(%rsp), %xmm5 addq $208, %rsp jmp s5BH_info
So it looks like some values are being moved from registers to the stack only to be immediately moved from the stack to the register on entry to the function. It should be possible to eliminate both the load and the stores.
Is this behaviour due to LLVM or GHC? If it is GHC, it this an optimization a newcomer can attempt to implement or are there deep issues here?
Jyotirmoy Bhattacharya
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Interestingly enough, if you run ‘opt’ on the llvm code before compiling it with ‘llc’, you get the optimised code you would expect. On Thursday 31 July 2014 at 01:37, Jyotirmoy Bhattacharya wrote:
I have filed a bug https://ghc.haskell.org/trac/ghc/ticket/9387
On Wed, Jul 30, 2014 at 6:03 PM, Johan Tibell
wrote: I see. I think this is worthwhile to file a bug for. Please include the cmm, asm, and llvm as attachedment (and perhaps a short Haskell program that show the issue).
On Wed, Jul 30, 2014 at 2:10 PM, Jyotirmoy Bhattacharya
wrote: Hi,
It doesn't seem the same to me.
Unlike the bug you point to, the C-- does not have any extra stores. The stores and loads appear first in the LLVM. I am attaching the C--, LLVM and assembly codes for the function.
The real missed opportunity seems to me the absence of a recognition that we are in fact making a tail call to ourselves. Recognizing that might allow jumping to some point after the initial stores.
Jyotirmoy Bhattacharya
On Wed, Jul 30, 2014 at 4:14 PM, Johan Tibell
wrote: Hi Jyotirmoy,
I didn't read your assembly carefully, but it sounds similar to https://ghc.haskell.org/trac/ghc/ticket/8905, which is not fixed yet.
On Wed, Jul 30, 2014 at 12:03 PM, Jyotirmoy Bhattacharya
wrote: On reading this again I realise that I got the order of loads and stores wrong. The arguments are being stored on entering the function and loaded before the call. But still, is there a chance of eliminating this redundancy?
Jyotirmoy
On Wed, Jul 30, 2014 at 1:54 PM, Jyotirmoy Bhattacharya
wrote: Dear All,
I am new to Haskell so please forgive me if I am asking about something already well-understood.
I was trying to understand the performance of my Haskell program compiled with the LLVM backend. I used -ddump-llvm to dump the LLVM assembly and then ran llc -O3 on the resulting file to look at the native assembly.
One of the generated function starts off with s5BH_info: # @s5BH_info # BB#0: subq $208, %rsp movq %r13, 200(%rsp) movq %rbp, 192(%rsp) movq %r12, 184(%rsp) movq %rbx, 176(%rsp) movq %r14, 168(%rsp) movq %rsi, 160(%rsp) movq %rdi, 152(%rsp) movq %r8, 144(%rsp) movq %r9, 136(%rsp) movq %r15, 128(%rsp) movss %xmm1, 124(%rsp) movss %xmm2, 120(%rsp) movss %xmm3, 116(%rsp) movss %xmm4, 112(%rsp) movsd %xmm5, 104(%rsp) movsd %xmm6, 96(%rsp)
At some point down the line the function makes a tail call to itself and this is the code generated movq %r14, 168(%rsp) movq 200(%rsp), %r13 movq 192(%rsp), %rbp movq 184(%rsp), %r12 movq 176(%rsp), %rbx movq 128(%rsp), %r15 movsd 104(%rsp), %xmm5 addq $208, %rsp jmp s5BH_info
So it looks like some values are being moved from registers to the stack only to be immediately moved from the stack to the register on entry to the function. It should be possible to eliminate both the load and the stores.
Is this behaviour due to LLVM or GHC? If it is GHC, it this an optimization a newcomer can attempt to implement or are there deep issues here?
Jyotirmoy Bhattacharya
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org (mailto:Haskell-Cafe@haskell.org) http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org (mailto:Haskell-Cafe@haskell.org) http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (3)
-
Johan Tibell
-
Jyotirmoy Bhattacharya
-
Lemmih