Re: [GHC] #14619: Output value of program changes upon compiling with -O optimizations

15 Jan 2018

      #14619: Output value of program changes upon compiling with -O optimizations
-------------------------------------+-------------------------------------
        Reporter:  sheaf             |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  highest           |            Milestone:  8.4.1
       Component:  Compiler          |              Version:  8.2.2
      Resolution:                    |             Keywords:
Operating System:  Windows           |         Architecture:  x86_64
 Type of failure:  Incorrect result  |  (amd64)
  at runtime                         |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by Phyx-):

 I don't think it's a register allocation issue. I think it's a genuine bug
 in a Core2Core pass:

 Following the code from `sphereIntersection`, the first interesting
 location is `0x0000000000401E72` (it's all statically linked). At this
 address the first 6 doubles are stored to the stack:

 {{{
    0x401e72 :      movsd  %xmm1,-0x30(%rbp)
    0x401e77 :      movsd  %xmm2,-0x28(%rbp)
    0x401e7c :      movsd  %xmm3,-0x20(%rbp)
    0x401e81 :      movsd  %xmm4,-0x18(%rbp)
    0x401e86 :      movsd  %xmm5,-0x10(%rbp)
    0x401e8b :      movsd  %xmm6,-0x8(%rbp)
 }}}

 Here:

 {{{
 xmm1 = 0
 xmm2 = 0
 xmm3 = 0

 xmm4 = 1.1
 xmm5 = 2.2
 xmm6 = 3.3
 }}}

 So far so good.

 The first operation to be performed is `b  = oc <.> dir`. We already know
 `oc`, since `(<+>)` seems to have been inlined and folded away (I assume GHC
 does constant folding, since I can't find any code for this).

 So the code for `(<.>)` is at `0x0000000000401C14`:

 {{{
    0x401c14 :       movsd  0x10(%rbp),%xmm0    (= 200)
    0x401c19 :       addsd  %xmm3,%xmm0
    0x401c1d :       mulsd  %xmm6,%xmm0
    0x401c21 :       movsd  0x8(%rbp),%xmm6     (= 0)
    0x401c26 :       addsd  %xmm2,%xmm6
    0x401c2a :       mulsd  %xmm5,%xmm6
    0x401c2e :       movsd  0x0(%rbp),%xmm7     (= 0)
    0x401c33 :       addsd  %xmm1,%xmm7
    0x401c37 :       mulsd  %xmm4,%xmm7
    0x401c3b :       addsd  %xmm6,%xmm7
    0x401c3f :       addsd  %xmm0,%xmm7
 }}}

 So this performed `oc <.> dir` and `xmm7` now contains `b`. Also notice we
 clobbered `xmm6` here. It now contains `0`.
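
 As a cross-check of the arithmetic in that block, here is a minimal Haskell
 sketch of the dot product. The definitions of `(<+>)` and `(<.>)` are
 assumptions (the ticket's actual source is not reproduced here), `V3`,
 `stackConst`, `orig` and `dir` are hypothetical names, and the numbers are the
 ones read out of the registers and stack slots above.

```haskell
-- Hypothetical reconstruction; not the ticket's actual source.
type V3 = (Double, Double, Double)

-- Component-wise addition (assumed definition of <+>).
(<+>) :: V3 -> V3 -> V3
(a1, a2, a3) <+> (x1, x2, x3) = (a1 + x1, a2 + x2, a3 + x3)

-- Dot product (assumed definition of <.>).
(<.>) :: V3 -> V3 -> Double
(a1, a2, a3) <.> (x1, x2, x3) = a1 * x1 + a2 * x2 + a3 * x3

infixl 6 <+>
infixl 7 <.>

-- Values observed in the debugger.
stackConst, orig, dir :: V3
stackConst = (0, 0, 200)      -- 0x0(%rbp), 0x8(%rbp), 0x10(%rbp)
orig       = (0, 0, 0)        -- xmm1, xmm2, xmm3
dir        = (1.1, 2.2, 3.3)  -- xmm4, xmm5, xmm6

-- b as the assembly computes it: (stack constants + xmm1..3) . xmm4..6
b :: Double
b = (stackConst <+> orig) <.> dir
```

 Loading this into GHCi and evaluating `b` should give a value of about 660,
 which is consistent with the 660 visible in the stack dump further down.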

 The next thing we must do is calculate `sqrtDisc` and calculate `t1`.

 `t1` is computed at `0x0000000000401C9B`:

 {{{
    0x401c9b :      movsd  0x68(%rsp),%xmm1
    0x401ca1 :      movsd  %xmm1,%xmm2
    0x401ca5 :      subsd  %xmm0,%xmm2
    0x401ca9 :      xorpd  %xmm3,%xmm3
    0x401cad :      ucomisd %xmm3,%xmm2
    0x401cb1 :      ja     0x401cd8             (t1 > 0)
    0x401cb3 :      addsd  %xmm0,%xmm1
    0x401cb7 :      xorpd  %xmm0,%xmm0
    0x401cbb :      ucomisd %xmm0,%xmm1
    0x401cbf :      ja     0x401d9a             (t2 > 0)
 }}}

 We take the branch to `0x401cd8`, which is the `t1 > 0` case, and then must
 evaluate `(*>)`, which is at `0x0000000000401CD8`. `t1` is stored in `xmm2`.

 {{{
    0x401cd8 :      movq   $0x498cd8,-0x80(%r12)
    0x401ce1 :      movsd  %xmm6,-0x78(%r12)
    0x401ce8 :      movq   $0x498cd8,-0x70(%r12)
    0x401cf1 :      movsd  %xmm2,%xmm0
    0x401cf5 :      mulsd  %xmm6,%xmm0
    0x401cf9 :      movsd  %xmm0,-0x68(%r12)
    0x401d00 :      movq   $0x498cd8,-0x60(%r12)
    0x401d09 :      movsd  %xmm2,%xmm0
    0x401d0d :      movsd  0x60(%rsp),%xmm1
    0x401d13 :      mulsd  %xmm1,%xmm0
    0x401d17 :      movsd  %xmm0,-0x58(%r12)
    0x401d1e :      movq   $0x498cd8,-0x50(%r12)
    0x401d27 :      movsd  0x58(%rsp),%xmm0
    0x401d2d :      mulsd  %xmm0,%xmm2
    0x401d31 :      movsd  %xmm2,-0x48(%r12)
    0x401d38 :      movq   $0x498b18,-0x40(%r12)
 }}}

 Notice a couple of weird things here.

 `xmm6` is still clobbered and has no meaning, yet we still spill it but
 never load it again (that I could find).

 Then we do the multiplication of `a*x'` without ever restoring `x'`:

 {{{
     0x401cf5 :      mulsd  %xmm6,%xmm0
 }}}

 Weirdly, we then restore `y'` and `z'` which are stored at `0x60(%rsp)`
 and `0x58(%rsp)`.

 Inspecting `%rsp` I see `xmm6` (3.3) was never spilled to begin with.

 {{{
 0000000000B6DBB8                        0                       0
 0000000000B6DBC8                        0                     1.1
 0000000000B6DBD8                      2.2                     660
 }}}
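
 To make the effect on `(*>)` concrete, here is a small Haskell sketch
 contrasting the intended scaling with what the code at `0x401cd8` appears to
 do. The definition of `(*>)` is an assumption, `scaleWithStale` is a
 hypothetical helper that exists only for this illustration, and treating the
 first tuple component as `x'` follows the reading above rather than the
 ticket's source.

```haskell
import Prelude hiding ((*>))

type V3 = (Double, Double, Double)

-- Assumed definition of the ticket's scaling operator.
(*>) :: Double -> V3 -> V3
a *> (x, y, z) = (a * x, a * y, a * z)

-- Hypothetical model of the -O2 code: the first product reads whatever
-- value is left in xmm6 (the `stale` argument) instead of the real x',
-- while y' and z' are reloaded from the stack and are therefore correct.
scaleWithStale :: Double -> Double -> V3 -> V3
scaleWithStale stale a (_, y, z) = (a * stale, a * y, a * z)
```

 For any scale factor `t`, `scaleWithStale 0 t v` agrees with `t *> v` in the
 last two components, but its first component is `t * 0` rather than `t * x'`.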

 Now that we know what's happening, let's compare `-O0` and `-O2`.

 At `-O0` where it works, we have the following sequence for `(<.>)`:

 {{{
 .Ln4nu:
         movsd (%rbp),%xmm0
         movsd 8(%rbp),%xmm7
         movsd 16(%rbp),%xmm8
  ...
 .Ln4nw:
         addsd %xmm3,%xmm8
         mulsd %xmm6,%xmm8
         addsd %xmm2,%xmm7
         mulsd %xmm5,%xmm7
         addsd %xmm1,%xmm0
         mulsd %xmm4,%xmm0
         addsd %xmm7,%xmm0
         addsd %xmm8,%xmm0
         xorpd %xmm7,%xmm7
         ucomisd %xmm7,%xmm0
 }}}

 Notice that `xmm6` is not clobbered here.

 The `-O2` version is:

 {{{
         movsd 16(%rbp),%xmm0
         addsd %xmm3,%xmm0
         mulsd %xmm6,%xmm0
         movsd 8(%rbp),%xmm6
         addsd %xmm2,%xmm6
         mulsd %xmm5,%xmm6
         movsd (%rbp),%xmm7
         addsd %xmm1,%xmm7
         mulsd %xmm4,%xmm7
         addsd %xmm6,%xmm7
         addsd %xmm0,%xmm7
         xorpd %xmm0,%xmm0
         ucomisd %xmm0,%xmm7
 }}}
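
 The difference between the two sequences can be replayed with a toy register
 machine. This is only a sketch: it models just the `movsd` loads, `addsd` and
 `mulsd` from the listings (the trailing `xorpd`/`ucomisd` are left out), the
 stack slots are replaced by the constants observed in the debugger, and every
 name in it (`Reg`, `Instr`, `run`, `o0`, `o2`, ...) is hypothetical.

```haskell
-- Toy model of the two (<.>) sequences; all names are hypothetical.
data Reg = X0 | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8
  deriving (Eq, Show)

data Instr
  = Load Double Reg  -- movsd <stack slot holding this value>, %dst
  | Add Reg Reg      -- addsd %src, %dst
  | Mul Reg Reg      -- mulsd %src, %dst

-- A register file is a total function from register to contents.
type Regs = Reg -> Double

set :: Reg -> Double -> Regs -> Regs
set r v rs q = if q == r then v else rs q

step :: Regs -> Instr -> Regs
step rs (Load v d) = set d v rs
step rs (Add s d)  = set d (rs d + rs s) rs
step rs (Mul s d)  = set d (rs d * rs s) rs

run :: [Instr] -> Regs -> Regs
run instrs rs = foldl step rs instrs

-- Register contents on entry, as observed in the debugger.
initial :: Regs
initial r = case r of
  X4 -> 1.1
  X5 -> 2.2
  X6 -> 3.3
  _  -> 0

-- Stack slots (%rbp), 8(%rbp) and 16(%rbp) hold 0, 0 and 200.
o0, o2 :: [Instr]
o0 = [ Load 0 X0, Load 0 X7, Load 200 X8
     , Add X3 X8, Mul X6 X8
     , Add X2 X7, Mul X5 X7
     , Add X1 X0, Mul X4 X0
     , Add X7 X0, Add X8 X0 ]
o2 = [ Load 200 X0, Add X3 X0, Mul X6 X0
     , Load 0 X6,   Add X2 X6, Mul X5 X6
     , Load 0 X7,   Add X1 X7, Mul X4 X7
     , Add X6 X7,   Add X0 X7 ]
```

 In GHCi, `run o0 initial X6` and `run o2 initial X6` show what is left in
 `xmm6` after each sequence: the `-O0` sequence never writes to it, while the
 `-O2` sequence reuses it as a temporary. Both sequences compute the same `b`
 (in `xmm0` and `xmm7` respectively).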

 At `-O0`, because `xmm6` is not clobbered, the later spill of `xmm6` saves
 the correct value:

 {{{
 .Ln4o8:
         movl $1,%eax
         movsd %xmm1,104(%rsp)
         movsd %xmm2,112(%rsp)
         movsd %xmm3,120(%rsp)
         movsd %xmm4,128(%rsp)
         movsd %xmm5,136(%rsp)
         movsd %xmm6,144(%rsp)
         movsd %xmm8,152(%rsp)
 }}}

 Whereas `-O2` thinks it doesn't need the value and spills one register too
 few.

 {{{
 .Ln4os:
         movl $1,%eax
         movsd %xmm1,104(%rsp)
         movsd %xmm2,112(%rsp)
         movsd %xmm3,120(%rsp)
         movsd %xmm4,128(%rsp)
         movsd %xmm5,136(%rsp)
         movsd %xmm7,144(%rsp)
 }}}

 My guess is that at `-O2` it thinks it has enough registers to not need to
 spill `xmm6`, but it then later clobbers the register without spilling and
 reloading it!
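
 The same observation can be phrased as a simple set check: which of the
 registers holding inputs on entry are overwritten by each `(<.>)` sequence?
 The sketch below is an illustration only. The register lists are transcribed
 by hand from the listings above, and `inputs`, `writtenO0`, `writtenO2` and
 `clobberedInputs` are hypothetical names.

```haskell
import Data.List (intersect)

-- Registers holding the six input doubles on entry (from the debugger).
inputs :: [String]
inputs = ["xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6"]

-- Destination registers written by each (<.>) sequence, read off the
-- -O0 and -O2 listings.
writtenO0, writtenO2 :: [String]
writtenO0 = ["xmm0", "xmm7", "xmm8"]
writtenO2 = ["xmm0", "xmm6", "xmm7"]

-- Inputs whose original value is destroyed by a sequence.
clobberedInputs :: [String] -> [String]
clobberedInputs written = inputs `intersect` written
```

 `clobberedInputs writtenO0` is empty, whereas `clobberedInputs writtenO2`
 contains only `"xmm6"`, the one register that is also missing from the `-O2`
 spill list.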

 However I'm too tired to look at Core tonight, so I'll continue next week.
 I think it's a Core pass eliminating a value it shouldn't.

-- 
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/14619#comment:25
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler