[GHC] #13629: sqrt should use machine instruction on x86_64

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler | Version: 8.0.1 (NCG) | Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: Runtime Unknown/Multiple | performance bug Test Case: | Blocked By: Blocking: | Related Tickets: Differential Rev(s): Phab:D3265 | Wiki Page: -------------------------------------+------------------------------------- The NCG currently implements the square root MachOp with an out-of-line call. This resulted in a large performance regression in `n-body` (see #13570). Fix this. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * differential: Phab:D3265 => Phab:D3508 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * related: => #13570 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj):
The NCG currently implements the square root MachOp with an out-of-line call
Where is this choice made? E.g. I think the `sin` `MachOp` really generates a `sin` instruction. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): It's made in the native code generator, `genCCall` in `compiler/nativeGen/X86/CodeGen.hs`. While we use the `fsin` instruction on i386, we don't on x86_64 (and i386 with `-msse2`). See Phab:D3508 for a quick attempt at fixing the `sqrt` situation. In the case of trigonometric functions I'm afraid we have little choice but to fall back on `libm` since SSE provides no trig instructions. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Description changed by kavon: @@ -4,0 +4,3 @@ + + See discussion [https://mail.haskell.org/pipermail/ghc- + devs/2017-April/014168.html here] as well. New description: The NCG currently implements the square root MachOp with an out-of-line call. This resulted in a large performance regression in `n-body` (see #13570). Fix this. See discussion [https://mail.haskell.org/pipermail/ghc- devs/2017-April/014168.html here] as well. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | numeric/num009 Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * testcase: => numeric/num009 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64
-------------------------------------+-------------------------------------
Reporter: bgamari | Owner: (none)
Type: bug | Status: new
Priority: normal | Milestone: 8.4.1
Component: Compiler (NCG) | Version: 8.0.1
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: Runtime | Test Case:
performance bug | numeric/num009
Blocked By: | Blocking:
Related Tickets: #13570 | Differential Rev(s): Phab:D3508
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by Ben Gamari

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: closed Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | numeric/num009 Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * status: new => closed * resolution: => fixed -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: closed Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | numeric/num009 Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Comment (by kavon): Replying to [comment:4 bgamari]:
It's made in the native code generator, `genCCall` in `compiler/nativeGen/X86/CodeGen.hs`. While we use the `fsin` instruction on i386, we don't on x86_64 (and i386 with `-msse2`).
If there is already x87 FPU instruction support in the NCG for x86-32, it might be profitable to reuse that support for x86-64 to speed up trig functions, etc. The simplest way I see it is to expand the foreign call into an instruction sequence that moves the float from XMM registers to the x87 stack, computes the value, and moves it back to XMM registers. This way we no longer have a C call in a potentially bad place. It's worth comparing x87 on modern processors against the assembly routine backing the C function first. It seems platforms like Skylake the x87 `fsin` takes 50-120 cycles [1], but I'm not sure about the library versions. If they're roughly equivalent, there's likely a benefit to eliding the C call. [1] http://www.agner.org/optimize/instruction_tables.pdf (Page 223 for Skylake x87) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: closed Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | numeric/num009 Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): I think I would prefer to make the C call cheaper (by, for instance, marking it as pure so that the sinker can better optimize code around it) than to bring x87 into the fold. x87's `fsin` is terribly imprecise (and consequently causes the `num009` test to fail on platforms where it is used). Moreover, as you point out, x87 performance varies widely between microarchitectures. Finally, x87's architecture is really a nuisance for the native codegen, requiring a number of hacks. It would be nice if some day we could do away with x87 support entirely (although this likely won't happen any time soon). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64
-------------------------------------+-------------------------------------
Reporter: bgamari | Owner: (none)
Type: bug | Status: closed
Priority: normal | Milestone: 8.4.1
Component: Compiler (NCG) | Version: 8.0.1
Resolution: fixed | Keywords:
Operating System: Unknown/Multiple | Architecture:
| Unknown/Multiple
Type of failure: Runtime | Test Case:
performance bug | numeric/num009
Blocked By: | Blocking:
Related Tickets: #13570 | Differential Rev(s): Phab:D3508
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by bgamari):
For the record, I did a quick, rather unscientific, comparison of SSE2 and
x87,
{{{#!c++
#include

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: closed Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: fixed | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | numeric/num009 Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Comment (by kavon): This is really interesting, thanks for looking into it! I'm not sure whether there are real programs that will benefit if we inline this library call by using x87 instructions. I'll leave that to someone who has a better intuition about whether there are graphics/mathematics programs that would appreciate the attention. We're at least no worse than C/C++ in this regard, though we probably pay a slightly higher call-site penalty due to a register mismatch between C and GHC conventions. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | numeric/num009 Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Changes (by dfeuer): * status: closed => new * resolution: fixed => Comment: This made `n-body` ''slower'' and did not appreciably affect other benchmarks. It would be really good to figure out what's going on here. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | numeric/num009 Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Comment (by dfeuer): https://perf.haskell.org/ghc/#revision/9ac22183e405773ea7147728e593edd78f30a... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | numeric/num009 Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Comment (by nomeata): If you have reasons to doubt the results, better try to reproduce them on your machine. I see performance cliffs of that size on perf.h.o that are sometimes triggered by very insignificant changes to the runtime system, for example. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:15 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | numeric/num009 Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): I've reproduced the regression locally. While it's hard to measure the runtime difference with the default test configuration, if you increase the argument from 3000 to 10000 it becomes quite clear. Before the patch the runtime is 1.14 seconds, after it's 1.30 seconds. This is quite surprising since there are a few bits of the generated code that get rather significantly shorter. Namely, {{{#!asm ... subq $8,%rsp movsd %xmm0,%xmm4 movss %xmm6,%xmm0 mulss %xmm6,%xmm0 mulss %xmm6,%xmm0 movq %rax,%rbx movl $1,%eax movq %rcx,%r14 movq %rdx,72(%rsp) movq %rsi,80(%rsp) movq %rdi,88(%rsp) movsd %xmm4,96(%rsp) movsd %xmm1,104(%rsp) movsd %xmm2,112(%rsp) movsd %xmm3,120(%rsp) call sqrtf addq $8,%rsp movss _n8uc(%rip),%xmm1 divss %xmm0,%xmm1 movss %xmm1,%xmm0 movsd 112(%rsp),%xmm2 mulss %xmm2,%xmm0 movss %xmm1,%xmm2 movsd 104(%rsp),%xmm3 mulss %xmm3,%xmm2 movsd 96(%rsp),%xmm3 mulss %xmm3,%xmm1 movq 64(%rsp),%rax leaq 1(%rax),%rcx }}} Whereas after we get, {{{#!asm ... movss %xmm6,%xmm4 mulss %xmm6,%xmm4 mulss %xmm6,%xmm4 sqrtss %xmm4,%xmm4 movss _n8ud(%rip),%xmm5 divss %xmm4,%xmm5 movss %xmm5,%xmm4 mulss %xmm3,%xmm4 movss %xmm5,%xmm3 mulss %xmm2,%xmm3 mulss %xmm1,%xmm5 leaq 1(%rdx),%rbx }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | numeric/num009 Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): So `perf stat` reveals a rather interesting picture. With the patch we retire few instructions (9,146,722,334 vs. 12,346,134,565), perform far fewer branches (908,661,442 compared to 1,308,583,802 without, the predictor missing around 0.01% in both cases), yet see a significantly higher runtime. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | numeric/num009 Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): Perhaps I'll just start a table of the PMU counters before and after, ||= Metric =||= Before patch =||= After patch =|| || instructions || 12,346,134,565 || 9,146,722,334 || || branches || 1,308,583,802 || 908,661,442 || || branch predict misses || || || || cache-misses || 83,983 || 57,490 || || L1-icache-load-misses || 138,869 || 156,777 || -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | numeric/num009 Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): For the record, `sqrtf` is the following on my machine, {{{#!asm 0x00007ffff78892c0 <+0>: pxor %xmm1,%xmm1 0x00007ffff78892c4 <+4>: ucomiss %xmm0,%xmm1 0x00007ffff78892c7 <+7>: ja 0x7ffff78892d0 <__sqrtf+16> 0x00007ffff78892c9 <+9>: sqrtss %xmm0,%xmm0 0x00007ffff78892cd <+13>: retq 0x00007ffff78892ce <+14>: xchg %ax,%ax 0x00007ffff78892d0 <+16>: mov 0x2cccf9(%rip),%rax # 0x7ffff7b55fd0 0x00007ffff78892d7 <+23>: cmpl $0xffffffff,(%rax) 0x00007ffff78892da <+26>: je 0x7ffff78892c9 <__sqrtf+9> 0x00007ffff78892dc <+28>: movaps %xmm0,%xmm1 0x00007ffff78892df <+31>: mov $0x7e,%edi 0x00007ffff78892e4 <+36>: jmpq 0x7ffff788ed50 <__kernel_standard_f> }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | numeric/num009 Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Comment (by kavon): The verbose lowering of sqrt that was there before this patch must have stumbled upon some sort of lucky interaction in the processor (arithmetic vs memory unit ports) in that particular hot patch of code in n-body, which is not something that can be relied on in the grand scheme of things. What we ''can'' tell is that this patch produces better code, more closely matching what LLVM produces as well. bgamari has also kindly quantified how much better it is. Thus, I think this ~4% drop in n-body is not something to worry about, as it seems even non-functional changes can knock a few nofib runtime numbers by that amount [1]. Ticket #13570 is also a separate issue: I noticed the poor sqrt lowering when I was investigating that ticket. [1] https://perf.haskell.org/ghc/#revision/945c45ad50ed31e3acb96fdaafb21640c4669... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13629: sqrt should use machine instruction on x86_64 -------------------------------------+------------------------------------- Reporter: bgamari | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.4.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | numeric/num009 Blocked By: | Blocking: Related Tickets: #13570 | Differential Rev(s): Phab:D3508 Wiki Page: | -------------------------------------+------------------------------------- Changes (by michalt): * cc: michalt (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13629#comment:21 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC