[GHC] #9342: Branchless arithmetic operations

#9342: Branchless arithmetic operations -------------------------------------+------------------------------------- Reporter: hvr | Owner: Type: feature request | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler | Version: 7.8.3 (CodeGen) | Differential Revisions: Keywords: | Architecture: Operating System: Unknown/Multiple | Unknown/Multiple Type of failure: None/Unknown | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- While working on #9281 I noticed sub-optimal code generation for common arithmetic operations on `Int`. Below are some examples of code generation, where the branchless versions may be desirable on modern CPUs: = `abs` {{{#!hs -- base version absI# :: Int# -> Int# absI# i# = i'# where !(I# i'#) = abs (I# i#) -- optimized version optAbsI# :: Int# -> Int# optAbsI# i# = (i# `xorI#` nsign) -# nsign where -- nsign = i# <# 0# nsign = uncheckedIShiftRA# i# (WORD_SIZE_IN_BITS# -# 1#) }}} results in {{{#!asm absI#_info: _c1To: testq %r14,%r14 jge _c1Tx _c1Ty: movq %r14,%rbx negq %rbx jmp *(%rbp) _c1Tx: movq %r14,%rbx jmp *(%rbp) optAbsI#_info: _c1SX: movq %r14,%rax sarq $63,%rax movq %r14,%rbx xorq %rax,%rbx subq %rax,%rbx jmp *(%rbp) }}} = `signum` {{{#!hs sgnI# :: Int# -> Int# sgnI# i# = i'# where !(I# i'#) = signum (I# i#) optSgnI# :: Int# -> Int# optSgnI# x# = (x# ># 0#) -# (x# <# 0#) }}} {{{#!asm sgnI#_info: _c27W: testq %r14,%r14 jl _c283 _c284: testq %r14,%r14 jne _c28a _c28b: xorl %ebx,%ebx jmp *(%rbp) _c283: movq $-1,%rbx jmp *(%rbp) _c28a: movl $1,%ebx jmp *(%rbp) optSgnI#_info: _c26P: testq %r14,%r14 setl %al movzbl %al,%eax testq %r14,%r14 setg %bl movzbl %bl,%ebx subq %rax,%rbx jmp *(%rbp) }}} = `compare` {{{#!hs cmpI# :: Int# -> Int# -> Int# cmpI# x# y# = dataToTag# (compare (I# x#) (I# y#)) -- returns 0, 1 or 2, can be fed to tagToEnum# :: Ordering optCmpI# :: Int# -> Int# -> Int# optCmpI# x# y# = 1# +# (x# ># y#) -# (x# <# y#) }}} {{{#!asm cmpI#_info: _c25s: cmpq %rsi,%r14 jl _c25z _c25A: cmpq %rsi,%r14 je _c25M _c25N: movl $2,%ebx jmp *(%rbp) _c25z: xorl %ebx,%ebx jmp *(%rbp) _c25M: movl $1,%ebx jmp *(%rbp) optCmpI#_info: _c24N: cmpq %rsi,%r14 setl %al movzbl %al,%eax movl $1,%ebx subq %rax,%rbx cmpq %rsi,%r14 setg %al movq %rbx,%rcx movzbl %al,%ebx addq %rcx,%rbx jmp *(%rbp) }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9342 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9342: Branchless arithmetic operations -------------------------------------+------------------------------------- Reporter: hvr | Owner: Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.8.3 Component: Compiler | Keywords: (CodeGen) | Operating System: Unknown/Multiple Resolution: | Type of failure: None/Unknown Differential Revisions: | Test Case: Architecture: | Blocking: Unknown/Multiple | Difficulty: Unknown | Blocked By: | Related Tickets: | -------------------------------------+------------------------------------- Comment (by schyler): +1, but please benchmark. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9342#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9342: Branchless arithmetic operations -------------------------------------+------------------------------------- Reporter: hvr | Owner: Type: feature | Status: new request | Milestone: 7.10.1 Priority: normal | Version: 7.8.3 Component: Compiler | Keywords: (CodeGen) | Architecture: Unknown/Multiple Resolution: | Difficulty: Unknown Operating System: | Blocked By: Unknown/Multiple | Related Tickets: Type of failure: | None/Unknown | Test Case: | Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Comment (by jstolarek): Same here: +1 but benchmark. Based on my experience from #6135 I don't expect to see much performance improvement. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9342#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9342: Branchless arithmetic operations -------------------------------------+------------------------------------- Reporter: hvr | Owner: Type: feature request | Status: new Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.8.3 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): -------------------------------------+------------------------------------- Changes (by strake888): * Attachment "bench.hs" added. benchmark code -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9342 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9342: Branchless arithmetic operations -------------------------------------+------------------------------------- Reporter: hvr | Owner: Type: feature request | Status: new Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.8.3 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): -------------------------------------+------------------------------------- Comment (by strake888): On my AMD FX-8320, branchless `abs` and `sgn` seem speedier but branchless `cmp` seems slower: {{{ $ ./bench 'abs ' benchmarking abs time 37.06 ms (36.60 ms .. 37.56 ms) 1.000 R² (0.999 R² .. 1.000 R²) mean 37.99 ms (37.68 ms .. 38.37 ms) std dev 676.9 μs (490.1 μs .. 926.6 μs) $ ./bench 'abs''' benchmarking abs' time 34.41 ms (33.97 ms .. 34.77 ms) 1.000 R² (0.999 R² .. 1.000 R²) mean 34.08 ms (33.95 ms .. 34.25 ms) std dev 307.3 μs (243.0 μs .. 418.2 μs) $ ./bench 'sgn ' benchmarking sgn time 36.70 ms (36.33 ms .. 37.10 ms) 1.000 R² (0.999 R² .. 1.000 R²) mean 36.93 ms (36.79 ms .. 37.03 ms) std dev 216.7 μs (127.8 μs .. 343.1 μs) $ ./bench 'sgn''' benchmarking sgn' time 34.60 ms (34.24 ms .. 34.92 ms) 1.000 R² (0.999 R² .. 1.000 R²) mean 34.18 ms (34.02 ms .. 34.36 ms) std dev 348.1 μs (263.4 μs .. 461.5 μs) $ ./bench 'cmp ' benchmarking cmp time 35.80 ms (35.04 ms .. 36.80 ms) 0.998 R² (0.996 R² .. 0.999 R²) mean 36.79 ms (36.32 ms .. 37.15 ms) std dev 875.3 μs (665.6 μs .. 1.178 ms) $ ./bench 'cmp''' benchmarking cmp' time 36.41 ms (35.72 ms .. 37.08 ms) 0.999 R² (0.998 R² .. 1.000 R²) mean 36.44 ms (35.96 ms .. 36.74 ms) std dev 726.9 μs (513.8 μs .. 1.044 ms) $ }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9342#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9342: Branchless arithmetic operations -------------------------------------+------------------------------------- Reporter: hvr | Owner: Type: feature request | Status: new Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.8.3 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): -------------------------------------+------------------------------------- Comment (by svenpanne): I'd like to point out my previous email on a similar topic: https://mail.haskell.org/pipermail/ghc-devs/2015-April/008858.html In a nutshell: It's far from clear that code with branches is "sub- optimal", toy benchmarks are totally meaningless (they don't stress e.g. the very limited number of shifting units in CPUs), and benchmarks on a single (micro-)architecture are even less meaningful (what's good for one HW makes things worse on another HW). Applying rules of thumb from previous decades doesn't really work anymore today, this has already been learned the hard way in several other compilers/JITs, I just wanted to warn about this again... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9342#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9342: Branchless arithmetic operations -------------------------------------+------------------------------------- Reporter: hvr | Owner: Type: feature request | Status: new Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.8.3 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): -------------------------------------+------------------------------------- Comment (by tibbe): As a data point the hashtables package makes effective use of branchless code in practice: https://github.com/gregorycollins/hashtables/search?utf8=%E2%9C%93&q=branchless -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9342#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9342: Branchless arithmetic operations -------------------------------------+------------------------------------- Reporter: hvr | Owner: Type: feature request | Status: new Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.8.3 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): -------------------------------------+------------------------------------- Comment (by svenpanne): Re #7: Do we have real-world benchmarks for Atom/i3/i7/Xeon in ia32/x64 modes? ARM? ARM64? With and without branches? Without that, it's unclear if that's "effective use". My point (again) is: Being branchless in itself is a non-goal. Picking e.g. just https://github.com/gregorycollins/hashtables/blob/9e477b825a98e4f574a0e889e5... : This is a perfect example which will probably make things *slower*, depending on the availability of spare registers. It uses a handful of intermediate values, while the branching code uses none, and a single spill caused by the higher register pressure will be much, much more costly than anything else, especially ia32 sucks here. Furthermore, if surrounding code uses shifts and/or complicated addressing modes a lot, one will have to wait for the shifting unit to become available. This is all not theoretical, we had to revert to the straightforward code with branches in Chrome's V8 JavaScript JIT on some platforms/CPUs in similar places. Perhaps the code patterns in GHC-generated code are different, but the only way to know is to do benchmarking on wide variety of benchmarks and CPUs. Yes, that's a lot of work and needs some infrastructure, but without that, changes like this are just a shot in the dark. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9342#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9342: Branchless arithmetic operations -------------------------------------+------------------------------------- Reporter: hvr | Owner: Type: feature request | Status: new Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.8.3 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): -------------------------------------+------------------------------------- Changes (by tibbe): * cc: greg@… (added) Comment: I'm sure Greg measured. I've CCed him. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9342#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9342: Branchless arithmetic operations -------------------------------------+------------------------------------- Reporter: hvr | Owner: Type: feature request | Status: new Priority: normal | Milestone: 8.0.1 Component: Compiler | Version: 7.8.3 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): -------------------------------------+------------------------------------- Comment (by gregorycollins): I measured, yes, but not across processors: when I was working on this I optimized for 64-bit i7 (probably Sandy Bridge IIRC). The version of mask you linked to with the funky branchless code was definitively faster on that chip vs. the simpler alternative: {{{ mask# :: Int# -> Int# -> Int# mask# !a# !b# = let !(I# z#) = fromEnum (a# ==# b#) !q# = negateInt# z# in q# }}} (of course this is from when `==#` returned `Bool` rather than `Int#`). The difference was about 15-20% IIRC. Unfortunately I've lost the raw numbers, sorry, but as Sven points out they'd be useless anyways towards determining how good the change is in aggregate. Quite willing to believe that code could be a pessimization on ia32. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9342#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC