[GHC] #12798: LLVM seeming to over optimize, producing inefficient assembly code...

#12798: LLVM seeming to over optimize, producing inefficient assembly code...
-------------------------------------+-------------------------------------
        Reporter:  GordonBGood       |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler (LLVM)   |              Version:  8.0.1
        Keywords:                    |     Operating System:  Unknown/Multiple
    Architecture:  Unknown/Multiple  |      Type of failure:  Runtime
                                     |      performance bug
       Test Case:                    |           Blocked By:
        Blocking:                    |      Related Tickets:
Differential Rev(s):                 |            Wiki Page:
-------------------------------------+-------------------------------------

Since in many cases the use of the LLVM backend is the only way to avoid
the NCG's poor register allocation (ticket #8971), it is a concern that
"-fllvm" produces overly complex code through a (seemingly) failed attempt
to optimize.

The following code reproduces the problem with a very simple "odds-only"
implementation of the Sieve of Eratosthenes whose very tight inner culling
loop is limited to a 16 Kilobyte buffer (<= the L1 data cache size of most
modern CPUs). It uses a "twos" Look Up Table (LUT) instead of a variable
shift-left operation to set the composite bits in the buffer array, for
better speed: the LUT (should) take the same number of registers, and the
array-lookup instruction is easier for the CPU to fuse than a variable
shift-left:

{{{#!hs
-- GHC_EfficiencyBug.hs
{-# LANGUAGE FlexibleContexts #-}
{-# OPTIONS_GHC -O3 -fllvm -rtsopts -keep-s-files #-} -- or -O2

import Control.Monad.ST -- needed for the ST type annotations below
import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base

numLOOPS = 10000 :: Int

twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

soe :: () -> [Word32]
soe() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where
  bufb = runSTUArray $ do
    let bfLmt = (256 * 1024) `div` 2 - 1 -- to 2^18 + 2 is 128 KBits - 1 = 16 KBytes
    cmpstsb <- newArray (0, bfLmt) False :: ST s (STUArray s Int Bool)
    cmpstsw <- (castSTUArray :: STUArray s Int Bool ->
                                ST s (STUArray s Int Word32)) cmpstsb
    let loop n = -- cull a number of times to test timing
          if n <= 0 then return cmpstsb else
          let cullp i =
                let p = i + i + 3 in
                let s = (p * p - 3) `div` 2 in
                if s > bfLmt then loop (n - 1) else do
                  isCmpst <- unsafeRead cmpstsb i
                  if isCmpst then cullp (i + 1)
                  else -- is Prime
                    let cull j = -- very tight inner loop where all the time is spent
                          if j > bfLmt then cullp (i + 1) else do
                            let sh = unsafeAt twos (j .&. 31) -- (1 `shiftL` (j .&. 31))
                            let w = j `shiftR` 5
                            ov <- unsafeRead cmpstsw w
                            unsafeWrite cmpstsw w (ov .|. sh)
                            cull (j + p)
                    in cull s
          in cullp 0
    loop numLOOPS

main = print $ length $ soe()
}}}

The main culling is repeated "numLOOPS" times, both to give an execution
time long enough for accurate timing and to make the time spent in the
list comprehension that counts the found primes (the answer) a negligible
part of the total. Timing results can be produced by running
"./GHC_EfficiencyBug +RTS -s".
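As an aside, the LUT read and the variable shift it replaces agree on
every masked index; here is a minimal standalone check (a sketch,
separate from the reproducer above) of the equivalence noted in the
comment inside `cull`:

{{{#!hs
import Data.Bits
import Data.Word
import Data.Array.Unboxed
import Data.Array.Base (unsafeAt)

twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

-- Check that indexing the LUT with the masked index yields exactly the
-- single-bit mask the variable shift would compute.
main :: IO ()
main = print $ and [ unsafeAt twos (j .&. 31) == 1 `shiftL` (j .&. 31)
                   | j <- [0 .. 4095] ]  -- prints True
}}}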
The desired assembly for the tight inner loop is what the Rust/LLVM
compiler produces, in this case for x86_64 64-bit code:

{{{
        .p2align        4, 0x90
.LBB10_27:
        movq    %rcx, %rdx
        shrq    $5, %rdx
        movl    %ecx, %esi
        andl    $31, %esi
        movl    (%rbp,%rsi,4), %esi
        orl     %esi, (%r14,%rdx,4)
        addq    %rax, %rcx
.LBB10_26:
        cmpq    %r13, %rcx
        jb      .LBB10_27
}}}

The above code is extremely efficient on a CPU that is not cache
bottlenecked (as the AMD Bulldozer series is) and takes just about three
clock cycles per inner composite-culling iteration on Intel Skylake; it is
just as efficient as x86 code, since the inner loop uses only seven
registers.

Due to this attempt at "over-optimization", the GHC/LLVM backend produces
the following x86_64 64-bit code:

{{{
        .align  16, 0x90
.LBB34_2:                               # %c8T2
                                        # =>This Inner Loop Header: Depth=1
        movq    %rcx, %rsi
        sarq    $5, %rsi
        movl    %r8d, %edi
        andl    $124, %edi
        movl    16(%rax,%rdi), %edi
        orl     %edi, 16(%r11,%rsi,4)
        addq    %r14, %rcx
        addq    %rdx, %r8
        cmpq    %r10, %rcx
        jle     .LBB34_2
}}}

As can be seen, instead of simply masking the "twos" index register with
31 (0x1F), the code uses two extra registers: one holding the "(j * 4)"
byte increment and one accumulating the byte index, which is advanced by
that increment on each iteration and then masked with 124 (0x7C, i.e.
0x1F times 4). This costs two extra registers and one extra addition per
iteration. The extra registers are no problem for x86_64, which has more
than enough, but the extra instruction adds about one third of a CPU
clock cycle per iteration (I know - only about one ninth more time). A
model of the two addressing schemes is sketched below.
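The following hand-written Haskell model (a sketch for illustration, not
compiler output; the names are invented) computes the byte offset into
"twos" both ways: the direct masking that the Rust output uses, and the
strength-reduced form with a second induction variable tracking "j * 4"
that the GHC/LLVM output uses. The two agree, but the reduced form pays
one extra add (and carries two extra values) per iteration:

{{{#!hs
import Data.Bits

-- Byte offsets into the 32-entry "twos" table for successive culls at
-- stride p, starting from index s; both lists are equal.
offsetsDirect, offsetsReduced :: Int -> Int -> [Int]
offsetsDirect  p s = [ (j .&. 31) `shiftL` 2 | j <- [s, s + p ..] ]
offsetsReduced p s = go (s `shiftL` 2)       -- j4 tracks j * 4
  where
    step  = p `shiftL` 2                     -- the extra register: p * 4
    go j4 = (j4 .&. 124) : go (j4 + step)    -- extra add; mask 124 = 0x1F * 4

main :: IO ()
main = print $ take 16 (offsetsDirect 5 11) == take 16 (offsetsReduced 5 11)
}}}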
However, for 32-bit x86 code, which previously had barely enough
registers, the extra registers trigger a chain of three register reloads,
as can be seen in the following assembly code:

{{{
        .align  16, 0x90
LBB33_2:                                # %c8Wb
                                        # =>This Inner Loop Header: Depth=1
        movl    %ebx, %ebp
        sarl    $5, %ebp
        movl    %edi, %ecx
        andl    $124, %ecx
        movl    %esi, %edx
        movl    %eax, %esi
        movl    36(%esp), %eax          # 4-byte Reload
        movl    8(%eax,%ecx), %ecx
        movl    %esi, %eax
        movl    %edx, %esi
        orl     %ecx, 8(%esi,%ebp,4)
        addl    32(%esp), %ebx          # 4-byte Folded Reload
        addl    28(%esp), %edi          # 4-byte Folded Reload
        cmpl    %eax, %ebx
        jle     LBB33_2
}}}

'''The above code runs about 25% slower than it should on Intel Skylake
for this 32-bit code.'''

This was tested with GHC version 8.0.1 under both Windows and Linux, for
both 32-bit and 64-bit code, with identical results for each native code
width.

The code was also tested with 32- and 64-bit code produced by the NCG; for
this specific problem the NCG takes the simple approach and does not waste
the extra registers. However, due to its inefficient register allocation
(ticket #8971), its failure to move the loop-completion check to the end
of the loop (which requires an extra jump instruction), and its failure to
combine the read/modify/write into a single instruction, it is still
slower (much slower for 32-bit code) than the LLVM-produced code. As its
problems are already known, I have not documented the NCG code.

Conclusion: this may seem like a nit-picky bug, as in some use cases the
execution-time cost is very small, but it may indicate problems that have
more serious effects on execution speed in other use cases. My feeling is
that for such low-level, somewhat imperative code, GHC should really
produce code as fast as C/C++/Rust.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12798

Comment (by AlexET):

On current HEAD with llvm 3.9.0, the following code

{{{#!hs
import Control.Monad.ST
import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base

numLOOPS = 10000 :: Int

twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

soe :: () -> [Word32]
soe() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where
  bufb = runSTUArray $ do
    let bfLmt = (256 * 1024) `div` 2 - 1 -- to 2^18 + 2 is 128 KBits - 1 = 16 KBytes
    cmpstsb <- newArray (0, bfLmt) False :: ST s (STUArray s Int Bool)
    cmpstsw <- (castSTUArray :: STUArray s Int Bool ->
                                ST s (STUArray s Int Word32)) cmpstsb
    return $! twos -- force evaluation of twos outside the loop.
    let loop n = -- cull a number of times to test timing
          if n <= 0 then return cmpstsb else
          let cullp i =
                let p = i + i + 3 in
                let s = (p * p - 3) `div` 2 in
                if s > bfLmt then loop (n - 1) else do
                  isCmpst <- unsafeRead cmpstsb i
                  if isCmpst then cullp (i + 1)
                  else -- is Prime
                    let cull j = -- very tight inner loop where all the time is spent
                          if j > bfLmt then cullp (i + 1) else do
                            let sh = unsafeAt twos (j .&. 31) -- (1 `shiftL` (j .&. 31))
                            let w = j `shiftR` 5
                            ov <- unsafeRead cmpstsw w
                            unsafeWrite cmpstsw w (ov .|. sh)
                            cull (j + p)
                    in cull s
          in cullp 0
    loop numLOOPS

main = print $ length $ soe()
}}}

gives this inner loop, which is almost the same as the Rust output:

{{{
.LBB28_7:
        movq    %rcx, %rdx
        sarq    $5, %rdx
        movl    %ecx, %edi
        andl    $31, %edi
        movl    16(%r9,%rdi,4), %edi
        orl     %edi, 16(%rax,%rdx,4)
        addq    %rsi, %rcx
.LBB28_5:
        cmpq    $131071, %rcx
        jle     .LBB28_7
}}}

The CMM and initial LLVM code are the same as for your code on 8.0.1, so
it seems the difference is due to the fact that Rust ships with its own,
more recent LLVM than GHC 8.0.1 supports.

The difference between my version and your original is that we force
`twos` early, which is needed to prevent its evaluation within the loop -
an optimisation which seems to have been missed by HEAD.
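For reference, `return $! twos` evaluates the `twos` CAF to weak-head
normal form once, before the loop is entered, so the hot loop never has to
check or enter its thunk. A minimal sketch of the same idiom outside `ST`
(the toy loop and names are illustrative only, not from the ticket):

{{{#!hs
{-# LANGUAGE BangPatterns #-}
import Data.Bits
import Data.Word
import Data.Array.Unboxed

table :: UArray Int Word32
table = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

-- `seq` plays the role of `return $! twos` here: the table is evaluated
-- once, up front, rather than on first use inside the tight loop.
sumLoop :: Int -> Word32
sumLoop n = table `seq` go 0 0
  where
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + table ! (i .&. 31)) (i + 1)

main :: IO ()
main = print (sumLoop 100000)
}}}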

Comment (by GordonBGood):

So you are saying that we need both to force the evaluation of "twos"
'''and''' to run a newer version of LLVM (which we can't do with GHC
version 8.0.1) in order to get the desired output, but that the forced
evaluation of "twos" would not be necessary if HEAD did not have a
regression with respect to this optimization?

Comment (by GordonBGood):

Replying to [comment:1 AlexET]:
> On current HEAD with llvm 3.9.0, the following code
> ... code skipped for brevity ...
> The CMM and initial LLVM code are the same as for your code on 8.0.1,
> so it seems the difference is due to the fact that Rust ships with its
> own, more recent LLVM than GHC 8.0.1 supports.
>
> The difference between my version and your original is that we force
> `twos` early, which is needed to prevent its evaluation within the
> loop - an optimisation which seems to have been missed by HEAD.

It's even worse with GHC version 8.0.1 than I thought. With the very
minor change of the inner loop to the following:

{{{#!hs
                    let cull j = -- very tight inner loop where all the time is spent
                          if j > bfLmt then return () else do
                            let sh = unsafeAt twos (j .&. 31)
                            let w = j `shiftR` 5
                            ov <- unsafeRead cmpstsw w
                            unsafeWrite cmpstsw w (ov .|. sh) -- (1 `shiftL` (j .&. 31))
                            cull (j + p)
                    in do { cull s; cullp (i + 1) }
          in cullp 0
}}}

changed only so that the loop back to the next prime value is outside the
`cull` loop, the assembly code becomes:

{{{
        .align  16, 0x90
.LBB33_3:                               # %caLP.i.caLP.i_crit_edge
                                        #   in Loop: Header=BB33_2 Depth=1
        movq    (%r12), %rdx
.LBB33_2:                               # %caLP.i
                                        # =>This Inner Loop Header: Depth=1
        movq    %rsi, %rcx
        movl    %ecx, %edi
        movq    %rdx, %rsi
        addq    %rcx, %rsi
        sarq    $5, %rcx
        movq    -24(%r12), %rdx
        movq    -16(%r12), %rbx
        andl    $31, %edi
        movl    16(%rbx,%rdi,4), %edi
        orl     %edi, 16(%rdx,%rcx,4)
        cmpq    -8(%r12), %rsi
        jle     .LBB33_3
}}}

with three register reloads in the loop, and another version of the loop
in the code also reloads the `p` increment value, for a total of four
reloads. I can't see why the generated code should be any different than
for the original ticket code.
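To isolate the shape change, here is a toy skeleton (invented names, no
arrays, semantics only) contrasting the two control-flow structures: in
`shapeA` the inner loop tail-calls the outer step when it finishes, as in
the original code, while in `shapeB` the outer step is sequenced after the
inner loop, as in the variant above. The two are semantically identical,
which is why the reload-heavy code for the second shape is surprising:

{{{#!hs
{-# LANGUAGE BangPatterns #-}
import Control.Monad.ST

-- Both count the same number of inner iterations; only the way control
-- returns to the outer loop differs.
shapeA, shapeB :: Int -> Int
shapeA lmt = runST (outer 0 0)
  where
    outer i !acc
      | i > lmt   = return acc
      | otherwise = inner i acc
      where
        inner j !a
          | j > lmt   = outer (i + 1) a      -- tail-call back to the outer step
          | otherwise = inner (j + i + 1) (a + 1)

shapeB lmt = runST (outer 0 0)
  where
    outer i !acc
      | i > lmt   = return acc
      | otherwise = do a <- inner i acc      -- inner loop returns first ...
                       outer (i + 1) a       -- ... then the outer step runs
      where
        inner j !a
          | j > lmt   = return a
          | otherwise = inner (j + i + 1) (a + 1)

main :: IO ()
main = print (shapeA 1000 == shapeB 1000)    -- prints True
}}}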

Changes (by bgamari):

 * related: => #12808

Description changed by bgamari: the missing `import Control.Monad.ST` was
added to the example code in the ticket description.

Description changed by bgamari; the updated description is identical to
the one shown above.

Changes (by bgamari):

 * milestone: 8.2.1 => 8.4.1

Comment:

It doesn't look like anything will happen on this for 8.2.