[GHC] #8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3...

#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... ------------------------------+-------------------------------------------- Reporter: | Owner: GordonBGood | Status: new Type: bug | Milestone: Priority: normal | Version: 7.8.1-rc2 Component: Compiler | Operating System: Unknown/Multiple Keywords: | Type of failure: Runtime performance bug Architecture: | Test Case: Unknown/Multiple | Blocking: Difficulty: Unknown | Blocked By: | Related Tickets: | ------------------------------+-------------------------------------------- The output assembly code is not as optimized for the Windows 32-bit version 7.8.1 RC2 compiler as the Windows 7.6.3 compiler (32-bit) when the option switches are exactly the same although it may not be limited to only the Windows platform; this has a negative impact on execution time for tight loops of about a factor of two times slower. The following code will reproduce the problem: {{{#!haskell -- GHC_NCG_OptimizationBug.hs -- it seems the Haskell GHC 7.8.1 NCG Native Code Generator (NCG) doesn't -- optimize as well for (at least) the x86 target as version 7.6.3 {-# OPTIONS_GHC -O3 -rtsopts -v -dcore-lint -ddump-asm -ddump-to-file -dumpdir . #-} -- or O2 import Data.Bits import Control.Monad.ST (runST,ST(..)) import Data.Array.Base -- Uses a very simple Sieve of Eratosthenes to 2 ^ 18 to prove it. accNumPrimes :: Int -> Int accNumPrimes acc = acc `seq` runST $ do let bfSz = (256 * 1024 - 3) `div` 2 bfLmtWrds = (bfSz + 1) `div` 32 bufw <- newArray (0, bfLmtWrds) (-1) :: ST s (STUArray s Int Int) -- to clear the last "uneven" bit(s) unsafeWrite bufw bfLmtWrds (complement ((-2) `shiftL` (bfSz .&. 31))) bufb <- (castSTUArray :: STUArray s Int Int -> ST s (STUArray s Int Bool)) bufw let cullp i = let p = i + i + 3 in let s = (p * p - 3) `div` 2 in if s > bfSz then let count i sm = do sm `seq` if i > bfLmtWrds then return (acc + sm) else do wd <- unsafeRead bufw i count (i + 1) (sm + (popCount wd)) in count 0 1 -- use '1' for the '2' prime not in the array else do v <- unsafeRead bufb i if v then let cull j = do -- very tight inner loop if j > bfSz then cullp (i + 1) else do unsafeWrite bufb j False cull (j + p) in cull s else cullp (i + 1) cullp 0 main = -- run the program a number of times to get a reasonable time... let numloops = 2000 in let loop n acc = acc `seq` if n <= 0 then acc else loop (n - 1) (accNumPrimes acc) in print $ loop numloops 0 }}} The above code takes almost twice as long to run when compiled under 7.8.1 RC2 for Windows (32-bit) as it does for the version 7.6.3 compiler (both 32-bit compilers). The -ddump-simpl Core dump is almost identical between the two, which is also evidenced by that using the -fllvm LLVM compiler back end switch for each results in code that runs at about the same speed for each compiler run (which would use the same Core output as used for NCG, right?). Under Windows, the compilation and run for 7.8.1 RC2 goes like this: {{{ *Main> :! E:\ghc-7.8.0.20140228_32\bin\ghc --make -pgmlo "E:\llvm32\build\Release\bin\opt" -pgmlc "E:\llvm32\build\Release\bin\llc" "GHC_NCG_OptimizationBug.hs" compile: input file WindowsVsLinuxNCG.hs Created temporary directory: C:\Users\Gordon\AppData\Local\Temp\ghc15460_0 *** Checking old interface for main:Main: *** Parser: *** Renamer/typechecker: [1 of 1] Compiling Main ( GHC_NCG_OptimizationBug.hs, GHC_NCG_OptimizationBug.o ) *** Desugar: Result size of Desugar (after optimization) = {terms: 260, types: 212, coercions: 0} *** Core Linted result of Desugar (after optimization): *** Simplifier: Result size of Simplifier iteration=1 = {terms: 213, types: 136, coercions: 52} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 215, types: 148, coercions: 67} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 209, types: 135, coercions: 51} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 209, types: 135, coercions: 42} *** Core Linted result of Simplifier: *** Specialise: Result size of Specialise = {terms: 209, types: 135, coercions: 42} *** Core Linted result of Specialise: *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}): Result size of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}) = {terms: 286, types: 185, coercions: 42} *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}): *** Float inwards: Result size of Float inwards = {terms: 286, types: 185, coercions: 42} *** Core Linted result of Float inwards: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 502, types: 393, coercions: 103} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 428, types: 326, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 420, types: 321, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 420, types: 321, coercions: 29} *** Core Linted result of Simplifier: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 418, types: 318, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 418, types: 318, coercions: 29} *** Core Linted result of Simplifier: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 475, types: 383, coercions: 32} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 444, types: 336, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 444, types: 336, coercions: 9} *** Core Linted result of Simplifier: *** Demand analysis: Result size of Demand analysis = {terms: 444, types: 336, coercions: 9} *** Core Linted result of Demand analysis: *** Worker Wrapper binds: Result size of Worker Wrapper binds = {terms: 579, types: 457, coercions: 9} *** Core Linted result of Worker Wrapper binds: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 510, types: 415, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 420, types: 322, coercions: 9} *** Core Linted result of Simplifier: *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}): Result size of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}) = {terms: 426, types: 326, coercions: 9} *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}): *** Common sub-expression: Result size of Common sub-expression = {terms: 424, types: 326, coercions: 9} *** Core Linted result of Common sub-expression: *** Float inwards: Result size of Float inwards = {terms: 424, types: 326, coercions: 9} *** Core Linted result of Float inwards: *** Liberate case: Result size of Liberate case = {terms: 1,824, types: 1,259, coercions: 9} *** Core Linted result of Liberate case: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 608, types: 422, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 604, types: 413, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 604, types: 413, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 604, types: 413, coercions: 9} *** Core Linted result of Simplifier: *** SpecConstr: Result size of SpecConstr = {terms: 708, types: 505, coercions: 9} *** Core Linted result of SpecConstr: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 702, types: 499, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 608, types: 405, coercions: 9} *** Core Linted result of Simplifier: *** Tidy Core: Result size of Tidy Core = {terms: 608, types: 405, coercions: 9} *** Core Linted result of Tidy Core: *** CorePrep: Result size of CorePrep = {terms: 825, types: 489, coercions: 9} *** Core Linted result of CorePrep: *** Stg2Stg: *** CodeOutput: *** New CodeGen: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** Assembler: "E:\ghc-7.8.0.20140228_32\lib/../mingw/bin/gcc.exe" "-U__i686" "-fno- stack-protector" "-DTABLES_NEXT_TO_CODE" "-I." "-x" "assembler-with-cpp" "-c" "C:\Users\Gordon\AppData\Local\Temp\ghc15460_0\ghc15460_2.s" "-o" "GHC_NCG_OptimizationBug.o" Linking GHC_NCG_OptimizationBug.exe ... *Main> :! GHC_NCG_OptimizationBug +RTS -s 46000000 32,965,096 bytes allocated in the heap 7,032 bytes copied during GC 41,756 bytes maximum residency (2 sample(s)) 19,684 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 61 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s INIT time 0.00s ( 0.00s elapsed) MUT time 1.73s ( 1.73s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 1.73s ( 1.73s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 19,006,902 bytes per MUT second Productivity 100.0% of total user, 100.2% of total elapsed }}} whereas under version 7.6.3 goes like this: {{{ *Main> :! E:\ghc-7.6.3_32\bin\ghc --make -pgmlo "E:\llvm32\build\Release\bin\opt" -pgmlc "E:\llvm32\build\Release\bin\llc" "GHC_NCG_OptimizationBug.hs" compile: input file GHC_NCG_OptimizationBug.hs Created temporary directory: C:\Users\Gordon\AppData\Local\Temp\ghc28200_0 *** Checking old interface for main:Main: *** Parser: *** Renamer/typechecker: [1 of 1] Compiling Main ( GHC_NCG_OptimizationBug.hs, GHC_NCG_OptimizationBug.o ) *** Desugar: Result size of Desugar (after optimization) = {terms: 247, types: 212, coercions: 0} *** Core Linted result of Desugar (after optimization): *** Simplifier: Result size of Simplifier iteration=1 = {terms: 198, types: 132, coercions: 35} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 200, types: 144, coercions: 43} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 194, types: 131, coercions: 57} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 194, types: 131, coercions: 39} *** Core Linted result of Simplifier: *** Specialise: Result size of Specialise = {terms: 194, types: 131, coercions: 39} *** Core Linted result of Specialise: *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}): Result size of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}) = {terms: 277, types: 191, coercions: 39} *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}): *** Float inwards: Result size of Float inwards = {terms: 277, types: 191, coercions: 39} *** Core Linted result of Float inwards: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 514, types: 403, coercions: 103} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 420, types: 317, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 412, types: 312, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 412, types: 312, coercions: 29} *** Core Linted result of Simplifier: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 410, types: 309, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 410, types: 309, coercions: 29} *** Core Linted result of Simplifier: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 455, types: 364, coercions: 32} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 422, types: 317, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 422, types: 317, coercions: 9} *** Core Linted result of Simplifier: *** Demand analysis: Result size of Demand analysis = {terms: 422, types: 317, coercions: 9} *** Core Linted result of Demand analysis: *** Worker Wrapper binds: Result size of Worker Wrapper binds = {terms: 536, types: 427, coercions: 9} *** Core Linted result of Worker Wrapper binds: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 480, types: 391, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 400, types: 306, coercions: 9} *** Core Linted result of Simplifier: *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}): Result size of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}) = {terms: 408, types: 311, coercions: 9} *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}): *** Common sub-expression: Result size of Common sub-expression = {terms: 406, types: 311, coercions: 9} *** Core Linted result of Common sub-expression: *** Float inwards: Result size of Float inwards = {terms: 406, types: 311, coercions: 9} *** Core Linted result of Float inwards: *** Liberate case: Result size of Liberate case = {terms: 1,186, types: 824, coercions: 9} *** Core Linted result of Liberate case: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 585, types: 411, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 569, types: 392, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 569, types: 392, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 569, types: 392, coercions: 9} *** Core Linted result of Simplifier: *** SpecConstr: Result size of SpecConstr = {terms: 746, types: 566, coercions: 9} *** Core Linted result of SpecConstr: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 739, types: 560, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 762, types: 546, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 642, types: 402, coercions: 9} *** Core Linted result of Simplifier: *** Tidy Core: Result size of Tidy Core = {terms: 642, types: 402, coercions: 9} *** Core Linted result of Tidy Core: writeBinIface: 10 Names writeBinIface: 34 dict entries *** CorePrep: Result size of CorePrep = {terms: 779, types: 483, coercions: 9} *** Core Linted result of CorePrep: *** Stg2Stg: *** CodeOutput: *** CodeGen: *** Assembler: "E:\ghc-7.6.3_32\lib/../mingw/bin/gcc.exe" "-fno-stack-protector" "-Wl ,--hash-size=31" "-Wl,--reduce-memory-overheads" "-I." "-c" "C:\Users\Gordon\AppData\Local\Temp\ghc28200_0\ghc28200_0.s" "-o" "GHC_NCG_OptimizationBug.o" Linking GHC_NCG_OptimizationBug.exe ... *Main> :! GHC_NCG_OptimizationBug +RTS -s 46000000 32,989,396 bytes allocated in the heap 4,976 bytes copied during GC 41,860 bytes maximum residency (2 sample(s)) 19,580 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 61 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s INIT time 0.00s ( 0.00s elapsed) MUT time 0.64s ( 0.64s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.66s ( 0.64s elapsed) %GC time 0.0% (0.1% elapsed) Alloc rate 51,495,642 bytes per MUT second Productivity 100.0% of total user, 102.3% of total elapsed }}} Looking at the ASM dump for the innermost tight culling loop reveals the problem, with 7.8.1 RC2 outputting as follow: {{{ _n3nx: movl 76(%esp),%ecx _c3gf: cmpl %ecx,%eax jg _c3jB _c3jC: movl %eax,%edx sarl $5,%edx movl %ecx,76(%esp) movl $1,%ecx movl %ecx,280(%esp) movl %eax,%ecx andl $31,%ecx movl %eax,292(%esp) movl 280(%esp),%eax shll %cl,%eax xorl $-1,%eax movl 64(%esp),%ecx addl $8,%ecx movl (%ecx,%edx,4),%ecx andl %eax,%ecx movl 64(%esp),%eax addl $8,%eax movl %ecx,(%eax,%edx,4) movl 292(%esp),%eax addl $3,%eax jmp _n3nx }}} and 7.6.3 outputting as follows: {{{ .text .align 4,0x90 .long 1894 .long 32 s1GZ_info: _c1YB: cmpl 16(%ebp),%esi jg _c1YE movl %esi,%edx sarl $5,%edx movl $1,%eax movl %esi,%ecx andl $31,%ecx shll %cl,%eax xorl $-1,%eax movl 12(%ebp),%ecx movl 8(%ecx,%edx,4),%ecx andl %eax,%ecx movl 12(%ebp),%eax movl %ecx,8(%eax,%edx,4) addl 4(%ebp),%esi jmp s1GZ_info _c1YE: movl 8(%ebp),%esi addl $8,%ebp jmp s1GB_info }}} The second code is clearly much more efficient, with the only memory access reading/writing the sieve buffer array and one register reload of the prime value to add to the current position index, whereas the first (7.8.1 RC2) code has three register spills and five register re-loads, almost as if debugging were still turned on. This bug was tested under Windows, but likely applies to other platforms, at least for 32-bit versions but also possibly to others. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Changes (by simonmar): * cc: simonmar (added) * component: Compiler => Compiler (NCG) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Changes (by gidyn): * cc: gideon@… (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by ezyang): Doesn't seem to be a problem on 64-bit: {{{ .Lc4cz: cmpq %r8,%rbx jg .Lc4cJ .Lc4cK: movq %rbx,%rdx sarq $6,%rdx movl $1,%edi movq %rcx,%r9 movq %rbx,%rcx andq $63,%rcx shlq %cl,%rdi xorq $-1,%rdi leaq 16(%rax),%rcx movq (%rcx,%rdx,8),%rcx andq %rdi,%rcx leaq 16(%rax),%rdi movq %rcx,(%rdi,%rdx,8) addq %r9,%rbx .Ln4g8: movq %r9,%rcx jmp .Lc4cz }}} What happens if you use `-freg-graph`? If you could post C-- dumps that would also be helpful. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by awson): I didn't go into details but it looks `-fregs-graph` doesn't help. 7.8.1-compiled exec is still much slower than 7.6.3-compiled. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by awson): I've checked things on 32-bit linux. It suffers too. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by awson): Also, when on 32-bit the difference between NCG and LLVM is *huge*, the difference on 64-bit (Windows) is also pretty much big - about 55% on Yorkfield, and 40% on Sandy Bridge. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by awson): The same picture is for 64-bit linux ghc-7.8.1 and LLVM 3.4. NCG loses badly. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by GordonBGood): This bug has nothing to do with the difference between the LLVM and the NCG compiler backends; it it well known that NCG isn't that efficient for tight loops. For this bug use case code, NCG has never combined the memory read, the 'and' operation, and the memory write into one read/modify/write instruction whereas LLVM does, which explains the difference between them. This bug has to do with '''why the GHC version 7.8.1 NCG has taken a step backward''' from the previous stable version 7.6.3 in performance due to more register spills and reloads. I only referred to LLVM as proof that the problem seems to be limited to NCG as both NCG and LLVM will share the same C-- output (or at least I think so???) yet NCG shows this step backwards whereas LLVM does not. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

I only referred to LLVM as proof that the problem seems to be limited to NCG as both NCG and LLVM will share the same C-- output (or at least I
#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by tibbe): Replying to [comment:8 GordonBGood]: think so???) yet NCG shows this step backwards whereas LLVM does not. That's a good indication that the NGC is to blame, but it's also possible that the Cmm codegen has regressed but some LLVM optimizations make up for the regression. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Replying to [comment:8 GordonBGood]:
I only referred to LLVM as proof that the problem seems to be limited to NCG as both NCG and LLVM will share the same C-- output (or at least I
#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by GordonBGood): Replying to [comment:9 tibbe]: think so???) yet NCG shows this step backwards whereas LLVM does not.
That's a good indication that the NGC is to blame, but it's also
possible that the Cmm codegen has regressed but some LLVM optimizations make up for the regression. Alright, as requested by "ezyang" earlier, I added the -ddump-cmm and -ddump-opt-cmm switches to the compilation, with the following results: For version 7.6.3, the CMM code for the loops looks like this: {{{ s1Hq_ret() { label: s1Hq_info rep:StackRep [True, False, True, True, True] } c1XX: _c1Tr::I32 = %MO_S_Gt_W32(R1, I32[Sp + 4]); ; if (_c1Tr::I32 >= 1) goto c1XZ; _s1H9::I32 = %MO_S_Shr_W32(R1, 5); _s1He::I32 = I32[I32[Sp + 8] + 8 + (_s1H9::I32 << 2)]; _s1KJ::I32 = R1; _s1Hh::I32 = _s1KJ::I32 & 31; _s1Hj::I32 = _s1Hh::I32; _s1KI::I32 = 1 << _s1Hj::I32; _s1Hm::I32 = _s1KI::I32 ^ 18446744073709551615; _s1KH::I32 = _s1He::I32 & _s1Hm::I32; I32[I32[Sp + 8] + 8 + (_s1H9::I32 << 2)] = _s1KH::I32; _s1KG::I32 = R1 + 3; R1 = _s1KG::I32; jump s1Hq_info; // [R1] c1XZ: R1 = 1; jump s1HY_info; // [R1] }, }}} and the opt-cmm code for about the same area looks like this: {{{ s1Hq_ret() { Just s1Hq_info: const 933; const 32; } c1XX: ; if (%MO_S_Gt_W32(R1, I32[Sp + 4])) goto c1XZ; _s1H9::I32 = %MO_S_Shr_W32(R1, 5); I32[I32[Sp + 8] + ((_s1H9::I32 << 2) + 8)] = I32[I32[Sp + 8] + ((_s1H9::I32 << 2) + 8)] & (1 << R1 & 31) ^ 18446744073709551615; R1 = R1 + 3; jump s1Hq_info; // [R1] c1XZ: R1 = 1; jump s1HY_info; // [R1] } }}} For version 7.8.1 RC2 the cmm code is about eight times larger with about eight times the number of lines and looks like this {{{ c3jO: _c3jQ::I32 = %MO_S_Gt_W32(_s33c::I32, _s32t::I32); _s33e::I32 = _c3jQ::I32; if (_s33e::I32 >= 1) goto c3jY; else goto c3jZ; c3jY: _s32S::I32 = _s335::I32; goto c3gp; c3jZ: _c3kn::I32 = %MO_S_Shr_W32(_s33c::I32, 5); _s33g::I32 = _c3kn::I32; _s33j::I32 = I32[(_s327::P32 + 8) + (_s33g::I32 << 2)]; _s33j::I32 = _s33j::I32; _c3kq::I32 = _s33c::I32; _s33k::I32 = _c3kq::I32; _c3kt::I32 = _s33k::I32 & 31; _s33l::I32 = _c3kt::I32; _c3kw::I32 = _s33l::I32; _s33m::I32 = _c3kw::I32; _c3kz::I32 = 1 << _s33m::I32; _s33n::I32 = _c3kz::I32; _c3kC::I32 = _s33n::I32 ^ 4294967295; _s33o::I32 = _c3kC::I32; _c3kF::I32 = _s33j::I32 & _s33o::I32; _s33p::I32 = _c3kF::I32; I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] = _s33p::I32; _c3kK::I32 = _s33c::I32 + _s32U::I32; _s33r::I32 = _c3kK::I32; _s33c::I32 = _s33r::I32; goto c3jO; }}} the the optimized opt-cmm code looks like this: {{{ c3jO: if (%MO_S_Gt_W32(_s33c::I32, _s32t::I32)) goto c3jY; else goto c3jZ; c3jY: Sp = Sp + 8; _s32S::I32 = _s335::I32; goto c3gp; c3jZ: _s33g::I32 = %MO_S_Shr_W32(_s33c::I32, 5); I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] = I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] & (1 << _s33c::I32 & 31) ^ 4294967295; _s33c::I32 = _s33c::I32 + _s32U::I32; goto c3jO; }}} It appears to me that the opt-cmm code is about the same but the straight cmm dump has regressed to have lost even the basic optimizations that were there with the older version. It may be that the NCG is using the non-optimized version of CMM as a source where the LLVM generator is using the optimized CMM version, which would explain why LLVM backend generated code is still efficient where NCG generated code is not. Anticipating your next request to look at the STG output, I used the -ddump-stg compiler option to examine that code. The code is too difficult to post here (wordy and lines very long), but a quick examination shows it to be about the same for both versions except that internal constants are recorded as 32-bit numbers in 7.8.1 whereas they were recorded as 64-bit numbers even when referring to 32-bit registers for the older 7.6.3 code; this corresponds to the way they are recorded in both the optimized and non-optimized CMM files by version as listed above. Thus, the bug/regression appears to be go further back than just the new NCG (which is likely using the non-optimized CMM code as input) but also to the CMM code generator in that it is producing much less efficient CMM code. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by jstolarek): I know nothing about 7.6 Cmm generation, but I worked on 7.8 Cmm pipeline this summer so I can offer some guidance. First I'd like to clarify a few things:
both NCG and LLVM will share the same C-- output
No, they will not. LLVM backend requires "proc-point splitting". This means we need to turn every Cmm block that is succcesor of more than one block into a separate procedure (at least that is my understanding). See [https://github.com/ghc/ghc/blob/f8e12e2b396e0c475e1403ab8ac3fc4d63c1681e/com... here] and [https://github.com/ghc/ghc/blob/f8e12e2b396e0c475e1403ab8ac3fc4d63c1681e/com... here] to see source of differences between Cmm generated for both backends. [https://github.com/ghc/ghc/blob/f8e12e2b396e0c475e1403ab8ac3fc4d63c1681e/com... Here] you'll find more on proc-points.
straight cmm dump has regressed to have lost even the basic optimizations that were there with the older version.
It may be that the NCG is using the non-optimized version of CMM as a
I'm not sure what was the philosophy behind 7.6 Cmm geneartion but in 7.8 we just generate Cmm from STG in the simplest possible way and then optimize that Cmm. This is similar to generating Core from Haskell and then doing a series of core-to-core transformations. So you need to look at the final Cmm, not the one that comes out from the Cmm->STG pass. source It is using the optimized version. You can see for yourself in the [https://github.com/ghc/ghc/blob/f8e12e2b396e0c475e1403ab8ac3fc4d63c1681e/com... tryNewCodeGen function] and its [https://github.com/ghc/ghc/blob/f8e12e2b396e0c475e1403ab8ac3fc4d63c1681e/com... call site].
Anticipating your next request to look at the STG output
Bad anticipation :-) Look at generated assembly. Only this can tell you what is the real difference between generated code. I wouldn't be surprised to see something in the lines of #8048.
Thus, the bug/regression appears to be go further back than just the new NCG (which is likely using the non-optimized CMM code as input) but also to the CMM code generator in that it is producing much less efficient CMM code.
Let me repeat: a) NCG is using optimized Cmm (look at the code); b) don't look at the un-optimized Cmm - it's irrelevant. Finally, I you want to learn more about Cmm pipeline and Cmm debugging see [wiki:Commentary/Compiler/CodeGen]. Hope that helps. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by awson): Replying to [comment:8 GordonBGood]:
This bug has nothing to do with the difference between the LLVM and the NCG compiler backends; it it well known that NCG isn't that efficient for tight loops. For this bug use case code, NCG has never combined the memory read, the 'and' operation, and the memory write into one read/modify/write instruction whereas LLVM does, which explains the difference between them.
Oh, I know this. But in the last days I fought with a pile of bugs in 64-bit Windows GHC and I saw a lot of NCG-produced assembly that was bad far beyond what you've described above. Off the top of my head, when I worked with #8974, I saw this: {{{ call suspendThread addq $40,%rsp subq $8,%rsp subq $32,%rsp movq %rax,%rbx movl $0,%eax call performMajorGC addq $40,%rsp movq %rbx,%rcx subq $8,%rsp subq $32,%rsp movl $0,%eax call resumeThread }}} produced by NCG, versus {{{ callq suspendThread movq %rax, %rsi callq performMajorGC movq %rsi, %rcx callq resumeThread }}} produced by LLVM. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Replying to [comment:8 GordonBGood]:
This bug has nothing to do with the difference between the LLVM and
#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by GordonBGood): Replying to [comment:12 awson]: the NCG compiler backends; it it well known that NCG isn't that efficient for tight loops. For this bug use case code, NCG has never combined the memory read, the 'and' operation, and the memory write into one read/modify/write instruction whereas LLVM does, which explains the difference between them.
Oh, I know this. But in the last days I fought with a pile of bugs in
64-bit Windows GHC and I saw a lot of NCG-produced assembly that was bad far beyond what you've described above.
Off the top of my head, when I worked on #8974, I saw this: {{{ call suspendThread addq $40,%rsp subq $8,%rsp subq $32,%rsp movq %rax,%rbx movl $0,%eax call performMajorGC addq $40,%rsp movq %rbx,%rcx subq $8,%rsp subq $32,%rsp movl $0,%eax call resumeThread }}} produced by NCG, versus {{{ callq suspendThread movq %rax, %rsi callq performMajorGC movq %rsi, %rcx callq resumeThread }}} produced by LLVM.
Thanks for your reply and explanation; I've seen the same types of things with 7.6.3 comparing tight loops, but more generally just much less efficient use of registers by NCG as compared to LLVM with many more register spills and reloads; however, the 32-bit version of 7.8.1 RC2's new NCG is even worse as per my bug report. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

I know nothing about 7.6 Cmm generation, but I worked on 7.8 Cmm
#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by GordonBGood): Replying to [comment:11 jstolarek]: pipeline this summer so I can offer some guidance. First I'd like to clarify a few things: Thanks for your explanations and clarifications, some of which I have dug out for myself in the last few minutes - I had already edit added something to the post to which you are replying:
both NCG and LLVM will share the same C-- output
No, they will not. LLVM backend requires "proc-point splitting". This means we need to turn every Cmm block that is succcesor of more than one block into a separate procedure (at least that is my understanding). See [https://github.com/ghc/ghc/blob/f8e12e2b396e0c475e1403ab8ac3fc4d63c1681e/com... here] and [https://github.com/ghc/ghc/blob/f8e12e2b396e0c475e1403ab8ac3fc4d63c1681e/com... here] to see source of differences between Cmm generated for both backends. [https://github.com/ghc/ghc/blob/f8e12e2b396e0c475e1403ab8ac3fc4d63c1681e/com... Here] you'll find more on proc-points.
straight cmm dump has regressed to have lost even the basic optimizations that were there with the older version.
I'm not sure what was the philosophy behind 7.6 Cmm geneartion but in 7.8 we just generate Cmm from STG in the simplest possible way and then optimize that Cmm. This is similar to generating Core from Haskell and
It may be that the NCG is using the non-optimized version of CMM as a
Learned something new here. OK, end result is that using LLVM as a proof that the problem is in NCG because they use the same CMM code is invalid, so it was worth checking the CMM code. then doing a series of core-to-core transformations. So you need to look at the final Cmm, not the one that comes out from the Cmm->STG pass. OK, I dug that out for myself and now see that in the final pass the CMM code is almost identical for the code that triggers the symptom (other than for the constants recorded as 32-bit depth for 32-bit registers where they used to be recorded as 64-bit even though only 32-bits were used). source
It is using the optimized version. You can see for yourself in the
[https://github.com/ghc/ghc/blob/f8e12e2b396e0c475e1403ab8ac3fc4d63c1681e/com... tryNewCodeGen function] and its [https://github.com/ghc/ghc/blob/f8e12e2b396e0c475e1403ab8ac3fc4d63c1681e/com... call site]. OK, so given (almost) identical final CMM code, the problem is (likely) not in the CMM code but in the NCG.
Anticipating your next request to look at the STG output
Wrong anticipation :-) Look at generated assembly. Only this can tell you what is the real difference between generated code. I wouldn't be surprised to see something in the lines of #8048.
Thus, the bug/regression appears to be go further back than just the new NCG (which is likely using the non-optimized CMM code as input) but also to the CMM code generator in that it is producing much less efficient CMM code.
I had already posted the inner loop from the assembly as part of the original bug report; someone requested that I post CMM output, which shows it is not (likely) the problem and we are back to NCG as you confirm here.
Let me repeat: a) NCG is using optimized Cmm (look at the code); b) don't look at the un-optimized Cmm - it's irrelevant.
Yup, you have made that clear: I never want to look at CMM code ever again.
Finally, I you want to learn more about Cmm pipeline and Cmm debugging see [wiki:Commentary/Compiler/CodeGen].
Hope that helps.
Yes, I found that link on my own which is how I came to a fuller understanding of the contents of the new CMM file and its relationship to the final opt-cmm file. But you have been most helpful in explaining that LLVM and NCG do not (can not) use the same CMM source. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by GordonBGood): The bug is still there with 7.8.2 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:15 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator for 7.8 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator for 7.8 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Changes (by simonmar): * milestone: => 7.10.1 Comment: I looked at this. The short version is that this isn't going to be easy to fix, and I recommend using LLVM. The new code generator generates better code for let-no-escapes than the old code generator. Whereas previously a let-no-escape would be a whole function with a fixed calling convention, with the new code generator it is just a label, and to call it we assign local variables for the arguments. All things being equal, this is a better way to do things, because we're not fixing the arguments to particular registers, and we're giving the later phases more flexibility to optimise. However, the register allocator in our NCG isn't very good, particularly with loops, and it ends up doing a worse job with the new code than the old, even though previously most of the arguments were actually passed on the Haskell stack. I don't think there are any easy tweaks to improve things, unfortunately. On 64-bit we get away with it because we have more registers. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator for 7.8 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by carter): @simon, so the right path forward is someone putting some engineering time into the new ncg register allocator? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator for 7.8 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by GordonBGood): Replying to [comment:17 simonmar]:
The short version is that this isn't going to be easy to fix, and I recommend using LLVM.
In fact, I prefer LLVM because it is more consistent for timing, but of course that brings us to that other bug for 64-bit Windows where 64-bit GHC 7.8.2 still doesn't work with LLVM (3.4/3.5) at least for mingw as per the distrubuted binary, although it does under 64-bit Linux. The above test program runs under LLVM for a 32-bit compile but crashes on run for 64-bit, both on Windows. Or perhaps there is something wrong with my compile of 64-bit LLVM even though I followed the exact same procedure as generating the 32-bit version from trunk. Don Stewart's entry that working LLVM should really be distributed with GHC is a very good one. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

but of course that brings us to that other bug for 64-bit Windows where 64-bit GHC 7.8.2 still doesn't work with LLVM (3.4/3.5) at least for mingw as per the distrubuted binary, although it does under 64-bit Linux. The above test program runs under LLVM for a 32-bit compile but crashes on run for 64-bit, both on Windows.
Or perhaps there is something wrong with my compile of 64-bit LLVM even
#8971: Native Code Generator for 7.8 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by awson): Replying to [comment:19 GordonBGood]: though I followed the exact same procedure as generating the 32-bit version from trunk. Don Stewart's entry that working LLVM should really be distributed with GHC is a very good one. This is #8974. My patch there makes your program work on 64-bit Windows (although I don't like it because it uses mangler). It is compatible with existing 64-bit LLVM-3.4, distributed by MSYS2 (mingw-w64) project. We could pack this LLVM distro along with 64-bit GHC binary and configure ghc to use it (then also #7143 would be solved, and I have the patch for this ready) - just like we go with gcc/binutils/perl in all windows distros. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator for 7.8 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by simonmar): Replying to [comment:18 carter]:
@simon, so the right path forward is someone putting some engineering time into the new ncg register allocator?
If you mean `-fregs-graph`, then yes, working on that would be a good way to fix this bug. Right now I believe it has regressed since the switch to the new code generator: #7679 Another angle would be to change the calling convention so that we use `%ebx` for R1 on 32-bit platforms. Right now we use `%esi`, which doesn't have an 8-bit version so we can't let the register allocator use it on 32-bit x86 (because neither register allocator can handle the constraint that this reg can't be used for 8-bit calculations). Implementing that is more work than changing the calling convention, but changing the calling convention means doing the same thing in LLVM (and we get a dependency on a new version of LLVM, ugh). Yet another angle would be to make the linear scan allocator sensitive to loops. This might not be too hard, actually: the general plan would be to (a) identify the loops, (b) identify the important loop-carried variables, and (c) assign registers on entry to the loop so that the important variables go into regs. Still, I find it hard to get too excited about any of this when it doesn't affect 64-bit platforms and in any case we can use LLVM. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:21 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator for 7.8 is not as optimized as 7.6.3... --------------------------------------------+------------------------------ Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 7.10.1 Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime performance bug | Unknown/Multiple Test Case: | Difficulty: Unknown Blocking: | Blocked By: | Related Tickets: --------------------------------------------+------------------------------ Comment (by carter): Which of these plans of attack are more likely (well, until benchmarks say otherwise) to also have a positive impact on 64-bit systems? (i'm similarly neutral about 32bit performance, but if theres some work that could have positive fallout for both, that'd be a good starting point) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:22 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator for 7.8 is not as optimized as 7.6.3... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 7.12.1 Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Changes (by gidyn): * cc: gideon@… (removed) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:24 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator for 7.8 is not as optimized as 7.6.3... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by GordonBGood): This bug has not been assigned to GHC Milestone 8.2 so I don't suppose any action will be taken on it. I just checked with GHC 8.0.1 and the problem is still there, with the NCG produced code for these type of tight loops running about four times slower with x86 code than for x64 code. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:27 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator for 7.8 is not as optimized as 7.6.3... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.8.1-rc2 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by GordonBGood): Fortunately for x86 code we can use the LLVM back end, even though LLVM still doesn't work reliably for 64 bit code on Windows. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:28 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator for 8.0.1 is not as optimized as 7.6.3... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by GordonBGood): * milestone: => 8.2.1 * version: 7.8.1-rc2 => 8.0.1 @@ -2,1 +2,1 @@ - version 7.8.1 RC2 compiler as the Windows 7.6.3 compiler (32-bit) when the + version 8.0.1 compiler as the Windows 7.6.3 compiler (32-bit) when the New description: The output assembly code is not as optimized for the Windows 32-bit version 8.0.1 compiler as the Windows 7.6.3 compiler (32-bit) when the option switches are exactly the same although it may not be limited to only the Windows platform; this has a negative impact on execution time for tight loops of about a factor of two times slower. The following code will reproduce the problem: {{{#!haskell -- GHC_NCG_OptimizationBug.hs -- it seems the Haskell GHC 7.8.1 NCG Native Code Generator (NCG) doesn't -- optimize as well for (at least) the x86 target as version 7.6.3 {-# OPTIONS_GHC -O3 -rtsopts -v -dcore-lint -ddump-asm -ddump-to-file -dumpdir . #-} -- or O2 import Data.Bits import Control.Monad.ST (runST,ST(..)) import Data.Array.Base -- Uses a very simple Sieve of Eratosthenes to 2 ^ 18 to prove it. accNumPrimes :: Int -> Int accNumPrimes acc = acc `seq` runST $ do let bfSz = (256 * 1024 - 3) `div` 2 bfLmtWrds = (bfSz + 1) `div` 32 bufw <- newArray (0, bfLmtWrds) (-1) :: ST s (STUArray s Int Int) -- to clear the last "uneven" bit(s) unsafeWrite bufw bfLmtWrds (complement ((-2) `shiftL` (bfSz .&. 31))) bufb <- (castSTUArray :: STUArray s Int Int -> ST s (STUArray s Int Bool)) bufw let cullp i = let p = i + i + 3 in let s = (p * p - 3) `div` 2 in if s > bfSz then let count i sm = do sm `seq` if i > bfLmtWrds then return (acc + sm) else do wd <- unsafeRead bufw i count (i + 1) (sm + (popCount wd)) in count 0 1 -- use '1' for the '2' prime not in the array else do v <- unsafeRead bufb i if v then let cull j = do -- very tight inner loop if j > bfSz then cullp (i + 1) else do unsafeWrite bufb j False cull (j + p) in cull s else cullp (i + 1) cullp 0 main = -- run the program a number of times to get a reasonable time... let numloops = 2000 in let loop n acc = acc `seq` if n <= 0 then acc else loop (n - 1) (accNumPrimes acc) in print $ loop numloops 0 }}} The above code takes almost twice as long to run when compiled under 7.8.1 RC2 for Windows (32-bit) as it does for the version 7.6.3 compiler (both 32-bit compilers). The -ddump-simpl Core dump is almost identical between the two, which is also evidenced by that using the -fllvm LLVM compiler back end switch for each results in code that runs at about the same speed for each compiler run (which would use the same Core output as used for NCG, right?). Under Windows, the compilation and run for 7.8.1 RC2 goes like this: {{{ *Main> :! E:\ghc-7.8.0.20140228_32\bin\ghc --make -pgmlo "E:\llvm32\build\Release\bin\opt" -pgmlc "E:\llvm32\build\Release\bin\llc" "GHC_NCG_OptimizationBug.hs" compile: input file WindowsVsLinuxNCG.hs Created temporary directory: C:\Users\Gordon\AppData\Local\Temp\ghc15460_0 *** Checking old interface for main:Main: *** Parser: *** Renamer/typechecker: [1 of 1] Compiling Main ( GHC_NCG_OptimizationBug.hs, GHC_NCG_OptimizationBug.o ) *** Desugar: Result size of Desugar (after optimization) = {terms: 260, types: 212, coercions: 0} *** Core Linted result of Desugar (after optimization): *** Simplifier: Result size of Simplifier iteration=1 = {terms: 213, types: 136, coercions: 52} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 215, types: 148, coercions: 67} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 209, types: 135, coercions: 51} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 209, types: 135, coercions: 42} *** Core Linted result of Simplifier: *** Specialise: Result size of Specialise = {terms: 209, types: 135, coercions: 42} *** Core Linted result of Specialise: *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}): Result size of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}) = {terms: 286, types: 185, coercions: 42} *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}): *** Float inwards: Result size of Float inwards = {terms: 286, types: 185, coercions: 42} *** Core Linted result of Float inwards: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 502, types: 393, coercions: 103} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 428, types: 326, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 420, types: 321, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 420, types: 321, coercions: 29} *** Core Linted result of Simplifier: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 418, types: 318, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 418, types: 318, coercions: 29} *** Core Linted result of Simplifier: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 475, types: 383, coercions: 32} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 444, types: 336, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 444, types: 336, coercions: 9} *** Core Linted result of Simplifier: *** Demand analysis: Result size of Demand analysis = {terms: 444, types: 336, coercions: 9} *** Core Linted result of Demand analysis: *** Worker Wrapper binds: Result size of Worker Wrapper binds = {terms: 579, types: 457, coercions: 9} *** Core Linted result of Worker Wrapper binds: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 510, types: 415, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 420, types: 322, coercions: 9} *** Core Linted result of Simplifier: *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}): Result size of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}) = {terms: 426, types: 326, coercions: 9} *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}): *** Common sub-expression: Result size of Common sub-expression = {terms: 424, types: 326, coercions: 9} *** Core Linted result of Common sub-expression: *** Float inwards: Result size of Float inwards = {terms: 424, types: 326, coercions: 9} *** Core Linted result of Float inwards: *** Liberate case: Result size of Liberate case = {terms: 1,824, types: 1,259, coercions: 9} *** Core Linted result of Liberate case: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 608, types: 422, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 604, types: 413, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 604, types: 413, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 604, types: 413, coercions: 9} *** Core Linted result of Simplifier: *** SpecConstr: Result size of SpecConstr = {terms: 708, types: 505, coercions: 9} *** Core Linted result of SpecConstr: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 702, types: 499, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 608, types: 405, coercions: 9} *** Core Linted result of Simplifier: *** Tidy Core: Result size of Tidy Core = {terms: 608, types: 405, coercions: 9} *** Core Linted result of Tidy Core: *** CorePrep: Result size of CorePrep = {terms: 825, types: 489, coercions: 9} *** Core Linted result of CorePrep: *** Stg2Stg: *** CodeOutput: *** New CodeGen: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** Assembler: "E:\ghc-7.8.0.20140228_32\lib/../mingw/bin/gcc.exe" "-U__i686" "-fno- stack-protector" "-DTABLES_NEXT_TO_CODE" "-I." "-x" "assembler-with-cpp" "-c" "C:\Users\Gordon\AppData\Local\Temp\ghc15460_0\ghc15460_2.s" "-o" "GHC_NCG_OptimizationBug.o" Linking GHC_NCG_OptimizationBug.exe ... *Main> :! GHC_NCG_OptimizationBug +RTS -s 46000000 32,965,096 bytes allocated in the heap 7,032 bytes copied during GC 41,756 bytes maximum residency (2 sample(s)) 19,684 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 61 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s INIT time 0.00s ( 0.00s elapsed) MUT time 1.73s ( 1.73s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 1.73s ( 1.73s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 19,006,902 bytes per MUT second Productivity 100.0% of total user, 100.2% of total elapsed }}} whereas under version 7.6.3 goes like this: {{{ *Main> :! E:\ghc-7.6.3_32\bin\ghc --make -pgmlo "E:\llvm32\build\Release\bin\opt" -pgmlc "E:\llvm32\build\Release\bin\llc" "GHC_NCG_OptimizationBug.hs" compile: input file GHC_NCG_OptimizationBug.hs Created temporary directory: C:\Users\Gordon\AppData\Local\Temp\ghc28200_0 *** Checking old interface for main:Main: *** Parser: *** Renamer/typechecker: [1 of 1] Compiling Main ( GHC_NCG_OptimizationBug.hs, GHC_NCG_OptimizationBug.o ) *** Desugar: Result size of Desugar (after optimization) = {terms: 247, types: 212, coercions: 0} *** Core Linted result of Desugar (after optimization): *** Simplifier: Result size of Simplifier iteration=1 = {terms: 198, types: 132, coercions: 35} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 200, types: 144, coercions: 43} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 194, types: 131, coercions: 57} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 194, types: 131, coercions: 39} *** Core Linted result of Simplifier: *** Specialise: Result size of Specialise = {terms: 194, types: 131, coercions: 39} *** Core Linted result of Specialise: *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}): Result size of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}) = {terms: 277, types: 191, coercions: 39} *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}): *** Float inwards: Result size of Float inwards = {terms: 277, types: 191, coercions: 39} *** Core Linted result of Float inwards: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 514, types: 403, coercions: 103} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 420, types: 317, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 412, types: 312, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 412, types: 312, coercions: 29} *** Core Linted result of Simplifier: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 410, types: 309, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 410, types: 309, coercions: 29} *** Core Linted result of Simplifier: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 455, types: 364, coercions: 32} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 422, types: 317, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 422, types: 317, coercions: 9} *** Core Linted result of Simplifier: *** Demand analysis: Result size of Demand analysis = {terms: 422, types: 317, coercions: 9} *** Core Linted result of Demand analysis: *** Worker Wrapper binds: Result size of Worker Wrapper binds = {terms: 536, types: 427, coercions: 9} *** Core Linted result of Worker Wrapper binds: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 480, types: 391, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 400, types: 306, coercions: 9} *** Core Linted result of Simplifier: *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}): Result size of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}) = {terms: 408, types: 311, coercions: 9} *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}): *** Common sub-expression: Result size of Common sub-expression = {terms: 406, types: 311, coercions: 9} *** Core Linted result of Common sub-expression: *** Float inwards: Result size of Float inwards = {terms: 406, types: 311, coercions: 9} *** Core Linted result of Float inwards: *** Liberate case: Result size of Liberate case = {terms: 1,186, types: 824, coercions: 9} *** Core Linted result of Liberate case: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 585, types: 411, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 569, types: 392, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 569, types: 392, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 569, types: 392, coercions: 9} *** Core Linted result of Simplifier: *** SpecConstr: Result size of SpecConstr = {terms: 746, types: 566, coercions: 9} *** Core Linted result of SpecConstr: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 739, types: 560, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 762, types: 546, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 642, types: 402, coercions: 9} *** Core Linted result of Simplifier: *** Tidy Core: Result size of Tidy Core = {terms: 642, types: 402, coercions: 9} *** Core Linted result of Tidy Core: writeBinIface: 10 Names writeBinIface: 34 dict entries *** CorePrep: Result size of CorePrep = {terms: 779, types: 483, coercions: 9} *** Core Linted result of CorePrep: *** Stg2Stg: *** CodeOutput: *** CodeGen: *** Assembler: "E:\ghc-7.6.3_32\lib/../mingw/bin/gcc.exe" "-fno-stack-protector" "-Wl ,--hash-size=31" "-Wl,--reduce-memory-overheads" "-I." "-c" "C:\Users\Gordon\AppData\Local\Temp\ghc28200_0\ghc28200_0.s" "-o" "GHC_NCG_OptimizationBug.o" Linking GHC_NCG_OptimizationBug.exe ... *Main> :! GHC_NCG_OptimizationBug +RTS -s 46000000 32,989,396 bytes allocated in the heap 4,976 bytes copied during GC 41,860 bytes maximum residency (2 sample(s)) 19,580 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 61 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s INIT time 0.00s ( 0.00s elapsed) MUT time 0.64s ( 0.64s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.66s ( 0.64s elapsed) %GC time 0.0% (0.1% elapsed) Alloc rate 51,495,642 bytes per MUT second Productivity 100.0% of total user, 102.3% of total elapsed }}} Looking at the ASM dump for the innermost tight culling loop reveals the problem, with 7.8.1 RC2 outputting as follow: {{{ _n3nx: movl 76(%esp),%ecx _c3gf: cmpl %ecx,%eax jg _c3jB _c3jC: movl %eax,%edx sarl $5,%edx movl %ecx,76(%esp) movl $1,%ecx movl %ecx,280(%esp) movl %eax,%ecx andl $31,%ecx movl %eax,292(%esp) movl 280(%esp),%eax shll %cl,%eax xorl $-1,%eax movl 64(%esp),%ecx addl $8,%ecx movl (%ecx,%edx,4),%ecx andl %eax,%ecx movl 64(%esp),%eax addl $8,%eax movl %ecx,(%eax,%edx,4) movl 292(%esp),%eax addl $3,%eax jmp _n3nx }}} and 7.6.3 outputting as follows: {{{ .text .align 4,0x90 .long 1894 .long 32 s1GZ_info: _c1YB: cmpl 16(%ebp),%esi jg _c1YE movl %esi,%edx sarl $5,%edx movl $1,%eax movl %esi,%ecx andl $31,%ecx shll %cl,%eax xorl $-1,%eax movl 12(%ebp),%ecx movl 8(%ecx,%edx,4),%ecx andl %eax,%ecx movl 12(%ebp),%eax movl %ecx,8(%eax,%edx,4) addl 4(%ebp),%esi jmp s1GZ_info _c1YE: movl 8(%ebp),%esi addl $8,%ebp jmp s1GB_info }}} The second code is clearly much more efficient, with the only memory access reading/writing the sieve buffer array and one register reload of the prime value to add to the current position index, whereas the first (7.8.1 RC2) code has three register spills and five register re-loads, almost as if debugging were still turned on. This bug was tested under Windows, but likely applies to other platforms, at least for 32-bit versions but also possibly to others. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:29 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator for 8.0.1 is not as optimized as 7.6.3... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * milestone: 8.2.1 => Comment: I'm de-milestoning this due to a lack of developer interest. Currently we have many bugs and few developers; many of these bugs affect platforms far more common than i386 and these are the bugs we will focus on. However, don't let this stop you, the motivated reader, from picking this up. There are several ideas for fixing this issue described in this ticket. Please do pick one and implement it! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:30 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator for 8.0.1 is not as optimized as 7.6.3... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by GordonBGood): Replying to [comment:30 bgamari]:
I'm de-milestoning this due to a lack of developer interest. Currently we have many bugs and few developers; many of these bugs affect platforms far more common than i386 and these are the bugs we will focus on.
I agree that this issue isn't all that important in the Great Scheme Of Things given that the Native Code Generator (NCG) is gradually being relegated to just a builder for quick-cycle debugging purposes as the (slower compiling, but producing generally faster code) LLVM back end becomes more and more stable (and usable, especially if the required LLVM files, including patches, are distributed with/made available for each GHC version starting with 8.2.1, as promised[https://ghc.haskell.org/trac/ghc/wiki/ImprovedLLVMBackend]). Much more important are performance bugs that affect '''both''' the NCG and LLVM back ends such as the (I'm beginning to think related to each other) #12798 and #12808 pair of issues that although only slowing performance by up to 40% or so (although for all platforms instead of only x86 by 100% as here), affect the perception that GHC code is "slower than Cee" and for which there is no workaround such as choosing an alternate back end. I'll monitor this and will probably either close the issue or at least reduce its priority if version 8.2.1 integrates the LLVM back end as well as it promises. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:31 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator for 8.0.1 is not as optimized as 7.6.3... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): To be clear, currently the improved LLVM backend plan is in need of someone to implement it. See #10074 for details. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:32 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8971: Native Code Generator for 8.0.1 is not as optimized as 7.6.3... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by GordonBGood): Replying to [comment:32 bgamari]:
To be clear, currently the improved LLVM backend plan is in need of someone to implement it. See #10074 for details.
That's a pity, but not a problem as long as the version of LLVM that the version of GHC requires (opt and llc from LLVM) are readily available and the patches for the underlying system support (Mingw binutils in the case of Windows, especially patched for Windows 64 to make them work with LLVM on Windows) are also readily available. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:33 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC