
#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3... ------------------------------+-------------------------------------------- Reporter: | Owner: GordonBGood | Status: new Type: bug | Milestone: Priority: normal | Version: 7.8.1-rc2 Component: Compiler | Operating System: Unknown/Multiple Keywords: | Type of failure: Runtime performance bug Architecture: | Test Case: Unknown/Multiple | Blocking: Difficulty: Unknown | Blocked By: | Related Tickets: | ------------------------------+-------------------------------------------- The output assembly code is not as optimized for the Windows 32-bit version 7.8.1 RC2 compiler as the Windows 7.6.3 compiler (32-bit) when the option switches are exactly the same although it may not be limited to only the Windows platform; this has a negative impact on execution time for tight loops of about a factor of two times slower. The following code will reproduce the problem: {{{#!haskell -- GHC_NCG_OptimizationBug.hs -- it seems the Haskell GHC 7.8.1 NCG Native Code Generator (NCG) doesn't -- optimize as well for (at least) the x86 target as version 7.6.3 {-# OPTIONS_GHC -O3 -rtsopts -v -dcore-lint -ddump-asm -ddump-to-file -dumpdir . #-} -- or O2 import Data.Bits import Control.Monad.ST (runST,ST(..)) import Data.Array.Base -- Uses a very simple Sieve of Eratosthenes to 2 ^ 18 to prove it. accNumPrimes :: Int -> Int accNumPrimes acc = acc `seq` runST $ do let bfSz = (256 * 1024 - 3) `div` 2 bfLmtWrds = (bfSz + 1) `div` 32 bufw <- newArray (0, bfLmtWrds) (-1) :: ST s (STUArray s Int Int) -- to clear the last "uneven" bit(s) unsafeWrite bufw bfLmtWrds (complement ((-2) `shiftL` (bfSz .&. 31))) bufb <- (castSTUArray :: STUArray s Int Int -> ST s (STUArray s Int Bool)) bufw let cullp i = let p = i + i + 3 in let s = (p * p - 3) `div` 2 in if s > bfSz then let count i sm = do sm `seq` if i > bfLmtWrds then return (acc + sm) else do wd <- unsafeRead bufw i count (i + 1) (sm + (popCount wd)) in count 0 1 -- use '1' for the '2' prime not in the array else do v <- unsafeRead bufb i if v then let cull j = do -- very tight inner loop if j > bfSz then cullp (i + 1) else do unsafeWrite bufb j False cull (j + p) in cull s else cullp (i + 1) cullp 0 main = -- run the program a number of times to get a reasonable time... let numloops = 2000 in let loop n acc = acc `seq` if n <= 0 then acc else loop (n - 1) (accNumPrimes acc) in print $ loop numloops 0 }}} The above code takes almost twice as long to run when compiled under 7.8.1 RC2 for Windows (32-bit) as it does for the version 7.6.3 compiler (both 32-bit compilers). The -ddump-simpl Core dump is almost identical between the two, which is also evidenced by that using the -fllvm LLVM compiler back end switch for each results in code that runs at about the same speed for each compiler run (which would use the same Core output as used for NCG, right?). Under Windows, the compilation and run for 7.8.1 RC2 goes like this: {{{ *Main> :! E:\ghc-7.8.0.20140228_32\bin\ghc --make -pgmlo "E:\llvm32\build\Release\bin\opt" -pgmlc "E:\llvm32\build\Release\bin\llc" "GHC_NCG_OptimizationBug.hs" compile: input file WindowsVsLinuxNCG.hs Created temporary directory: C:\Users\Gordon\AppData\Local\Temp\ghc15460_0 *** Checking old interface for main:Main: *** Parser: *** Renamer/typechecker: [1 of 1] Compiling Main ( GHC_NCG_OptimizationBug.hs, GHC_NCG_OptimizationBug.o ) *** Desugar: Result size of Desugar (after optimization) = {terms: 260, types: 212, coercions: 0} *** Core Linted result of Desugar (after optimization): *** Simplifier: Result size of Simplifier iteration=1 = {terms: 213, types: 136, coercions: 52} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 215, types: 148, coercions: 67} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 209, types: 135, coercions: 51} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 209, types: 135, coercions: 42} *** Core Linted result of Simplifier: *** Specialise: Result size of Specialise = {terms: 209, types: 135, coercions: 42} *** Core Linted result of Specialise: *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}): Result size of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}) = {terms: 286, types: 185, coercions: 42} *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}): *** Float inwards: Result size of Float inwards = {terms: 286, types: 185, coercions: 42} *** Core Linted result of Float inwards: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 502, types: 393, coercions: 103} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 428, types: 326, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 420, types: 321, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 420, types: 321, coercions: 29} *** Core Linted result of Simplifier: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 418, types: 318, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 418, types: 318, coercions: 29} *** Core Linted result of Simplifier: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 475, types: 383, coercions: 32} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 444, types: 336, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 444, types: 336, coercions: 9} *** Core Linted result of Simplifier: *** Demand analysis: Result size of Demand analysis = {terms: 444, types: 336, coercions: 9} *** Core Linted result of Demand analysis: *** Worker Wrapper binds: Result size of Worker Wrapper binds = {terms: 579, types: 457, coercions: 9} *** Core Linted result of Worker Wrapper binds: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 510, types: 415, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 420, types: 322, coercions: 9} *** Core Linted result of Simplifier: *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}): Result size of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}) = {terms: 426, types: 326, coercions: 9} *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}): *** Common sub-expression: Result size of Common sub-expression = {terms: 424, types: 326, coercions: 9} *** Core Linted result of Common sub-expression: *** Float inwards: Result size of Float inwards = {terms: 424, types: 326, coercions: 9} *** Core Linted result of Float inwards: *** Liberate case: Result size of Liberate case = {terms: 1,824, types: 1,259, coercions: 9} *** Core Linted result of Liberate case: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 608, types: 422, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 604, types: 413, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 604, types: 413, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 604, types: 413, coercions: 9} *** Core Linted result of Simplifier: *** SpecConstr: Result size of SpecConstr = {terms: 708, types: 505, coercions: 9} *** Core Linted result of SpecConstr: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 702, types: 499, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 608, types: 405, coercions: 9} *** Core Linted result of Simplifier: *** Tidy Core: Result size of Tidy Core = {terms: 608, types: 405, coercions: 9} *** Core Linted result of Tidy Core: *** CorePrep: Result size of CorePrep = {terms: 825, types: 489, coercions: 9} *** Core Linted result of CorePrep: *** Stg2Stg: *** CodeOutput: *** New CodeGen: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** CPSZ: *** Assembler: "E:\ghc-7.8.0.20140228_32\lib/../mingw/bin/gcc.exe" "-U__i686" "-fno- stack-protector" "-DTABLES_NEXT_TO_CODE" "-I." "-x" "assembler-with-cpp" "-c" "C:\Users\Gordon\AppData\Local\Temp\ghc15460_0\ghc15460_2.s" "-o" "GHC_NCG_OptimizationBug.o" Linking GHC_NCG_OptimizationBug.exe ... *Main> :! GHC_NCG_OptimizationBug +RTS -s 46000000 32,965,096 bytes allocated in the heap 7,032 bytes copied during GC 41,756 bytes maximum residency (2 sample(s)) 19,684 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 61 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s INIT time 0.00s ( 0.00s elapsed) MUT time 1.73s ( 1.73s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 1.73s ( 1.73s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 19,006,902 bytes per MUT second Productivity 100.0% of total user, 100.2% of total elapsed }}} whereas under version 7.6.3 goes like this: {{{ *Main> :! E:\ghc-7.6.3_32\bin\ghc --make -pgmlo "E:\llvm32\build\Release\bin\opt" -pgmlc "E:\llvm32\build\Release\bin\llc" "GHC_NCG_OptimizationBug.hs" compile: input file GHC_NCG_OptimizationBug.hs Created temporary directory: C:\Users\Gordon\AppData\Local\Temp\ghc28200_0 *** Checking old interface for main:Main: *** Parser: *** Renamer/typechecker: [1 of 1] Compiling Main ( GHC_NCG_OptimizationBug.hs, GHC_NCG_OptimizationBug.o ) *** Desugar: Result size of Desugar (after optimization) = {terms: 247, types: 212, coercions: 0} *** Core Linted result of Desugar (after optimization): *** Simplifier: Result size of Simplifier iteration=1 = {terms: 198, types: 132, coercions: 35} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 200, types: 144, coercions: 43} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 194, types: 131, coercions: 57} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 194, types: 131, coercions: 39} *** Core Linted result of Simplifier: *** Specialise: Result size of Specialise = {terms: 194, types: 131, coercions: 39} *** Core Linted result of Specialise: *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}): Result size of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}) = {terms: 277, types: 191, coercions: 39} *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}): *** Float inwards: Result size of Float inwards = {terms: 277, types: 191, coercions: 39} *** Core Linted result of Float inwards: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 514, types: 403, coercions: 103} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 420, types: 317, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 412, types: 312, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 412, types: 312, coercions: 29} *** Core Linted result of Simplifier: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 410, types: 309, coercions: 29} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 410, types: 309, coercions: 29} *** Core Linted result of Simplifier: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 455, types: 364, coercions: 32} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 422, types: 317, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 422, types: 317, coercions: 9} *** Core Linted result of Simplifier: *** Demand analysis: Result size of Demand analysis = {terms: 422, types: 317, coercions: 9} *** Core Linted result of Demand analysis: *** Worker Wrapper binds: Result size of Worker Wrapper binds = {terms: 536, types: 427, coercions: 9} *** Core Linted result of Worker Wrapper binds: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 480, types: 391, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 400, types: 306, coercions: 9} *** Core Linted result of Simplifier: *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}): Result size of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}) = {terms: 408, types: 311, coercions: 9} *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}): *** Common sub-expression: Result size of Common sub-expression = {terms: 406, types: 311, coercions: 9} *** Core Linted result of Common sub-expression: *** Float inwards: Result size of Float inwards = {terms: 406, types: 311, coercions: 9} *** Core Linted result of Float inwards: *** Liberate case: Result size of Liberate case = {terms: 1,186, types: 824, coercions: 9} *** Core Linted result of Liberate case: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 585, types: 411, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 569, types: 392, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=3 = {terms: 569, types: 392, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 569, types: 392, coercions: 9} *** Core Linted result of Simplifier: *** SpecConstr: Result size of SpecConstr = {terms: 746, types: 566, coercions: 9} *** Core Linted result of SpecConstr: *** Simplifier: Result size of Simplifier iteration=1 = {terms: 739, types: 560, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier iteration=2 = {terms: 762, types: 546, coercions: 9} *** Core Linted result of Simplifier: Result size of Simplifier = {terms: 642, types: 402, coercions: 9} *** Core Linted result of Simplifier: *** Tidy Core: Result size of Tidy Core = {terms: 642, types: 402, coercions: 9} *** Core Linted result of Tidy Core: writeBinIface: 10 Names writeBinIface: 34 dict entries *** CorePrep: Result size of CorePrep = {terms: 779, types: 483, coercions: 9} *** Core Linted result of CorePrep: *** Stg2Stg: *** CodeOutput: *** CodeGen: *** Assembler: "E:\ghc-7.6.3_32\lib/../mingw/bin/gcc.exe" "-fno-stack-protector" "-Wl ,--hash-size=31" "-Wl,--reduce-memory-overheads" "-I." "-c" "C:\Users\Gordon\AppData\Local\Temp\ghc28200_0\ghc28200_0.s" "-o" "GHC_NCG_OptimizationBug.o" Linking GHC_NCG_OptimizationBug.exe ... *Main> :! GHC_NCG_OptimizationBug +RTS -s 46000000 32,989,396 bytes allocated in the heap 4,976 bytes copied during GC 41,860 bytes maximum residency (2 sample(s)) 19,580 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 61 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s INIT time 0.00s ( 0.00s elapsed) MUT time 0.64s ( 0.64s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.66s ( 0.64s elapsed) %GC time 0.0% (0.1% elapsed) Alloc rate 51,495,642 bytes per MUT second Productivity 100.0% of total user, 102.3% of total elapsed }}} Looking at the ASM dump for the innermost tight culling loop reveals the problem, with 7.8.1 RC2 outputting as follow: {{{ _n3nx: movl 76(%esp),%ecx _c3gf: cmpl %ecx,%eax jg _c3jB _c3jC: movl %eax,%edx sarl $5,%edx movl %ecx,76(%esp) movl $1,%ecx movl %ecx,280(%esp) movl %eax,%ecx andl $31,%ecx movl %eax,292(%esp) movl 280(%esp),%eax shll %cl,%eax xorl $-1,%eax movl 64(%esp),%ecx addl $8,%ecx movl (%ecx,%edx,4),%ecx andl %eax,%ecx movl 64(%esp),%eax addl $8,%eax movl %ecx,(%eax,%edx,4) movl 292(%esp),%eax addl $3,%eax jmp _n3nx }}} and 7.6.3 outputting as follows: {{{ .text .align 4,0x90 .long 1894 .long 32 s1GZ_info: _c1YB: cmpl 16(%ebp),%esi jg _c1YE movl %esi,%edx sarl $5,%edx movl $1,%eax movl %esi,%ecx andl $31,%ecx shll %cl,%eax xorl $-1,%eax movl 12(%ebp),%ecx movl 8(%ecx,%edx,4),%ecx andl %eax,%ecx movl 12(%ebp),%eax movl %ecx,8(%eax,%edx,4) addl 4(%ebp),%esi jmp s1GZ_info _c1YE: movl 8(%ebp),%esi addl $8,%ebp jmp s1GB_info }}} The second code is clearly much more efficient, with the only memory access reading/writing the sieve buffer array and one register reload of the prime value to add to the current position index, whereas the first (7.8.1 RC2) code has three register spills and five register re-loads, almost as if debugging were still turned on. This bug was tested under Windows, but likely applies to other platforms, at least for 32-bit versions but also possibly to others. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler