[GHC] #8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3...

8 Apr 2014

      #8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3...
------------------------------+--------------------------------------------
       Reporter:              |             Owner:
  GordonBGood                 |            Status:  new
           Type:  bug         |         Milestone:
       Priority:  normal      |           Version:  7.8.1-rc2
      Component:  Compiler    |  Operating System:  Unknown/Multiple
       Keywords:              |   Type of failure:  Runtime performance bug
   Architecture:              |         Test Case:
  Unknown/Multiple            |          Blocking:
     Difficulty:  Unknown     |
     Blocked By:              |
Related Tickets:              |
------------------------------+--------------------------------------------
 The output assembly code is not as optimized for the Windows 32-bit
 version 7.8.1 RC2 compiler as the Windows 7.6.3 compiler (32-bit) when the
 option switches are exactly the same although it may not be limited to
 only the Windows platform; this has a negative impact on execution time
 for tight loops of about a factor of two times slower.

 The following code will reproduce the problem:

 {{{#!haskell
 -- GHC_NCG_OptimizationBug.hs
 -- it seems the Haskell GHC 7.8.1 NCG  Native Code Generator (NCG) doesn't
 -- optimize as well for (at least) the x86 target as version 7.6.3

 {-# OPTIONS_GHC -O3 -rtsopts -v -dcore-lint -ddump-asm -ddump-to-file
 -dumpdir . #-} -- or O2

 import Data.Bits
 import Control.Monad.ST (runST,ST(..))
 import Data.Array.Base

 -- Uses a very simple Sieve of Eratosthenes to 2 ^ 18 to prove it.
 accNumPrimes :: Int -> Int
 accNumPrimes acc = acc `seq` runST $ do
   let bfSz = (256 * 1024 - 3) `div` 2
       bfLmtWrds = (bfSz + 1) `div` 32
   bufw <- newArray (0, bfLmtWrds) (-1) :: ST s (STUArray s Int Int)
   -- to clear the last "uneven" bit(s)
   unsafeWrite bufw bfLmtWrds (complement ((-2) `shiftL` (bfSz .&. 31)))
   bufb <- (castSTUArray :: STUArray s Int Int -> ST s (STUArray s Int
 Bool)) bufw
   let cullp i =
         let p = i + i + 3 in
         let s = (p * p - 3) `div` 2 in
         if s > bfSz then
           let count i sm = do
                 sm `seq` if i > bfLmtWrds then return (acc + sm) else do
                   wd <- unsafeRead bufw i
                   count (i + 1) (sm + (popCount wd)) in
           count 0 1 -- use '1' for the '2' prime not in the array
         else do
           v <- unsafeRead bufb i
           if v then
             let cull j = do -- very tight inner loop
                   if j > bfSz then cullp (i + 1) else do
                     unsafeWrite bufb j False
                     cull (j + p) in
             cull s
           else cullp (i + 1)
   cullp 0

 main =
   -- run the program a number of times to get a reasonable time...
   let numloops = 2000 in
   let loop n acc =
         acc `seq` if n <= 0 then acc else
         loop (n - 1) (accNumPrimes acc) in
   print $ loop numloops 0
 }}}

 The above code takes almost twice as long to run when compiled under 7.8.1
 RC2 for Windows (32-bit) as it does for the version 7.6.3 compiler (both
 32-bit compilers).

 The -ddump-simpl Core dump is almost identical between the two, which is
 also evidenced by that using the -fllvm LLVM compiler back end switch for
 each results in code that runs at about the same speed for each compiler
 run (which would use the same Core output as used for NCG, right?).

 Under Windows, the compilation and run for 7.8.1 RC2 goes like this:
 {{{
 *Main> :! E:\ghc-7.8.0.20140228_32\bin\ghc --make -pgmlo
 "E:\llvm32\build\Release\bin\opt" -pgmlc "E:\llvm32\build\Release\bin\llc"
 "GHC_NCG_OptimizationBug.hs"
 compile: input file WindowsVsLinuxNCG.hs
 Created temporary directory: C:\Users\Gordon\AppData\Local\Temp\ghc15460_0
 *** Checking old interface for main:Main:
 *** Parser:
 *** Renamer/typechecker:
 [1 of 1] Compiling Main             ( GHC_NCG_OptimizationBug.hs,
 GHC_NCG_OptimizationBug.o )
 *** Desugar:
 Result size of Desugar (after optimization)
   = {terms: 260, types: 212, coercions: 0}
 *** Core Linted result of Desugar (after optimization):
 *** Simplifier:
 Result size of Simplifier iteration=1
   = {terms: 213, types: 136, coercions: 52}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=2
   = {terms: 215, types: 148, coercions: 67}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=3
   = {terms: 209, types: 135, coercions: 51}
 *** Core Linted result of Simplifier:
 Result size of Simplifier = {terms: 209, types: 135, coercions: 42}
 *** Core Linted result of Simplifier:
 *** Specialise:
 Result size of Specialise = {terms: 209, types: 135, coercions: 42}
 *** Core Linted result of Specialise:
 *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
 Result size of Float out(FOS {Lam = Just 0,
                               Consts = True,
                               PAPs = False})
   = {terms: 286, types: 185, coercions: 42}
 *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs
 = False}):
 *** Float inwards:
 Result size of Float inwards
   = {terms: 286, types: 185, coercions: 42}
 *** Core Linted result of Float inwards:
 *** Simplifier:
 Result size of Simplifier iteration=1
   = {terms: 502, types: 393, coercions: 103}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=2
   = {terms: 428, types: 326, coercions: 29}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=3
   = {terms: 420, types: 321, coercions: 29}
 *** Core Linted result of Simplifier:
 Result size of Simplifier = {terms: 420, types: 321, coercions: 29}
 *** Core Linted result of Simplifier:
 *** Simplifier:
 Result size of Simplifier iteration=1
   = {terms: 418, types: 318, coercions: 29}
 *** Core Linted result of Simplifier:
 Result size of Simplifier = {terms: 418, types: 318, coercions: 29}
 *** Core Linted result of Simplifier:
 *** Simplifier:
 Result size of Simplifier iteration=1
   = {terms: 475, types: 383, coercions: 32}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=2
   = {terms: 444, types: 336, coercions: 9}
 *** Core Linted result of Simplifier:
 Result size of Simplifier = {terms: 444, types: 336, coercions: 9}
 *** Core Linted result of Simplifier:
 *** Demand analysis:
 Result size of Demand analysis
   = {terms: 444, types: 336, coercions: 9}
 *** Core Linted result of Demand analysis:
 *** Worker Wrapper binds:
 Result size of Worker Wrapper binds
   = {terms: 579, types: 457, coercions: 9}
 *** Core Linted result of Worker Wrapper binds:
 *** Simplifier:
 Result size of Simplifier iteration=1
   = {terms: 510, types: 415, coercions: 9}
 *** Core Linted result of Simplifier:
 Result size of Simplifier = {terms: 420, types: 322, coercions: 9}
 *** Core Linted result of Simplifier:
 *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
 Result size of Float out(FOS {Lam = Just 0,
                               Consts = True,
                               PAPs = True})
   = {terms: 426, types: 326, coercions: 9}
 *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs
 = True}):
 *** Common sub-expression:
 Result size of Common sub-expression
   = {terms: 424, types: 326, coercions: 9}
 *** Core Linted result of Common sub-expression:
 *** Float inwards:
 Result size of Float inwards
   = {terms: 424, types: 326, coercions: 9}
 *** Core Linted result of Float inwards:
 *** Liberate case:
 Result size of Liberate case
   = {terms: 1,824, types: 1,259, coercions: 9}
 *** Core Linted result of Liberate case:
 *** Simplifier:
 Result size of Simplifier iteration=1
   = {terms: 608, types: 422, coercions: 9}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=2
   = {terms: 604, types: 413, coercions: 9}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=3
   = {terms: 604, types: 413, coercions: 9}
 *** Core Linted result of Simplifier:
 Result size of Simplifier = {terms: 604, types: 413, coercions: 9}
 *** Core Linted result of Simplifier:
 *** SpecConstr:
 Result size of SpecConstr = {terms: 708, types: 505, coercions: 9}
 *** Core Linted result of SpecConstr:
 *** Simplifier:
 Result size of Simplifier iteration=1
   = {terms: 702, types: 499, coercions: 9}
 *** Core Linted result of Simplifier:
 Result size of Simplifier = {terms: 608, types: 405, coercions: 9}
 *** Core Linted result of Simplifier:
 *** Tidy Core:
 Result size of Tidy Core = {terms: 608, types: 405, coercions: 9}
 *** Core Linted result of Tidy Core:
 *** CorePrep:
 Result size of CorePrep = {terms: 825, types: 489, coercions: 9}
 *** Core Linted result of CorePrep:
 *** Stg2Stg:
 *** CodeOutput:
 *** New CodeGen:
 *** CPSZ:
 *** CPSZ:
 *** CPSZ:
 *** CPSZ:
 *** CPSZ:
 *** CPSZ:
 *** CPSZ:
 *** CPSZ:
 *** CPSZ:
 *** CPSZ:
 *** CPSZ:
 *** CPSZ:
 *** CPSZ:
 *** CPSZ:
 *** CPSZ:
 *** Assembler:
 "E:\ghc-7.8.0.20140228_32\lib/../mingw/bin/gcc.exe" "-U__i686" "-fno-
 stack-protector" "-DTABLES_NEXT_TO_CODE" "-I." "-x" "assembler-with-cpp"
 "-c" "C:\Users\Gordon\AppData\Local\Temp\ghc15460_0\ghc15460_2.s" "-o"
 "GHC_NCG_OptimizationBug.o"
 Linking GHC_NCG_OptimizationBug.exe ...
 *Main> :! GHC_NCG_OptimizationBug +RTS -s
 46000000
       32,965,096 bytes allocated in the heap
            7,032 bytes copied during GC
           41,756 bytes maximum residency (2 sample(s))
           19,684 bytes maximum slop
                2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max
 pause
   Gen  0        61 colls,     0 par    0.00s    0.00s     0.0000s
 0.0000s
   Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s
 0.0001s

   INIT    time    0.00s  (  0.00s elapsed)
   MUT     time    1.73s  (  1.73s elapsed)
   GC      time    0.00s  (  0.00s elapsed)
   EXIT    time    0.00s  (  0.00s elapsed)
   Total   time    1.73s  (  1.73s elapsed)

   %GC     time       0.0%  (0.0% elapsed)

   Alloc rate    19,006,902 bytes per MUT second

   Productivity 100.0% of total user, 100.2% of total elapsed
 }}}

 whereas under version 7.6.3 goes like this:
 {{{
 *Main> :! E:\ghc-7.6.3_32\bin\ghc --make -pgmlo
 "E:\llvm32\build\Release\bin\opt" -pgmlc "E:\llvm32\build\Release\bin\llc"
 "GHC_NCG_OptimizationBug.hs"
 compile: input file GHC_NCG_OptimizationBug.hs
 Created temporary directory: C:\Users\Gordon\AppData\Local\Temp\ghc28200_0
 *** Checking old interface for main:Main:
 *** Parser:
 *** Renamer/typechecker:
 [1 of 1] Compiling Main             ( GHC_NCG_OptimizationBug.hs,
 GHC_NCG_OptimizationBug.o )
 *** Desugar:
 Result size of Desugar (after optimization)
   = {terms: 247, types: 212, coercions: 0}
 *** Core Linted result of Desugar (after optimization):
 *** Simplifier:
 Result size of Simplifier iteration=1
   = {terms: 198, types: 132, coercions: 35}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=2
   = {terms: 200, types: 144, coercions: 43}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=3
   = {terms: 194, types: 131, coercions: 57}
 *** Core Linted result of Simplifier:
 Result size of Simplifier = {terms: 194, types: 131, coercions: 39}
 *** Core Linted result of Simplifier:
 *** Specialise:
 Result size of Specialise = {terms: 194, types: 131, coercions: 39}
 *** Core Linted result of Specialise:
 *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
 Result size of Float out(FOS {Lam = Just 0,
                               Consts = True,
                               PAPs = False})
   = {terms: 277, types: 191, coercions: 39}
 *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs
 = False}):
 *** Float inwards:
 Result size of Float inwards
   = {terms: 277, types: 191, coercions: 39}
 *** Core Linted result of Float inwards:
 *** Simplifier:
 Result size of Simplifier iteration=1
   = {terms: 514, types: 403, coercions: 103}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=2
   = {terms: 420, types: 317, coercions: 29}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=3
   = {terms: 412, types: 312, coercions: 29}
 *** Core Linted result of Simplifier:
 Result size of Simplifier = {terms: 412, types: 312, coercions: 29}
 *** Core Linted result of Simplifier:
 *** Simplifier:
 Result size of Simplifier iteration=1
   = {terms: 410, types: 309, coercions: 29}
 *** Core Linted result of Simplifier:
 Result size of Simplifier = {terms: 410, types: 309, coercions: 29}
 *** Core Linted result of Simplifier:
 *** Simplifier:
 Result size of Simplifier iteration=1
   = {terms: 455, types: 364, coercions: 32}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=2
   = {terms: 422, types: 317, coercions: 9}
 *** Core Linted result of Simplifier:
 Result size of Simplifier = {terms: 422, types: 317, coercions: 9}
 *** Core Linted result of Simplifier:
 *** Demand analysis:
 Result size of Demand analysis
   = {terms: 422, types: 317, coercions: 9}
 *** Core Linted result of Demand analysis:
 *** Worker Wrapper binds:
 Result size of Worker Wrapper binds
   = {terms: 536, types: 427, coercions: 9}
 *** Core Linted result of Worker Wrapper binds:
 *** Simplifier:
 Result size of Simplifier iteration=1
   = {terms: 480, types: 391, coercions: 9}
 *** Core Linted result of Simplifier:
 Result size of Simplifier = {terms: 400, types: 306, coercions: 9}
 *** Core Linted result of Simplifier:

 *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
 Result size of Float out(FOS {Lam = Just 0,
                               Consts = True,
                               PAPs = True})
   = {terms: 408, types: 311, coercions: 9}
 *** Core Linted result of Float out(FOS {Lam = Just 0, Consts = True, PAPs
 = True}):
 *** Common sub-expression:
 Result size of Common sub-expression
   = {terms: 406, types: 311, coercions: 9}
 *** Core Linted result of Common sub-expression:
 *** Float inwards:
 Result size of Float inwards
   = {terms: 406, types: 311, coercions: 9}
 *** Core Linted result of Float inwards:
 *** Liberate case:
 Result size of Liberate case
   = {terms: 1,186, types: 824, coercions: 9}
 *** Core Linted result of Liberate case:
 *** Simplifier:
 Result size of Simplifier iteration=1
   = {terms: 585, types: 411, coercions: 9}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=2
   = {terms: 569, types: 392, coercions: 9}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=3
   = {terms: 569, types: 392, coercions: 9}
 *** Core Linted result of Simplifier:
 Result size of Simplifier = {terms: 569, types: 392, coercions: 9}
 *** Core Linted result of Simplifier:
 *** SpecConstr:
 Result size of SpecConstr = {terms: 746, types: 566, coercions: 9}
 *** Core Linted result of SpecConstr:
 *** Simplifier:
 Result size of Simplifier iteration=1
   = {terms: 739, types: 560, coercions: 9}
 *** Core Linted result of Simplifier:
 Result size of Simplifier iteration=2
   = {terms: 762, types: 546, coercions: 9}
 *** Core Linted result of Simplifier:
 Result size of Simplifier = {terms: 642, types: 402, coercions: 9}
 *** Core Linted result of Simplifier:
 *** Tidy Core:
 Result size of Tidy Core = {terms: 642, types: 402, coercions: 9}
 *** Core Linted result of Tidy Core:
 writeBinIface: 10 Names
 writeBinIface: 34 dict entries
 *** CorePrep:
 Result size of CorePrep = {terms: 779, types: 483, coercions: 9}
 *** Core Linted result of CorePrep:
 *** Stg2Stg:
 *** CodeOutput:
 *** CodeGen:
 *** Assembler:
 "E:\ghc-7.6.3_32\lib/../mingw/bin/gcc.exe" "-fno-stack-protector" "-Wl
 ,--hash-size=31" "-Wl,--reduce-memory-overheads" "-I." "-c"
 "C:\Users\Gordon\AppData\Local\Temp\ghc28200_0\ghc28200_0.s" "-o"
 "GHC_NCG_OptimizationBug.o"
 Linking GHC_NCG_OptimizationBug.exe ...
 *Main> :! GHC_NCG_OptimizationBug +RTS -s
 46000000
       32,989,396 bytes allocated in the heap
            4,976 bytes copied during GC
           41,860 bytes maximum residency (2 sample(s))
           19,580 bytes maximum slop
                2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max
 pause
   Gen  0        61 colls,     0 par    0.00s    0.00s     0.0000s
 0.0000s
   Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s
 0.0001s

   INIT    time    0.00s  (  0.00s elapsed)
   MUT     time    0.64s  (  0.64s elapsed)
   GC      time    0.00s  (  0.00s elapsed)
   EXIT    time    0.00s  (  0.00s elapsed)
   Total   time    0.66s  (  0.64s elapsed)

   %GC     time       0.0%  (0.1% elapsed)

   Alloc rate    51,495,642 bytes per MUT second

   Productivity 100.0% of total user, 102.3% of total elapsed
 }}}

 Looking at the ASM dump for the innermost tight culling loop reveals the
 problem, with 7.8.1 RC2 outputting as follow:
 {{{
 _n3nx:
         movl 76(%esp),%ecx
 _c3gf:
         cmpl %ecx,%eax
         jg _c3jB
 _c3jC:
         movl %eax,%edx
         sarl $5,%edx
         movl %ecx,76(%esp)
         movl $1,%ecx
         movl %ecx,280(%esp)
         movl %eax,%ecx
         andl $31,%ecx
         movl %eax,292(%esp)
         movl 280(%esp),%eax
         shll %cl,%eax
         xorl $-1,%eax
         movl 64(%esp),%ecx
         addl $8,%ecx
         movl (%ecx,%edx,4),%ecx
         andl %eax,%ecx
         movl 64(%esp),%eax
         addl $8,%eax
         movl %ecx,(%eax,%edx,4)
         movl 292(%esp),%eax
         addl $3,%eax
         jmp _n3nx
 }}}

 and 7.6.3 outputting as follows:
 {{{
 .text
         .align 4,0x90
         .long   1894
         .long   32
 s1GZ_info:
 _c1YB:
         cmpl 16(%ebp),%esi
         jg _c1YE
         movl %esi,%edx
         sarl $5,%edx
         movl $1,%eax
         movl %esi,%ecx
         andl $31,%ecx
         shll %cl,%eax
         xorl $-1,%eax
         movl 12(%ebp),%ecx
         movl 8(%ecx,%edx,4),%ecx
         andl %eax,%ecx
         movl 12(%ebp),%eax
         movl %ecx,8(%eax,%edx,4)
         addl 4(%ebp),%esi
         jmp s1GZ_info
 _c1YE:
         movl 8(%ebp),%esi
         addl $8,%ebp
         jmp s1GB_info
 }}}

 The second code is clearly much more efficient, with the only memory
 access reading/writing the sieve buffer array and one register reload of
 the prime value to add to the current position index, whereas the first
 (7.8.1 RC2) code has three register spills and five register re-loads,
 almost as if debugging were still turned on.

 This bug was tested under Windows, but likely applies to other platforms,
 at least for 32-bit versions but also possibly to others.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler