
#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3...
--------------------------------------------+------------------------------
        Reporter:  GordonBGood              |            Owner:
            Type:  bug                      |           Status:  new
        Priority:  normal                   |        Milestone:
       Component:  Compiler (NCG)           |          Version:  7.8.1-rc2
      Resolution:                           |         Keywords:
Operating System:  Unknown/Multiple         |     Architecture:
 Type of failure:  Runtime performance bug  |  Unknown/Multiple
       Test Case:                           |       Difficulty:  Unknown
        Blocking:                           |       Blocked By:
 Related Tickets:                           |
--------------------------------------------+------------------------------

Comment (by GordonBGood):

Replying to [comment:9 tibbe]:
> Replying to [comment:8 GordonBGood]:
> > I only referred to LLVM as proof that the problem seems to be limited to the NCG, since both the NCG and LLVM backends share the same C-- output (or at least I think so?), yet the NCG shows this step backwards whereas LLVM does not.
>
> That's a good indication that the NCG is to blame, but it's also possible that the Cmm codegen has regressed and some LLVM optimizations make up for the regression.
Alright, as requested by "ezyang" earlier, I added the -ddump-cmm and -ddump-opt-cmm switches to the compilation, with the following results.
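For reference, the invocation was along these lines; the module name and optimization level are placeholders for the actual benchmark's, the dump flags are the point:

{{{
# Illustrative invocation only: Primes.hs and -O2 are placeholders.
# -fforce-recomp just makes sure the dumps are regenerated.
ghc -O2 -fforce-recomp -ddump-cmm -ddump-opt-cmm -ddump-to-file Primes.hs
}}}

With -ddump-to-file the dumps land in files like Primes.dump-cmm and Primes.dump-opt-cmm rather than on stdout.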
For version 7.6.3, the Cmm code for the loops looks like this:

{{{
s1Hq_ret()
        { label: s1Hq_info
          rep:StackRep [True, False, True, True, True]
        }
    c1XX:
        _c1Tr::I32 = %MO_S_Gt_W32(R1, I32[Sp + 4]);
        ;
        if (_c1Tr::I32 >= 1) goto c1XZ;
        _s1H9::I32 = %MO_S_Shr_W32(R1, 5);
        _s1He::I32 = I32[I32[Sp + 8] + 8 + (_s1H9::I32 << 2)];
        _s1KJ::I32 = R1;
        _s1Hh::I32 = _s1KJ::I32 & 31;
        _s1Hj::I32 = _s1Hh::I32;
        _s1KI::I32 = 1 << _s1Hj::I32;
        _s1Hm::I32 = _s1KI::I32 ^ 18446744073709551615;
        _s1KH::I32 = _s1He::I32 & _s1Hm::I32;
        I32[I32[Sp + 8] + 8 + (_s1H9::I32 << 2)] = _s1KH::I32;
        _s1KG::I32 = R1 + 3;
        R1 = _s1KG::I32;
        jump s1Hq_info; // [R1]
    c1XZ:
        R1 = 1;
        jump s1HY_info; // [R1]
},
}}}

and the opt-cmm code for about the same area looks like this:

{{{
s1Hq_ret()
        { Just s1Hq_info:
          const 933;
          const 32;
        }
    c1XX:
        ;
        if (%MO_S_Gt_W32(R1, I32[Sp + 4])) goto c1XZ;
        _s1H9::I32 = %MO_S_Shr_W32(R1, 5);
        I32[I32[Sp + 8] + ((_s1H9::I32 << 2) + 8)] =
            I32[I32[Sp + 8] + ((_s1H9::I32 << 2) + 8)]
                & (1 << R1 & 31) ^ 18446744073709551615;
        R1 = R1 + 3;
        jump s1Hq_info; // [R1]
    c1XZ:
        R1 = 1;
        jump s1HY_info; // [R1]
}
}}}

For version 7.8.1 RC2, the Cmm code is much larger, with roughly eight times the number of lines, and looks like this:

{{{
    c3jO:
        _c3jQ::I32 = %MO_S_Gt_W32(_s33c::I32, _s32t::I32);
        _s33e::I32 = _c3jQ::I32;
        if (_s33e::I32 >= 1) goto c3jY; else goto c3jZ;
    c3jY:
        _s32S::I32 = _s335::I32;
        goto c3gp;
    c3jZ:
        _c3kn::I32 = %MO_S_Shr_W32(_s33c::I32, 5);
        _s33g::I32 = _c3kn::I32;
        _s33j::I32 = I32[(_s327::P32 + 8) + (_s33g::I32 << 2)];
        _s33j::I32 = _s33j::I32;
        _c3kq::I32 = _s33c::I32;
        _s33k::I32 = _c3kq::I32;
        _c3kt::I32 = _s33k::I32 & 31;
        _s33l::I32 = _c3kt::I32;
        _c3kw::I32 = _s33l::I32;
        _s33m::I32 = _c3kw::I32;
        _c3kz::I32 = 1 << _s33m::I32;
        _s33n::I32 = _c3kz::I32;
        _c3kC::I32 = _s33n::I32 ^ 4294967295;
        _s33o::I32 = _c3kC::I32;
        _c3kF::I32 = _s33j::I32 & _s33o::I32;
        _s33p::I32 = _c3kF::I32;
        I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] = _s33p::I32;
        _c3kK::I32 = _s33c::I32 + _s32U::I32;
        _s33r::I32 = _c3kK::I32;
        _s33c::I32 = _s33r::I32;
        goto c3jO;
}}}

and the optimized opt-cmm code looks like this:

{{{
    c3jO:
        if (%MO_S_Gt_W32(_s33c::I32, _s32t::I32)) goto c3jY; else goto c3jZ;
    c3jY:
        Sp = Sp + 8;
        _s32S::I32 = _s335::I32;
        goto c3gp;
    c3jZ:
        _s33g::I32 = %MO_S_Shr_W32(_s33c::I32, 5);
        I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] =
            I32[(_s327::P32 + 8) + (_s33g::I32 << 2)]
                & (1 << _s33c::I32 & 31) ^ 4294967295;
        _s33c::I32 = _s33c::I32 + _s32U::I32;
        goto c3jO;
}}}

It appears to me that the opt-cmm code is about the same in both versions, but the plain Cmm dump has regressed, losing even the basic cleanups that were present in the older version. It may be that the NCG uses the non-optimized Cmm as its source while the LLVM backend uses the optimized Cmm, which would explain why the LLVM-generated code is still efficient where the NCG-generated code is not.
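For context, every dump above boils down to the same read-modify-write loop: clear bit i & 31 of word i >> 5 in a packed array, then advance i by a constant stride until it passes a limit. Here is a minimal Haskell sketch of such a loop; the name cull, the argument layout, and the use of STUArray are my own illustration, not the actual benchmark source:

{{{
import Control.Monad.ST (ST)
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST (STUArray)
import Data.Bits ((.&.), complement, shiftL, shiftR)

-- Clear the bit for i, i+p, i+2p, ... up to lim in a packed bit array.
-- The word index (i >> 5) and bit mask (1 << (i .&. 31)) correspond
-- directly to the shifts and masks visible in the Cmm dumps above.
cull :: STUArray s Int Int -> Int -> Int -> Int -> ST s ()
cull arr lim p = go
  where
    go i
      | i > lim   = return ()                 -- the %MO_S_Gt_W32 exit test
      | otherwise = do
          let w = i `shiftR` 5                -- word index
          v <- unsafeRead arr w
          unsafeWrite arr w (v .&. complement (1 `shiftL` (i .&. 31)))
          go (i + p)                          -- constant-stride step
}}}

Measured against this, the 7.6.3 plain Cmm is already close to minimal, while the 7.8.1 plain Cmm spends most of its extra lines shuffling values between temporaries around the very same read-modify-write.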
Anticipating your next request to look at the STG output, I used the -ddump-stg compiler option to examine that code as well. It is too unwieldy to post here (wordy, with very long lines), but a quick examination shows it to be about the same for both versions, except that internal constants are recorded as 32-bit numbers in 7.8.1, whereas 7.6.3 recorded them as 64-bit numbers even when they refer to 32-bit registers; this matches the way the constants appear in each version's optimized and non-optimized Cmm dumps above. Thus the bug/regression appears to go further back than just the new NCG (which is likely using the non-optimized Cmm as input): the Cmm code generator itself is also producing much less efficient Cmm code.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:10
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler