 
#8971: Native Code Generator 7.8.1 RC2 is not as optimized as 7.6.3...
--------------------------------------------+------------------------------
        Reporter:  GordonBGood              |             Owner:
            Type:  bug                      |            Status:  new
        Priority:  normal                   |         Milestone:
       Component:  Compiler (NCG)           |           Version:  7.8.1-rc2
      Resolution:                           |          Keywords:
Operating System:  Unknown/Multiple         |      Architecture:
 Type of failure:  Runtime performance bug  |  Unknown/Multiple
       Test Case:                           |        Difficulty:  Unknown
        Blocking:                           |        Blocked By:
                                            |   Related Tickets:
--------------------------------------------+------------------------------

Comment (by GordonBGood):

Replying to [comment:9 tibbe]:
> Replying to [comment:8 GordonBGood]:
> > I only referred to LLVM as proof that the problem seems to be limited
> > to NCG as both NCG and LLVM will share the same C-- output (or at least I
> > think so???) yet NCG shows this step backwards whereas LLVM does not.
>
> That's a good indication that the NCG is to blame, but it's also
possible that the Cmm codegen has regressed but some LLVM optimizations make up for the regression.

Alright, as requested by "ezyang" earlier, I added the -ddump-cmm and -ddump-opt-cmm switches to the compilation, with the following results:

For version 7.6.3, the CMM code for the loops looks like this:

{{{
s1Hq_ret()
        { label: s1Hq_info
          rep:StackRep [True, False, True, True, True]
        }
    c1XX:
        _c1Tr::I32 = %MO_S_Gt_W32(R1, I32[Sp + 4]);
        ;
        if (_c1Tr::I32 >= 1) goto c1XZ;
        _s1H9::I32 = %MO_S_Shr_W32(R1, 5);
        _s1He::I32 = I32[I32[Sp + 8] + 8 + (_s1H9::I32 << 2)];
        _s1KJ::I32 = R1;
        _s1Hh::I32 = _s1KJ::I32 & 31;
        _s1Hj::I32 = _s1Hh::I32;
        _s1KI::I32 = 1 << _s1Hj::I32;
        _s1Hm::I32 = _s1KI::I32 ^ 18446744073709551615;
        _s1KH::I32 = _s1He::I32 & _s1Hm::I32;
        I32[I32[Sp + 8] + 8 + (_s1H9::I32 << 2)] = _s1KH::I32;
        _s1KG::I32 = R1 + 3;
        R1 = _s1KG::I32;
        jump s1Hq_info; // [R1]
    c1XZ:
        R1 = 1;
        jump s1HY_info; // [R1]
},
}}}

and the opt-cmm code for about the same area looks like this:

{{{
s1Hq_ret()
        { Just s1Hq_info:
              const 933;
              const 32;
        }
    c1XX:
        ;
        if (%MO_S_Gt_W32(R1, I32[Sp + 4])) goto c1XZ;
        _s1H9::I32 = %MO_S_Shr_W32(R1, 5);
        I32[I32[Sp + 8] + ((_s1H9::I32 << 2) + 8)] = I32[I32[Sp + 8] + ((_s1H9::I32 << 2) + 8)] & (1 << R1 & 31) ^ 18446744073709551615;
        R1 = R1 + 3;
        jump s1Hq_info; // [R1]
    c1XZ:
        R1 = 1;
        jump s1HY_info; // [R1]
}
}}}

For version 7.8.1 RC2, the cmm dump is about eight times larger (about eight times the number of lines), and the loop looks like this:

{{{
c3jO:
    _c3jQ::I32 = %MO_S_Gt_W32(_s33c::I32, _s32t::I32);
    _s33e::I32 = _c3jQ::I32;
    if (_s33e::I32 >= 1) goto c3jY; else goto c3jZ;
c3jY:
    _s32S::I32 = _s335::I32;
    goto c3gp;
c3jZ:
    _c3kn::I32 = %MO_S_Shr_W32(_s33c::I32, 5);
    _s33g::I32 = _c3kn::I32;
    _s33j::I32 = I32[(_s327::P32 + 8) + (_s33g::I32 << 2)];
    _s33j::I32 = _s33j::I32;
    _c3kq::I32 = _s33c::I32;
    _s33k::I32 = _c3kq::I32;
    _c3kt::I32 = _s33k::I32 & 31;
    _s33l::I32 = _c3kt::I32;
    _c3kw::I32 = _s33l::I32;
    _s33m::I32 = _c3kw::I32;
    _c3kz::I32 = 1 << _s33m::I32;
    _s33n::I32 = _c3kz::I32;
    _c3kC::I32 = _s33n::I32 ^ 4294967295;
    _s33o::I32 = _c3kC::I32;
    _c3kF::I32 = _s33j::I32 & _s33o::I32;
    _s33p::I32 = _c3kF::I32;
    I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] = _s33p::I32;
    _c3kK::I32 = _s33c::I32 + _s32U::I32;
    _s33r::I32 = _c3kK::I32;
    _s33c::I32 = _s33r::I32;
    goto c3jO;
}}}

and the optimized opt-cmm code looks like this:

{{{
c3jO:
    if (%MO_S_Gt_W32(_s33c::I32, _s32t::I32)) goto c3jY; else goto c3jZ;
c3jY:
    Sp = Sp + 8;
    _s32S::I32 = _s335::I32;
    goto c3gp;
c3jZ:
    _s33g::I32 = %MO_S_Shr_W32(_s33c::I32, 5);
    I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] = I32[(_s327::P32 + 8) + (_s33g::I32 << 2)] & (1 << _s33c::I32 & 31) ^ 4294967295;
    _s33c::I32 = _s33c::I32 + _s32U::I32;
    goto c3jO;
}}}

It appears to me that the opt-cmm code is about the same in both versions, but the straight cmm dump has regressed and has lost even the basic optimizations that were there in the older version. It may be that the NCG is using the non-optimized version of CMM as its source whereas the LLVM generator is using the optimized CMM version, which would explain why code generated by the LLVM backend is still efficient while NCG-generated code is not.

Anticipating your next request to look at the STG output, I used the -ddump-stg compiler option to examine that code. The code is too difficult to post here (wordy, with very long lines), but a quick examination shows it to be about the same for both versions, except that internal constants are recorded as 32-bit numbers in 7.8.1, whereas in the older 7.6.3 code they were recorded as 64-bit numbers even when referring to 32-bit registers. This corresponds to the way the constants are recorded in both the optimized and non-optimized CMM dumps for each version, as listed above.

Thus, the bug/regression appears to go further back than just the new NCG (which is likely using the non-optimized CMM code as input) and also reaches the CMM code generator, in that it is producing much less efficient CMM code.
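
For readers who do not want to decode the dumps, here is a rough Haskell sketch of what the loop in the Cmm above appears to compute: clear bit `i` in a packed array of 32-bit words, then advance `i` by a stride until it passes a limit. This is not the ticket's test program; the names `cullStep`, `cullLoop` and `sameMaskAt32` are made up for illustration, and the use of `STUArray` is an assumption. The last function checks that the two all-ones constants seen in the dumps (18446744073709551615 in 7.6.3, 4294967295 in 7.8.1) behave the same once the result is truncated to 32 bits.

```haskell
import Control.Monad.ST (ST)
import Data.Array.ST (STUArray, newArray, readArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, elems)
import Data.Bits (complement, shiftL, shiftR, xor, (.&.))
import Data.Word (Word32, Word64)

-- Hypothetical helper: one pass through the loop body.
-- Mirrors the data flow of the non-optimized dump:
--   word index = i >> 5, bit index = i & 31, new = old & ~(1 << bit)
cullStep :: STUArray s Int Word32 -> Int -> ST s ()
cullStep arr i = do
  let w = i `shiftR` 5
  old <- readArray arr w
  writeArray arr w (old .&. complement (1 `shiftL` (i .&. 31)))

-- Hypothetical driver: start with all bits set, then clear every
-- `step`-th bit from `start` up to and including `limit`.
-- Assumes step > 0 and that `limit` fits inside `nWords` words.
cullLoop :: Int -> Int -> Int -> Int -> UArray Int Word32
cullLoop nWords start step limit = runSTUArray $ do
  arr <- newArray (0, nWords - 1) maxBound
  let go i
        | i > limit = return ()
        | otherwise = cullStep arr i >> go (i + step)
  go start
  return arr

-- XOR with the 64-bit all-ones constant, truncated to 32 bits, gives the
-- same word as XOR with the 32-bit all-ones constant.
sameMaskAt32 :: Word32 -> Bool
sameMaskAt32 x =
  fromIntegral ((fromIntegral x :: Word64) `xor` 18446744073709551615)
    == x `xor` (4294967295 :: Word32)
```

Loaded into GHCi, `elems (cullLoop 1 0 3 31)` shows the single resulting word for a stride of 3, and `all sameMaskAt32 [0, 1, maxBound]` checks the mask equivalence on a few values. If that equivalence is all there is to it, the different constant width between the two versions is cosmetic and is unlikely to be the cause of the slowdown by itself.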
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8971#comment:10
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler