Re: [GHC] #12798: LLVM seeming to over optimize, producing inefficient assembly code...

22 Nov 2016

#12798: LLVM seeming to over optimize, producing inefficient assembly code...
-------------------------------------+-------------------------------------
        Reporter:  GordonBGood       |                Owner:
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:  8.2.1
       Component:  Compiler (LLVM)   |              Version:  8.0.1
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by GordonBGood):

 @AlexET, Replying to [comment:1 AlexET]:
 > On current HEAD with llvm 3.9.0, the following code:
 > ... code skipped for brevity...
 > The CMM and initial llvm code is the same as your code for 8.0.1, so it
 > seems the difference is due to the fact that rust ships with its own more
 > recent llvm than ghc 8.0.1 supports.
 > The difference in the code between my version and your original version
 > is that we force `twos` early which is needed to prevent the evaluation of
 > that within the loop, an optimisation which seems to have been missed by
 > HEAD.

 It's even worse with GHC version 8.0.1 than I thought: with the very
 minor change of the inner loop as follows:

 {{{
                   let cull j = -- very tight inner loop where all the time is spent
                         if j > bfLmt then return () else do
                           let sh = unsafeAt twos (j .&. 31)
                           let w = j `shiftR` 5
                           ov <- unsafeRead cmpstsw w
                           unsafeWrite cmpstsw w (ov .|. sh) -- (1 `shiftL` (j .&. 31)))
                           cull (j + p) in do { cull s; cullp (i + 1) } in cullp 0
 }}}
 The only change is that the loop back for the next prime value is now
 outside the `cull` loop; the assembly code is as follows:

 {{{
         .align  16, 0x90
 .LBB33_3:                               # %caLP.i.caLP.i_crit_edge
                                         #   in Loop: Header=BB33_2 Depth=1
         movq    (%r12), %rdx
 .LBB33_2:                               # %caLP.i
                                         # =>This Inner Loop Header: Depth=1
         movq    %rsi, %rcx
         movl    %ecx, %edi
         movq    %rdx, %rsi
         addq    %rcx, %rsi
         sarq    $5, %rcx
         movq    -24(%r12), %rdx
         movq    -16(%r12), %rbx
         andl    $31, %edi
         movl    16(%rbx,%rdi,4), %edi
         orl     %edi, 16(%rdx,%rcx,4)
         cmpq    -8(%r12), %rsi
         jle     .LBB33_3
 }}}
 with three register reloads in the loop, and another version in the code
 also reloading the `p` increment value, for a total of four reloads.  I
 can't see why the generated code should be any different than for the
 original ticket code.
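
 For anyone wanting to reproduce this, here is a minimal self-contained
 sketch of the kind of odds-only, bit-packed sieve the fragment above comes
 from. It is not the ticket's actual program: the function names `sieve` and
 `countPrimesTo`, the mapping of bit `j` to the odd number `2*j + 3`, and the
 prime-counting wrapper are assumptions made for illustration. Only
 the `cull` loop body, the `twos` lookup table, and the names `cmpstsw`,
 `bfLmt`, `cullp`, `p` and `s` follow the fragment. `twos` is forced before
 the loops are entered, as suggested in comment:1.

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.Array.Base (unsafeAt, unsafeRead, unsafeWrite)
import Data.Array.ST (newArray, runSTUArray)
import Data.Array.Unboxed (UArray, elems, listArray)
import Data.Bits (popCount, shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word32)

-- Lookup table of the 32 single-bit masks, standing in for the ticket's `twos`.
twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

-- Assumed layout: bit j of the packed array represents the odd number 2*j + 3,
-- and a set bit marks a composite. bfLmt is the highest bit index in use.
sieve :: Int -> UArray Int Word32
sieve bfLmt = runSTUArray $ do
  let !tw = twos -- force the table once, outside the loops
  cmpstsw <- newArray (0, bfLmt `shiftR` 5) 0
  let cullp i
        | s > bfLmt = return () -- p*p is past the limit: sieving is done
        | otherwise = do
            v <- unsafeRead cmpstsw (i `shiftR` 5)
            if v .&. unsafeAt tw (i .&. 31) /= 0
              then cullp (i + 1) -- 2*i + 3 is already marked composite
              else do
                let cull j -- tight inner loop, as in the fragment
                      | j > bfLmt = return ()
                      | otherwise = do
                          let sh = unsafeAt tw (j .&. 31)
                              w = j `shiftR` 5
                          ov <- unsafeRead cmpstsw w
                          unsafeWrite cmpstsw w (ov .|. sh)
                          cull (j + p)
                cull s
                cullp (i + 1)
        where
          p = i + i + 3 -- the odd number for bit i
          s = (p * p - 3) `shiftR` 1 -- bit index of p*p
  cullp 0
  return cmpstsw

-- Count the primes up to and including n: the prime 2, plus every unset bit
-- in the range 0..bfLmt. Bits beyond bfLmt are never set, so subtracting the
-- total popCount from the number of bits in use is enough.
countPrimesTo :: Int -> Int
countPrimesTo n
  | n < 2 = 0
  | n < 3 = 1
  | otherwise = 1 + (bfLmt + 1) - sum (map popCount (elems (sieve bfLmt)))
  where
    bfLmt = (n - 3) `shiftR` 1

main :: IO ()
main = print (countPrimesTo 100) -- prints 25
```

 Compiling something like this with `-O2 -fllvm -keep-s-files` and looking
 for the `cull` loop in the resulting `.s` file should show whether the
 loop limit, the two array base addresses and `p` stay in registers or are
 reloaded on every iteration. The output may well differ from the listing
 above depending on the GHC and LLVM versions in use.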

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12798#comment:3
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler