
On current HEAD with llvm 3.9.0, the following code:
... code skipped for brevity...
The CMM and initial llvm code is the same as your code for 8.0.1, so it seems the difference is due to the fact that rust ships with its own more recent llvm than ghc 8.0.1 supports.
The difference in the code between my version and your original version is that we force `twos` early which is needed to prevent the evaluation of
#12798: LLVM seeming to over optimize, producing inefficient assembly code... -------------------------------------+------------------------------------- Reporter: GordonBGood | Owner: Type: bug | Status: new Priority: normal | Milestone: 8.2.1 Component: Compiler (LLVM) | Version: 8.0.1 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by GordonBGood): @AlexET, Replying to [comment:1 AlexET]: that within the loop, an optimisation which seems to have been missed by HEAD. It's even worse with GHC version 8.0.1 than I thought:, with the very minor change of the inner loop to as follows: {{{ let cull j = -- very tight inner loop where all the time is spent if j > bfLmt then return () else do let sh = unsafeAt twos (j .&. 31) let w = j `shiftR` 5 ov <- unsafeRead cmpstsw w unsafeWrite cmpstsw w (ov .|. sh) -- (1 `shiftL` (j .&. 31))) cull (j + p) in do { cull s; cullp (i + 1) } in cullp 0 }}} only changed so the loop back for the next prime value is outside the `cull` loop, the assembly code is as follows: {{{ .align 16, 0x90 .LBB33_3: # %caLP.i.caLP.i_crit_edge # in Loop: Header=BB33_2 Depth=1 movq (%r12), %rdx .LBB33_2: # %caLP.i # =>This Inner Loop Header: Depth=1 movq %rsi, %rcx movl %ecx, %edi movq %rdx, %rsi addq %rcx, %rsi sarq $5, %rcx movq -24(%r12), %rdx movq -16(%r12), %rbx andl $31, %edi movl 16(%rbx,%rdi,4), %edi orl %edi, 16(%rdx,%rcx,4) cmpq -8(%r12), %rsi jle .LBB33_3 }}} with three register reloads in the loop and another version in the code also reloading the `p` increment value for a total of four reloads. I can't see that the code generated should be any different then for the original ticket code. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/12798#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler