[GHC] #9041: NCG generates slow loop code

#9041: NCG generates slow loop code -------------------------+------------------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: | Milestone: normal | Version: 7.6.3 Component: | Operating System: Unknown/Multiple Compiler (NCG) | Type of failure: Compile-time performance bug Keywords: | Test Case: Architecture: | Blocking: Unknown/Multiple | Difficulty: | Unknown | Blocked By: | Related Tickets: | -------------------------+------------------------------------------------- In http://stackoverflow.com/a/23322255/263061 we compile the code {{{ main :: IO () main = do loop (maxBound :: Word32) $ \i -> do when (i `rem` 100000000 == 0) $ putStrLn "foo" }}} and compare the assembly generated by the NCG and `-fllvm`. The LLVM code is 10 times faster. While the LLVM code looks great, NCG one generated looks quite unfortunate and messy, with a lot of unnecessary `mov`ing around. I am by no means an expert in this, but some of it looks like low hanging potential improvements. This is the '''LLVM code''': {{{ Main_zdwa_info: .LBB1_1: movl $4294967295, %esi /* load the loop bound */ movabsq $-6067343680855748867, %rdi /* load a magic number for the modulus */ jmp .LBB1_2 .LBB1_4: incl %ecx /* loop core start */ .LBB1_2: cmpq %rsi, %rcx je .LBB1_6 /* check loop bound */ /* do the modulus with two multiplications, a shift and a magic number */ /* note : gcc does the same reduction */ movq %rcx, %rax mulq %rdi shrq $26, %rdx imulq $100000000, %rdx, %rax cmpq %rax, %rcx jne .LBB1_4 /* loop core end */ /* Code omitted: print, then return to loop beginning */ .LBB1_6: }}} So the core of the loop is a nice and short: {{{ incl cmpq je movq [division replaced by magic multiplication] mulq shrq imulq cmpq jne }}} Now what '''NCG''' produces: {{{ Main_zdwa_info: /* loop core start */ .Lc3JD: leaq -16(%rbp),%rax /* stack check */ cmpq %r15,%rax jb .Lc3JO /* jump not taken in the normal case */ .Lc3JP: movl $4294967295,%eax /* issue: loading the bound on every iteration */ cmpq %rax,%r14 jne .Lc3JB .Lc3JC: /* Return from main. Code omitted */ .Lc3JB: /* test the index for modulus */ movl $100000000,%eax /* issue: unnecessary moves */ movq %rax,%rbx /* and loading the modulo arg on every iteration */ movq %r14,%rax xorq %rdx,%rdx /* shorter alternative to mov $0,%rdx */ divq %rbx /* issue: doing the division (llvm and gcc avoid this) */ testq %rdx,%rdx jne .Lc3JU /* This jump is usually taken. */ .Lc3JV: /* do the printing. Code omitted. Not relevant */ .Lc3JN: /* increment index and (I guess) restore registers messed up by the printing */ movq 8(%rbp),%rax /* This code is not very relevant because we don't almost never */ incq %rax movl %eax,%r14d addq $16,%rbp jmp Main_zdwa_info .Lc3JU: leaq 1(%r14),%rax /*issue: why not just increment r14? */ movl %eax,%r14d jmp Main_zdwa_info }}} So the core of this loop is {{{ movl [stack check] cmpq jne movl movq movq xorq divq testq jne leaq movl jmp }}} Some issues here: * the stack check is inside the loop * the loop limit is loaded every time inside the loop * the modulo arg is loaded every time inside the loop * we are shuffling around registers * no strength reduction (div is not replaced by cheaper magic number imul) * weird increment using leaq It would be great if somebody with codegen insight could comment on whether these are easy targets for improvement! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9041 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9041: NCG generates slow loop code -------------------------------------------------+------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Compile-time performance bug | Unknown/Multiple Test Case: | Difficulty: Blocking: | Unknown | Blocked By: | Related Tickets: -------------------------------------------------+------------------------- Changes (by carter): * cc: carter.schonwald@… (added) Comment: Some of these match what are known issues with the ncg. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9041#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9041: NCG generates slow loop code -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Changes (by thomie): * failure: Compile-time performance bug => Runtime performance bug Comment: Grouping this with the other 165 [https://ghc.haskell.org/trac/ghc/query?status=!closed&failure=Runtime+performance+bug runtime performance bugs]. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9041#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#9041: NCG generates slow loop code -------------------------------------+------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler (NCG) | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by trommler): * cc: trommler (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9041#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC