
#9041: NCG generates slow loop code -------------------------+------------------------------------------------- Reporter: nh2 | Owner: Type: bug | Status: new Priority: | Milestone: normal | Version: 7.6.3 Component: | Operating System: Unknown/Multiple Compiler (NCG) | Type of failure: Compile-time performance bug Keywords: | Test Case: Architecture: | Blocking: Unknown/Multiple | Difficulty: | Unknown | Blocked By: | Related Tickets: | -------------------------+------------------------------------------------- In http://stackoverflow.com/a/23322255/263061 we compile the code {{{ main :: IO () main = do loop (maxBound :: Word32) $ \i -> do when (i `rem` 100000000 == 0) $ putStrLn "foo" }}} and compare the assembly generated by the NCG and `-fllvm`. The LLVM code is 10 times faster. While the LLVM code looks great, NCG one generated looks quite unfortunate and messy, with a lot of unnecessary `mov`ing around. I am by no means an expert in this, but some of it looks like low hanging potential improvements. This is the '''LLVM code''': {{{ Main_zdwa_info: .LBB1_1: movl $4294967295, %esi /* load the loop bound */ movabsq $-6067343680855748867, %rdi /* load a magic number for the modulus */ jmp .LBB1_2 .LBB1_4: incl %ecx /* loop core start */ .LBB1_2: cmpq %rsi, %rcx je .LBB1_6 /* check loop bound */ /* do the modulus with two multiplications, a shift and a magic number */ /* note : gcc does the same reduction */ movq %rcx, %rax mulq %rdi shrq $26, %rdx imulq $100000000, %rdx, %rax cmpq %rax, %rcx jne .LBB1_4 /* loop core end */ /* Code omitted: print, then return to loop beginning */ .LBB1_6: }}} So the core of the loop is a nice and short: {{{ incl cmpq je movq [division replaced by magic multiplication] mulq shrq imulq cmpq jne }}} Now what '''NCG''' produces: {{{ Main_zdwa_info: /* loop core start */ .Lc3JD: leaq -16(%rbp),%rax /* stack check */ cmpq %r15,%rax jb .Lc3JO /* jump not taken in the normal case */ .Lc3JP: movl $4294967295,%eax /* issue: loading the bound on every iteration */ cmpq %rax,%r14 jne .Lc3JB .Lc3JC: /* Return from main. Code omitted */ .Lc3JB: /* test the index for modulus */ movl $100000000,%eax /* issue: unnecessary moves */ movq %rax,%rbx /* and loading the modulo arg on every iteration */ movq %r14,%rax xorq %rdx,%rdx /* shorter alternative to mov $0,%rdx */ divq %rbx /* issue: doing the division (llvm and gcc avoid this) */ testq %rdx,%rdx jne .Lc3JU /* This jump is usually taken. */ .Lc3JV: /* do the printing. Code omitted. Not relevant */ .Lc3JN: /* increment index and (I guess) restore registers messed up by the printing */ movq 8(%rbp),%rax /* This code is not very relevant because we don't almost never */ incq %rax movl %eax,%r14d addq $16,%rbp jmp Main_zdwa_info .Lc3JU: leaq 1(%r14),%rax /*issue: why not just increment r14? */ movl %eax,%r14d jmp Main_zdwa_info }}} So the core of this loop is {{{ movl [stack check] cmpq jne movl movq movq xorq divq testq jne leaq movl jmp }}} Some issues here: * the stack check is inside the loop * the loop limit is loaded every time inside the loop * the modulo arg is loaded every time inside the loop * we are shuffling around registers * no strength reduction (div is not replaced by cheaper magic number imul) * weird increment using leaq It would be great if somebody with codegen insight could comment on whether these are easy targets for improvement! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9041 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler