
#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by bgamari: @@ -1,4 +1,4 @@ - When characterizing `bytestring`'s `Builder` [[ - https://github.com/kolmodin/binary/pull/65|interface]] I noticed that some - benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 in - chunks of 16`) perform inexplicably much worse than others. A glance at + When characterizing `bytestring`'s `Builder` + [[https://github.com/kolmodin/binary/pull/65|interface]] I noticed that + some benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 + in chunks of 16`) perform inexplicably much worse than others. A glance at New description: When characterizing `bytestring`'s `Builder` [[https://github.com/kolmodin/binary/pull/65|interface]] I noticed that some benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 in chunks of 16`) perform inexplicably much worse than others. A glance at the assembly revealed that a large number of `Word8`s are being evaluated and saved to registers, only to be later stored. E.g., {{{#!asm # First we evaluate a number of Word8s, saving them in registers movzbl %bl,%ebx leaq 1(%r14),%rcx movzbl %cl,%ecx leaq 2(%r14),%rdx movzbl %dl,%edx leaq 3(%r14),%rsi movzbl %sil,%esi leaq 4(%r14),%r9 movzbl %r9b,%r9d leaq 5(%r14),%r10 movzbl %r10b,%r10d ... # Eventually we run out of registers and start spilling to the stack movq %rax,64(%rsp) leaq 7(%r14),%rax movzbl %al,%eax movq %rbx,72(%rsp) ... # Only after evaluating all of the needed words do we actually consume # them movq %rax,-152(%r12) movq 72(%rsp),%rax movq %rax,-144(%r12) movq 80(%rsp),%rax movq %rax,-136(%r12) movq 88(%rsp),%rax movq %rax,-128(%r12) ... movq %rsi,-56(%r12) movq %r9,-48(%r12) movq %r10,-40(%r12) movq %r11,-32(%r12) movq %r14,-24(%r12) }}} This is due to the fact that the `Word`s are bound outside of a case analysis and GHC is reluctant to push them inside of the branches. The float-in pass will only float a binding into a case if the value is "small" and at least one branch doesn't use the binding. Unfortunately the case expression in question has only two branches. This is demonstrated in the attached testcase. Also of interest, https://www.haskell.org/pipermail/ghc- devs/2015-January/007997.html -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:15 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler