[GHC] #10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: Runtime Unknown/Multiple | performance bug Test Case: | Blocked By: Blocking: | Related Tickets: Differential Revisions: | -------------------------------------+------------------------------------- When characterizing bytestring's `Builder` interface[1] I noticed that some benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 in chunks of 16`) perform inexplicably much worse than others. A glance at the assembly revealed that a large number of `Word8`s are being evaluated and saved to registers, only to be later stored. E.g., ```#!asm # First we evaluate a number of Word8s, saving them in registers movzbl %bl,%ebx leaq 1(%r14),%rcx movzbl %cl,%ecx leaq 2(%r14),%rdx movzbl %dl,%edx leaq 3(%r14),%rsi movzbl %sil,%esi leaq 4(%r14),%r9 movzbl %r9b,%r9d leaq 5(%r14),%r10 movzbl %r10b,%r10d ... # Eventually we run out of registers and start spilling to the stack movq %rax,64(%rsp) leaq 7(%r14),%rax movzbl %al,%eax movq %rbx,72(%rsp) ... # Only after evaluating all of the needed words do we actually consume # them movq %rax,-152(%r12) movq 72(%rsp),%rax movq %rax,-144(%r12) movq 80(%rsp),%rax movq %rax,-136(%r12) movq 88(%rsp),%rax movq %rax,-128(%r12) ... movq %rsi,-56(%r12) movq %r9,-48(%r12) movq %r10,-40(%r12) movq %r11,-32(%r12) movq %r14,-24(%r12) ``` This is due to the fact that the `Word`s are bound outside of a case analysis and GHC is reluctant to push them inside of the branches. The float-in pass will only float a binding into a case if the value is "small" and at least one branch doesn't use the binding. Unfortunately the case expression in question has only two branches. This is demonstrated in the attached testcase. [1] https://github.com/kolmodin/binary/pull/65 [2] https://www.haskell.org/pipermail/ghc-devs/2015-January/007997.html -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Description changed by bgamari: Old description:
When characterizing bytestring's `Builder` interface[1] I noticed that some benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 in chunks of 16`) perform inexplicably much worse than others. A glance at the assembly revealed that a large number of `Word8`s are being evaluated and saved to registers, only to be later stored. E.g.,
```#!asm # First we evaluate a number of Word8s, saving them in registers movzbl %bl,%ebx leaq 1(%r14),%rcx movzbl %cl,%ecx leaq 2(%r14),%rdx movzbl %dl,%edx leaq 3(%r14),%rsi movzbl %sil,%esi leaq 4(%r14),%r9 movzbl %r9b,%r9d leaq 5(%r14),%r10 movzbl %r10b,%r10d ... # Eventually we run out of registers and start spilling to the stack movq %rax,64(%rsp) leaq 7(%r14),%rax movzbl %al,%eax movq %rbx,72(%rsp) ...
# Only after evaluating all of the needed words do we actually consume # them movq %rax,-152(%r12) movq 72(%rsp),%rax movq %rax,-144(%r12) movq 80(%rsp),%rax movq %rax,-136(%r12) movq 88(%rsp),%rax movq %rax,-128(%r12) ... movq %rsi,-56(%r12) movq %r9,-48(%r12) movq %r10,-40(%r12) movq %r11,-32(%r12) movq %r14,-24(%r12)
```
This is due to the fact that the `Word`s are bound outside of a case analysis and GHC is reluctant to push them inside of the branches. The float-in pass will only float a binding into a case if the value is "small" and at least one branch doesn't use the binding. Unfortunately the case expression in question has only two branches.
This is demonstrated in the attached testcase.
[1] https://github.com/kolmodin/binary/pull/65 [2] https://www.haskell.org/pipermail/ghc-devs/2015-January/007997.html
New description: When characterizing bytestring's `Builder` interface[1] I noticed that some benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 in chunks of 16`) perform inexplicably much worse than others. A glance at the assembly revealed that a large number of `Word8`s are being evaluated and saved to registers, only to be later stored. E.g., ``` # First we evaluate a number of Word8s, saving them in registers movzbl %bl,%ebx leaq 1(%r14),%rcx movzbl %cl,%ecx leaq 2(%r14),%rdx movzbl %dl,%edx leaq 3(%r14),%rsi movzbl %sil,%esi leaq 4(%r14),%r9 movzbl %r9b,%r9d leaq 5(%r14),%r10 movzbl %r10b,%r10d ... # Eventually we run out of registers and start spilling to the stack movq %rax,64(%rsp) leaq 7(%r14),%rax movzbl %al,%eax movq %rbx,72(%rsp) ... # Only after evaluating all of the needed words do we actually consume # them movq %rax,-152(%r12) movq 72(%rsp),%rax movq %rax,-144(%r12) movq 80(%rsp),%rax movq %rax,-136(%r12) movq 88(%rsp),%rax movq %rax,-128(%r12) ... movq %rsi,-56(%r12) movq %r9,-48(%r12) movq %r10,-40(%r12) movq %r11,-32(%r12) movq %r14,-24(%r12) ``` This is due to the fact that the `Word`s are bound outside of a case analysis and GHC is reluctant to push them inside of the branches. The float-in pass will only float a binding into a case if the value is "small" and at least one branch doesn't use the binding. Unfortunately the case expression in question has only two branches. This is demonstrated in the attached testcase. [1] https://github.com/kolmodin/binary/pull/65 [2] https://www.haskell.org/pipermail/ghc-devs/2015-January/007997.html -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Description changed by bgamari: Old description:
When characterizing bytestring's `Builder` interface[1] I noticed that some benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 in chunks of 16`) perform inexplicably much worse than others. A glance at the assembly revealed that a large number of `Word8`s are being evaluated and saved to registers, only to be later stored. E.g.,
``` # First we evaluate a number of Word8s, saving them in registers movzbl %bl,%ebx leaq 1(%r14),%rcx movzbl %cl,%ecx leaq 2(%r14),%rdx movzbl %dl,%edx leaq 3(%r14),%rsi movzbl %sil,%esi leaq 4(%r14),%r9 movzbl %r9b,%r9d leaq 5(%r14),%r10 movzbl %r10b,%r10d ... # Eventually we run out of registers and start spilling to the stack movq %rax,64(%rsp) leaq 7(%r14),%rax movzbl %al,%eax movq %rbx,72(%rsp) ...
# Only after evaluating all of the needed words do we actually consume # them movq %rax,-152(%r12) movq 72(%rsp),%rax movq %rax,-144(%r12) movq 80(%rsp),%rax movq %rax,-136(%r12) movq 88(%rsp),%rax movq %rax,-128(%r12) ... movq %rsi,-56(%r12) movq %r9,-48(%r12) movq %r10,-40(%r12) movq %r11,-32(%r12) movq %r14,-24(%r12)
```
This is due to the fact that the `Word`s are bound outside of a case analysis and GHC is reluctant to push them inside of the branches. The float-in pass will only float a binding into a case if the value is "small" and at least one branch doesn't use the binding. Unfortunately the case expression in question has only two branches.
This is demonstrated in the attached testcase.
[1] https://github.com/kolmodin/binary/pull/65 [2] https://www.haskell.org/pipermail/ghc-devs/2015-January/007997.html
New description: When characterizing bytestring's `Builder` interface[1] I noticed that some benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 in chunks of 16`) perform inexplicably much worse than others. A glance at the assembly revealed that a large number of `Word8`s are being evaluated and saved to registers, only to be later stored. E.g., {{{#asm # First we evaluate a number of Word8s, saving them in registers movzbl %bl,%ebx leaq 1(%r14),%rcx movzbl %cl,%ecx leaq 2(%r14),%rdx movzbl %dl,%edx leaq 3(%r14),%rsi movzbl %sil,%esi leaq 4(%r14),%r9 movzbl %r9b,%r9d leaq 5(%r14),%r10 movzbl %r10b,%r10d ... # Eventually we run out of registers and start spilling to the stack movq %rax,64(%rsp) leaq 7(%r14),%rax movzbl %al,%eax movq %rbx,72(%rsp) ... # Only after evaluating all of the needed words do we actually consume # them movq %rax,-152(%r12) movq 72(%rsp),%rax movq %rax,-144(%r12) movq 80(%rsp),%rax movq %rax,-136(%r12) movq 88(%rsp),%rax movq %rax,-128(%r12) ... movq %rsi,-56(%r12) movq %r9,-48(%r12) movq %r10,-40(%r12) movq %r11,-32(%r12) movq %r14,-24(%r12) }}} This is due to the fact that the `Word`s are bound outside of a case analysis and GHC is reluctant to push them inside of the branches. The float-in pass will only float a binding into a case if the value is "small" and at least one branch doesn't use the binding. Unfortunately the case expression in question has only two branches. This is demonstrated in the attached testcase. [1] https://github.com/kolmodin/binary/pull/65 [2] https://www.haskell.org/pipermail/ghc-devs/2015-January/007997.html -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Description changed by bgamari: Old description:
When characterizing bytestring's `Builder` interface[1] I noticed that some benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 in chunks of 16`) perform inexplicably much worse than others. A glance at the assembly revealed that a large number of `Word8`s are being evaluated and saved to registers, only to be later stored. E.g.,
{{{#asm # First we evaluate a number of Word8s, saving them in registers movzbl %bl,%ebx leaq 1(%r14),%rcx movzbl %cl,%ecx leaq 2(%r14),%rdx movzbl %dl,%edx leaq 3(%r14),%rsi movzbl %sil,%esi leaq 4(%r14),%r9 movzbl %r9b,%r9d leaq 5(%r14),%r10 movzbl %r10b,%r10d ... # Eventually we run out of registers and start spilling to the stack movq %rax,64(%rsp) leaq 7(%r14),%rax movzbl %al,%eax movq %rbx,72(%rsp) ...
# Only after evaluating all of the needed words do we actually consume # them movq %rax,-152(%r12) movq 72(%rsp),%rax movq %rax,-144(%r12) movq 80(%rsp),%rax movq %rax,-136(%r12) movq 88(%rsp),%rax movq %rax,-128(%r12) ... movq %rsi,-56(%r12) movq %r9,-48(%r12) movq %r10,-40(%r12) movq %r11,-32(%r12) movq %r14,-24(%r12) }}}
This is due to the fact that the `Word`s are bound outside of a case analysis and GHC is reluctant to push them inside of the branches. The float-in pass will only float a binding into a case if the value is "small" and at least one branch doesn't use the binding. Unfortunately the case expression in question has only two branches.
This is demonstrated in the attached testcase.
[1] https://github.com/kolmodin/binary/pull/65 [2] https://www.haskell.org/pipermail/ghc-devs/2015-January/007997.html
New description: When characterizing bytestring's `Builder` interface[1] I noticed that some benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 in chunks of 16`) perform inexplicably much worse than others. A glance at the assembly revealed that a large number of `Word8`s are being evaluated and saved to registers, only to be later stored. E.g., {{{#!asm # First we evaluate a number of Word8s, saving them in registers movzbl %bl,%ebx leaq 1(%r14),%rcx movzbl %cl,%ecx leaq 2(%r14),%rdx movzbl %dl,%edx leaq 3(%r14),%rsi movzbl %sil,%esi leaq 4(%r14),%r9 movzbl %r9b,%r9d leaq 5(%r14),%r10 movzbl %r10b,%r10d ... # Eventually we run out of registers and start spilling to the stack movq %rax,64(%rsp) leaq 7(%r14),%rax movzbl %al,%eax movq %rbx,72(%rsp) ... # Only after evaluating all of the needed words do we actually consume # them movq %rax,-152(%r12) movq 72(%rsp),%rax movq %rax,-144(%r12) movq 80(%rsp),%rax movq %rax,-136(%r12) movq 88(%rsp),%rax movq %rax,-128(%r12) ... movq %rsi,-56(%r12) movq %r9,-48(%r12) movq %r10,-40(%r12) movq %r11,-32(%r12) movq %r14,-24(%r12) }}} This is due to the fact that the `Word`s are bound outside of a case analysis and GHC is reluctant to push them inside of the branches. The float-in pass will only float a binding into a case if the value is "small" and at least one branch doesn't use the binding. Unfortunately the case expression in question has only two branches. This is demonstrated in the attached testcase. [1] https://github.com/kolmodin/binary/pull/65 [2] https://www.haskell.org/pipermail/ghc-devs/2015-January/007997.html -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

I think you are talking about code like this: {{{ let t1=e1; t2=e2; ...; tn=en in if ... then do { write r1a t1; ...; write rna tn } else do { write r1b t1; ...; write rnb tn } }}} What you would prefer is to *duplicate* the ti=ei bindings, and sink
#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Comment (by simonpj): Here's some more dialogue between Ben and me: them really close to their consumers. (NB: That can sometimes *increase* register pressure if ei have lots of free variables.)
Correct?
Question: the ti=ei bindings are used in both branches of the conditional. Did they begin duplicated, and became shared by CSE? Or did
Correct. they start shared? If the latter, it becomes delicate indeed: duplicating work to (hopefully) improve (but possible worsen) register pressure seems dodgy. Unfortunately they started shared. This code is sadly a bit difficult to optimize. The case analysis in question examines whether there is enough room in a buffer to accomoate the requested data to be written; the bindings being shared are the values to be written to the buffer. If there is room in the buffer then we perform the write immediately; otherwise we return a closure so that the user can continue writing into a new buffer. This situation affects performance in a rather unfortunate way: in principle we should be able to perform better the more data we statically know needs to be written. Unfortunately, as it stands now, the more data we write, the more performance suffers due to register pressure -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * owner: => bgamari -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): I was thinking about this a bit more recently; to recap, the issue here is that GHC is unwilling to push bindings into case analyses when this would imply duplication. This leaves GHC no other option but to allocate the shared binding on the heap. To see an example of this let's look at `bytestring`'s `Builder`. The key branch here is `Data.ByteString.Builder.Internal.ensureFree`, {{{#!hs -- | The result of a build action data BuildSignal a = -- | We wrote all of the content we were asked to Done !(Ptr Word8) a -- | The buffer we were provided is full, but here's a continuation -- BuildStep to pick up where we left off | BufferFull !Int !(Ptr Word8) (BuildStep a) newtype Builder = Builder (forall r. BuildStep r -> BuildStep r) instance Monoid Builder where ... type BuildStep a = BufferRange -> IO (BuildSignal a) ensureFree :: Int -> Builder ensureFree minFree = builder step where step :: BuildStep r -> BuildStep r step k br@(BufferRange op ope) = | ope `minusPtr` op < minFree -> return $ bufferFull minFree op k | otherwise -> k br }}} The idea here is that we are filling a pre-allocated buffer (described by the `BufferRange`) with bytes; `ensureFree` verifies that the buffer has at least `minFree` free bytes remaining. The trouble comes when we write something like, {{{#!hs twoWord64s :: Word64 -> Word64 -> Builder twoWord64s a b = word64 (f a) <> word64 (f b) where -- just some cheap to evaluate function that we really don't want -- to build a thunk for f = (+1) }}} which in STG turns into something like, {{{#!hs twoWord64s' :: Word64 -> Word64 -> forall r. BuildStep r -> BufferRange -> IO (BuildSignal r) twoWord64s' a b cont br = let fa, fb :: Word64 fa = a + 1 fb = b + 1 in case br of BufferRange op ope -> case ope `minusPtr` op < minFree of True -> return $ bufferFull 16 {- bytes -} op cont False -> cont br }}} It seems that one interesting (albeit slightly inelegant) way to addressing this issue is to provide a means for the library author to indicate that a given `case` should be considered "cheap" to push through. That is, you might define `ensureFree` as, {{{#!hs ensureFree :: Int -> Builder ensureFree minFree = builder step where step k br@(BufferRange op ope) = case ope `minusPtr` op < minFree of True -> return $ bufferFull minFree op k False -> k br {-# INLINE_THROUGH #-} -- Telling GHC "it's okay if we lose sharing across -- the branches of this case, I would far prefer code duplication -- to allocation" }}} I don't believe there are too many places where this sort of pragma would be useful, but when it is needed I suspect it would be very useful indeed. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by nomeata): In comment:4 simon (or ben?) says “duplicating work”, but it seems that the question at hand is not about loss of sharing, but rather code size, right? Floating a let binding into the branches of a case does not duplicate any work, only code. A `{-# INLINE_THROUGH #-}` pragma seems to be a bit unspecific. Do we really want the compiler to inline through everything, no matter how large? Or should this just shift the heuristics a bit? But what is so special about this `case` that makes this a worthwhile thing here, and not in other cases (sic)? Maybe the heuristics can be improved in general? It seems to be more a general question of whether the user wants more speed or smaller code. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): There's something wrong with the example: far from being used in both branches, `fa` and `fb` are dead bindings. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): Replying to [comment:8 simonpj]:
There's something wrong with the example: far from being used in both branches, `fa` and `fb` are dead bindings.
Oh dear, yes, sorry about that. I was a bit hurried in submitting. This should be fixed now. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

In comment:4 simon (or ben?) says “duplicating work”, but it seems that
#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): Replying to [comment:7 nomeata]: the question at hand is not about loss of sharing, but rather code size, right? Floating a let binding into the branches of a case does not duplicate any work, only code.
Correct, the primary concern is code size.
A `{-# INLINE_THROUGH #-}` pragma seems to be a bit unspecific. Do we really want the compiler to inline through everything, no matter how large? Or should this just shift the heuristics a bit?
But what is so special about this `case` that makes this a worthwhile
Indeed it is intentionally unspecific because I haven't yet found enough other potential use-cases for such a pragma to know what semantics would make the most sense. My first inclination was to essentially render the `case` invisible to the simplifier and float freely into the branches but simply shifting the float-in heuristics would also be a reasonable option (albeit perhaps a bit harder to reason about). thing here, and not in other cases (sic)? Maybe the heuristics can be improved in general?
Unfortunately I haven't been able to pin this down very clearly beyond it occurring in an "inner loop".
It seems to be more a general question of whether the user wants more speed or smaller code.
Right, but not all case analyses are created equal: examples like this one which occur in an inner loop deserve more inlining than others, yet I'm not sure GHC is in a position to be able to discern this. This is why I proposed simply providing a tool for nudging the compiler in the right direction. Clearly following this approach of adding source-level knobs to guide arbitrary simplifier heuristics to its logical conclusion is a path to madness; unfortunately I don't see a more general way to robustly address the issue at hand. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

But what is so special about this case that makes this a worthwhile
#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): thing here, and not in other cases (sic)? Maybe the heuristics can be improved in general? I'd also like to point out that we already decide to inline the `BuildStep` `k` itself into `ensureRoom/step` (although I admittedly don't yet understand why). It seems odd that we would inline this yet not its free variables. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): To supplement the attached freestanding example, here is a smaller example depending upon `bytestring`, {{{#!hs module Hi where import qualified Data.ByteString.Builder as B import Data.Monoid import Data.Word hello :: Word64 -> Word64 -> B.Builder hello a b = B.word64LE (f a) <> B.word64LE (f b) where f = (+1) }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): Here's a very bare-bones example, {{{#!hs {-# LANGUAGE BangPatterns #-} {-# LANGUAGE RankNTypes #-} module TestCase3 where data Buffer = Buffer data BufferRange = BufferRange Buffer Int data BuildSignal a = BufferFull (BuildStep a) | Done a type BuildStep a = BufferRange -> IO (BuildSignal a) newtype Builder = Builder (forall r. BuildStep r -> BuildStep r) write16Bytes :: Int -> Int -> Builder write16Bytes a b = Builder $ \cont rng@(BufferRange Buffer rem) -> let !fa = f a !fb = f b -- doWrite :: BufferRange -> IO (BuildSignal r) doWrite rng' = writeInt fa >> writeInt fb >> cont rng' in if rem < 16 then return $ BufferFull doWrite else doWrite rng where f x = x + 42 writeInt :: Int -> IO () writeInt = print {-# NOINLINE writeInt #-} }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by bgamari: @@ -1,3 +1,4 @@ - When characterizing bytestring's `Builder` interface[1] I noticed that - some benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 - in chunks of 16`) perform inexplicably much worse than others. A glance at + When characterizing `bytestring`'s `Builder` [[ + https://github.com/kolmodin/binary/pull/65|interface]] I noticed that some + benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 in + chunks of 16`) perform inexplicably much worse than others. A glance at @@ -53,2 +54,3 @@ - [1] https://github.com/kolmodin/binary/pull/65 - [2] https://www.haskell.org/pipermail/ghc-devs/2015-January/007997.html + + Also of interest, https://www.haskell.org/pipermail/ghc- + devs/2015-January/007997.html New description: When characterizing `bytestring`'s `Builder` [[ https://github.com/kolmodin/binary/pull/65|interface]] I noticed that some benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 in chunks of 16`) perform inexplicably much worse than others. A glance at the assembly revealed that a large number of `Word8`s are being evaluated and saved to registers, only to be later stored. E.g., {{{#!asm # First we evaluate a number of Word8s, saving them in registers movzbl %bl,%ebx leaq 1(%r14),%rcx movzbl %cl,%ecx leaq 2(%r14),%rdx movzbl %dl,%edx leaq 3(%r14),%rsi movzbl %sil,%esi leaq 4(%r14),%r9 movzbl %r9b,%r9d leaq 5(%r14),%r10 movzbl %r10b,%r10d ... # Eventually we run out of registers and start spilling to the stack movq %rax,64(%rsp) leaq 7(%r14),%rax movzbl %al,%eax movq %rbx,72(%rsp) ... # Only after evaluating all of the needed words do we actually consume # them movq %rax,-152(%r12) movq 72(%rsp),%rax movq %rax,-144(%r12) movq 80(%rsp),%rax movq %rax,-136(%r12) movq 88(%rsp),%rax movq %rax,-128(%r12) ... movq %rsi,-56(%r12) movq %r9,-48(%r12) movq %r10,-40(%r12) movq %r11,-32(%r12) movq %r14,-24(%r12) }}} This is due to the fact that the `Word`s are bound outside of a case analysis and GHC is reluctant to push them inside of the branches. The float-in pass will only float a binding into a case if the value is "small" and at least one branch doesn't use the binding. Unfortunately the case expression in question has only two branches. This is demonstrated in the attached testcase. Also of interest, https://www.haskell.org/pipermail/ghc- devs/2015-January/007997.html -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Description changed by bgamari: @@ -1,4 +1,4 @@ - When characterizing `bytestring`'s `Builder` [[ - https://github.com/kolmodin/binary/pull/65|interface]] I noticed that some - benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 in - chunks of 16`) perform inexplicably much worse than others. A glance at + When characterizing `bytestring`'s `Builder` + [[https://github.com/kolmodin/binary/pull/65|interface]] I noticed that + some benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 + in chunks of 16`) perform inexplicably much worse than others. A glance at New description: When characterizing `bytestring`'s `Builder` [[https://github.com/kolmodin/binary/pull/65|interface]] I noticed that some benchmarks involving repeated appends (e.g. `Host endian/1MB of Word8 in chunks of 16`) perform inexplicably much worse than others. A glance at the assembly revealed that a large number of `Word8`s are being evaluated and saved to registers, only to be later stored. E.g., {{{#!asm # First we evaluate a number of Word8s, saving them in registers movzbl %bl,%ebx leaq 1(%r14),%rcx movzbl %cl,%ecx leaq 2(%r14),%rdx movzbl %dl,%edx leaq 3(%r14),%rsi movzbl %sil,%esi leaq 4(%r14),%r9 movzbl %r9b,%r9d leaq 5(%r14),%r10 movzbl %r10b,%r10d ... # Eventually we run out of registers and start spilling to the stack movq %rax,64(%rsp) leaq 7(%r14),%rax movzbl %al,%eax movq %rbx,72(%rsp) ... # Only after evaluating all of the needed words do we actually consume # them movq %rax,-152(%r12) movq 72(%rsp),%rax movq %rax,-144(%r12) movq 80(%rsp),%rax movq %rax,-136(%r12) movq 88(%rsp),%rax movq %rax,-128(%r12) ... movq %rsi,-56(%r12) movq %r9,-48(%r12) movq %r10,-40(%r12) movq %r11,-32(%r12) movq %r14,-24(%r12) }}} This is due to the fact that the `Word`s are bound outside of a case analysis and GHC is reluctant to push them inside of the branches. The float-in pass will only float a binding into a case if the value is "small" and at least one branch doesn't use the binding. Unfortunately the case expression in question has only two branches. This is demonstrated in the attached testcase. Also of interest, https://www.haskell.org/pipermail/ghc- devs/2015-January/007997.html -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:15 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): It's worthwhile pointing out that #8048 is pointing out a rather similar and perhaps more general problem with GHC's treatment register-heavy code. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8048 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * related: => #8048 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: CodeGen Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8048 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by simonpj): * keywords: => CodeGen -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: CodeGen Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8048 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by michalt): * cc: michalt (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10012: Cheap-to-compute values aren't pushed into case branches inducing unnecessary register pressure -------------------------------------+------------------------------------- Reporter: bgamari | Owner: bgamari Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.4 Resolution: | Keywords: CodeGen Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #8048 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bollu): * cc: bollu (added) Comment: CCing myself because I have been running into related issues with closure- creation-on-the-stack that's not necessary here: https://github.com/bollu/exploration/tree/master/dec-22-asm-diff -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10012#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC