[GHC] #10062: Codegen on sequential FFI calls is not very good

#10062: Codegen on sequential FFI calls is not very good -------------------------------------+------------------------------------- Reporter: chadaustin | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.3 (CodeGen) | Operating System: Unknown/Multiple Keywords: | Type of failure: Runtime Architecture: | performance bug Unknown/Multiple | Blocked By: Test Case: | Related Tickets: Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- I'm writing a library for efficiently building up a byte buffer. The fastest approach I've found is via FFI, with restricted effects like ST. It's over twice as fast as ByteString Builder. Consider this example API usage: https://github.com/chadaustin/buffer- builder/blob/6bd0a39c56f63ab751faf29f9784ac87d52638be/bench/Bench.hs#L46 It compiles into an instruction sequence containing direct, sequenced FFI calls. For example, the last three calls work out to: addq $8,%rsp movq %rbx,%rdi movq 72(%rsp),%rax movq %rax,%rsi subq $8,%rsp movl $0,%eax call bw_append_bsz addq $8,%rsp movq %rbx,%rdi movl $35,%esi subq $8,%rsp movl $0,%eax call bw_append_byte addq $8,%rsp movq %rbx,%rdi movq 64(%rsp),%rax movq %rax,%rsi subq $8,%rsp movl $0,%eax call bw_append_bsz I don't know why rsp is being changed so much. I also can't explain the assignment to eax before the call. (It should also be xorl eax,eax, I would think.) To my reading, the above instruction sequence could be reduced to: movq %rbx,%rdi movq 64(%rsp),%rsi call bw_append_bsz movq %rbx,%rdi movl $35,%esi call bw_append_byte movq %rbx,%rdi movq 56(%rsp),%rsi call bw_append_bsz To reproduce, check out git@github.com:chadaustin/buffer-builder.git at revision 6bd0a39c56f63ab751faf29f9784ac87d52638be cabal configure --enable-benchmarks cabal bench And then look at the ./dist/build/bench/bench-tmp/bench/Bench.dump-asm file. This is specifically on OS X 64-bit with GHC 7.8.3, but I saw similar code generation on GHC 7.6 on Linux 64-bit. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10062 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10062: Codegen on sequential FFI calls is not very good -------------------------------------+------------------------------------- Reporter: chadaustin | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.3 (CodeGen) | Keywords: Resolution: | Architecture: Operating System: Unknown/Multiple | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | Blocking: Blocked By: | Differential Revisions: Related Tickets: | -------------------------------------+------------------------------------- Comment (by Yuras):
I don't know why rsp is being changed so much.
It is necessary because `ccall` calling convention requires alignment that is different from the one used in `ghc`, see [[https://github.com/ghc/ghc/blob/64678e9e8ff0107cac956f0c7b799a1dd317b963/compiler/nativeGen/X86/CodeGen.hs#L2320|here]] for details. Though all this `addq`/`subq` probably can be combined... IIRC we don't have optimization pass for asm (?)
I also can't explain the assignment to eax before the call
It is required by calling convention [[https://github.com/ghc/ghc/blob/64678e9e8ff0107cac956f0c7b799a1dd317b963/compiler/nativeGen/X86/CodeGen.hs#L2357|too]]. So, the generated code is pretty reasonable, but there definitely is space for improvements. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10062#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10062: Codegen on sequential FFI calls is not very good -------------------------------------+------------------------------------- Reporter: chadaustin | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.3 (CodeGen) | Keywords: Resolution: | Architecture: Operating System: Unknown/Multiple | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | Blocking: Blocked By: | Differential Revisions: Related Tickets: | -------------------------------------+------------------------------------- Changes (by Yuras): * cc: shumovichy@… (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10062#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10062: Codegen on sequential FFI calls is not very good -------------------------------------+------------------------------------- Reporter: chadaustin | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.3 (CodeGen) | Keywords: Resolution: | Architecture: Operating System: Unknown/Multiple | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | Blocking: Blocked By: | Differential Revisions: Related Tickets: | -------------------------------------+------------------------------------- Comment (by chadaustin): Wow! Thanks for the fast response! The ABI constraints make sense. In light of them, then yes, the improvements would be: * Change "movl $0,%eax" to "xorq %rax,%rax" (or maybe "movb $0,%al", though I don't know whether a smaller instruction is more important than breaking the data dependency on rax) * Merge the rsp additions and subtractions between FFI calls * Eliminate the extra move in "movq 72(%rsp),%rax; movq %rax,%rsi" -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10062#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10062: Codegen on sequential FFI calls is not very good -------------------------------------+------------------------------------- Reporter: chadaustin | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.3 (CodeGen) | Keywords: Resolution: | Architecture: Operating System: Unknown/Multiple | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | Blocking: Blocked By: | Differential Revisions: Related Tickets: | -------------------------------------+------------------------------------- Comment (by svenpanne): A quick remark regarding "xorq %rax,%rax" vs. "movb $0,%al": I would recommend the xorq variant, we had very bad experiences with data dependencies between partially written registers in the v8 JavaScript JIT. Intel CPUs have funny performance characteristics, and these vary vastly between architectures and even releases of the same architecture. :-/ No need to repeat the pain in GHC... -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10062#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Change "movl $0,%eax" to "xorq %rax,%rax" (or maybe "movb $0,%al",
#10062: Codegen on sequential FFI calls is not very good -------------------------------------+------------------------------------- Reporter: chadaustin | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.3 (CodeGen) | Keywords: Resolution: | Architecture: Operating System: Unknown/Multiple | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | Blocking: Blocked By: | Differential Revisions: Related Tickets: | -------------------------------------+------------------------------------- Comment (by rwbarton): though I don't know whether a smaller instruction is more important than breaking the data dependency on rax) This one is already implemented in GHC 7.10. (xor %eax,%eax is the Officially Correct instruction to use: it is the smallest instruction to clear a register and it has special processor support to break dependency chains, even though it looks like it doesn't). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10062#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10062: Codegen on sequential FFI calls is not very good -------------------------------------+------------------------------------- Reporter: chadaustin | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.3 (CodeGen) | Keywords: Resolution: | Architecture: Operating System: Unknown/Multiple | Unknown/Multiple Type of failure: Runtime | Test Case: performance bug | Blocking: Blocked By: | Differential Revisions: Related Tickets: | -------------------------------------+------------------------------------- Changes (by bgamari): * cc: bgamari (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10062#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10062: Codegen on sequential FFI calls is not very good -------------------------------------+------------------------------------- Reporter: chadaustin | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.3 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Revisions: -------------------------------------+------------------------------------- Description changed by bgamari: Old description:
I'm writing a library for efficiently building up a byte buffer. The fastest approach I've found is via FFI, with restricted effects like ST. It's over twice as fast as ByteString Builder.
Consider this example API usage: https://github.com/chadaustin/buffer- builder/blob/6bd0a39c56f63ab751faf29f9784ac87d52638be/bench/Bench.hs#L46
It compiles into an instruction sequence containing direct, sequenced FFI calls. For example, the last three calls work out to:
addq $8,%rsp movq %rbx,%rdi movq 72(%rsp),%rax movq %rax,%rsi subq $8,%rsp movl $0,%eax call bw_append_bsz
addq $8,%rsp movq %rbx,%rdi movl $35,%esi subq $8,%rsp movl $0,%eax call bw_append_byte
addq $8,%rsp movq %rbx,%rdi movq 64(%rsp),%rax movq %rax,%rsi subq $8,%rsp movl $0,%eax call bw_append_bsz
I don't know why rsp is being changed so much. I also can't explain the assignment to eax before the call. (It should also be xorl eax,eax, I would think.)
To my reading, the above instruction sequence could be reduced to:
movq %rbx,%rdi movq 64(%rsp),%rsi call bw_append_bsz
movq %rbx,%rdi movl $35,%esi call bw_append_byte
movq %rbx,%rdi movq 56(%rsp),%rsi call bw_append_bsz
To reproduce, check out git@github.com:chadaustin/buffer-builder.git at revision 6bd0a39c56f63ab751faf29f9784ac87d52638be
cabal configure --enable-benchmarks cabal bench
And then look at the ./dist/build/bench/bench-tmp/bench/Bench.dump-asm file.
This is specifically on OS X 64-bit with GHC 7.8.3, but I saw similar code generation on GHC 7.6 on Linux 64-bit.
New description: I'm writing a library for efficiently building up a byte buffer. The fastest approach I've found is via FFI, with restricted effects like ST. It's over twice as fast as ByteString Builder. Consider this example API usage: https://github.com/chadaustin/buffer- builder/blob/6bd0a39c56f63ab751faf29f9784ac87d52638be/bench/Bench.hs#L46 It compiles into an instruction sequence containing direct, sequenced FFI calls. For example, the last three calls work out to: {{{ addq $8,%rsp movq %rbx,%rdi movq 72(%rsp),%rax movq %rax,%rsi subq $8,%rsp movl $0,%eax call bw_append_bsz addq $8,%rsp movq %rbx,%rdi movl $35,%esi subq $8,%rsp movl $0,%eax call bw_append_byte addq $8,%rsp movq %rbx,%rdi movq 64(%rsp),%rax movq %rax,%rsi subq $8,%rsp movl $0,%eax call bw_append_bsz }}} I don't know why `rsp` is being changed so much. I also can't explain the assignment to `eax` before the call. (It should also be `xorl eax,eax`, I would think.) To my reading, the above instruction sequence could be reduced to: {{{ movq %rbx,%rdi movq 64(%rsp),%rsi call bw_append_bsz movq %rbx,%rdi movl $35,%esi call bw_append_byte movq %rbx,%rdi movq 56(%rsp),%rsi call bw_append_bsz }}} To reproduce, check out `git@github.com:chadaustin/buffer-builder.git` at revision 6bd0a39c56f63ab751faf29f9784ac87d52638be {{{ cabal configure --enable-benchmarks cabal bench }}} And then look at the `./dist/build/bench/bench-tmp/bench/Bench.dump-asm` file. This is specifically on OS X 64-bit with GHC 7.8.3, but I saw similar code generation on GHC 7.6 on Linux 64-bit. -- -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10062#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#10062: Codegen on sequential FFI calls is not very good -------------------------------------+------------------------------------- Reporter: chadaustin | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.8.3 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by tjakway): * cc: tjakway (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10062#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC