
#10062: Codegen on sequential FFI calls is not very good
-------------------------------------+-------------------------------------
        Reporter:  chadaustin        |                   Owner:
            Type:  bug               |                  Status:  new
        Priority:  normal            |               Milestone:
       Component:  Compiler          |                 Version:  7.8.3
  (CodeGen)                          |
      Resolution:                    |                Keywords:
Operating System:  Unknown/Multiple  |            Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |               Test Case:
      Blocked By:                    |                Blocking:
 Related Tickets:                    |  Differential Revisions:
-------------------------------------+-------------------------------------
Description changed by bgamari:

Old description:
I'm writing a library for efficiently building up a byte buffer. The fastest approach I've found is via FFI, with restricted effects like ST. It's over twice as fast as ByteString Builder.
Consider this example API usage: https://github.com/chadaustin/buffer-builder/blob/6bd0a39c56f63ab751faf29f9784ac87d52638be/bench/Bench.hs#L46
It compiles into an instruction sequence containing direct, sequenced FFI calls. For example, the last three calls work out to:
addq $8,%rsp
movq %rbx,%rdi
movq 72(%rsp),%rax
movq %rax,%rsi
subq $8,%rsp
movl $0,%eax
call bw_append_bsz

addq $8,%rsp
movq %rbx,%rdi
movl $35,%esi
subq $8,%rsp
movl $0,%eax
call bw_append_byte

addq $8,%rsp
movq %rbx,%rdi
movq 64(%rsp),%rax
movq %rax,%rsi
subq $8,%rsp
movl $0,%eax
call bw_append_bsz
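For readers mapping the assembly back to the calls, here is a minimal C sketch of what callees with this shape might look like. The function names come from the listing above, but the `BufferWriter` struct, the `bw_reserve` helper, and both signatures are assumptions for illustration, as is reading `bsz` as a zero-terminated byte string; the library's actual definitions are not shown in this ticket. Under the System V x86-64 calling convention the first argument would arrive in `%rdi` and the second in `%rsi` (or `%esi` for the byte), which matches the register moves before each `call`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical writer state; the real library's layout is not shown in the ticket. */
typedef struct {
    unsigned char *data;
    size_t size;
    size_t capacity;
} BufferWriter;

/* Hypothetical helper: grow the buffer so that at least `extra` more bytes fit. */
static void bw_reserve(BufferWriter *bw, size_t extra) {
    assert(bw != NULL);
    if (bw->size + extra <= bw->capacity) return;
    size_t cap = bw->capacity ? bw->capacity : 16;
    while (cap < bw->size + extra) cap *= 2;
    unsigned char *p = realloc(bw->data, cap);
    if (p == NULL) abort();
    bw->data = p;
    bw->capacity = cap;
}

/* Assumed signature: writer in %rdi, byte value in %esi. */
void bw_append_byte(BufferWriter *bw, unsigned char byte) {
    bw_reserve(bw, 1);
    bw->data[bw->size++] = byte;
}

/* Assumed signature: writer in %rdi, pointer to a zero-terminated string in %rsi. */
void bw_append_bsz(BufferWriter *bw, const unsigned char *s) {
    size_t n = strlen((const char *)s);
    if (n == 0) return;
    bw_reserve(bw, n);
    memcpy(bw->data + bw->size, s, n);
    bw->size += n;
}
```

Calling `bw_append_bsz`, then `bw_append_byte` with 35 (ASCII `#`), then `bw_append_bsz` again on a zero-initialized writer mirrors the three calls in the listing.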
I don't know why rsp is being changed so much. I also can't explain the assignment to eax before the call. (It should also be xorl eax,eax, I would think.)
To my reading, the above instruction sequence could be reduced to:
movq %rbx,%rdi
movq 64(%rsp),%rsi
call bw_append_bsz

movq %rbx,%rdi
movl $35,%esi
call bw_append_byte

movq %rbx,%rdi
movq 56(%rsp),%rsi
call bw_append_bsz
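One possible reason for the `movl $0,%eax`, offered as an assumption rather than something this ticket confirms: the System V x86-64 ABI uses `%al` as a hidden argument on calls to variadic functions, holding an upper bound on the number of vector registers used for arguments. A caller that cannot tell whether its callee is variadic has to set it on every call. The C sketch below, with hypothetical function names, shows the two kinds of callee; only the variadic one has any use for that value.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical variadic callee. On System V x86-64 the caller passes an
   upper bound on the number of vector registers used in %al. */
double sum_doubles(int count, ...) {
    assert(count >= 0);
    va_list ap;
    va_start(ap, count);
    double total = 0.0;
    for (int i = 0; i < count; i++) {
        total += va_arg(ap, double);
    }
    va_end(ap);
    return total;
}

/* Hypothetical fixed-arity callee. A caller that sees this prototype
   has no need to set %al before the call. */
int add_ints(int a, int b) {
    return a + b;
}
```

If that is what is happening here, the zeroing could only be dropped for callees known not to be variadic.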
To reproduce, check out git@github.com:chadaustin/buffer-builder.git at revision 6bd0a39c56f63ab751faf29f9784ac87d52638be
cabal configure --enable-benchmarks
cabal bench
And then look at the ./dist/build/bench/bench-tmp/bench/Bench.dump-asm file.
This is specifically on OS X 64-bit with GHC 7.8.3, but I saw similar code generation on GHC 7.6 on Linux 64-bit.
New description:

 I'm writing a library for efficiently building up a byte buffer. The fastest approach I've found is via FFI, with restricted effects like ST. It's over twice as fast as ByteString Builder.

 Consider this example API usage: https://github.com/chadaustin/buffer-builder/blob/6bd0a39c56f63ab751faf29f9784ac87d52638be/bench/Bench.hs#L46

 It compiles into an instruction sequence containing direct, sequenced FFI calls. For example, the last three calls work out to:

 {{{
 addq $8,%rsp
 movq %rbx,%rdi
 movq 72(%rsp),%rax
 movq %rax,%rsi
 subq $8,%rsp
 movl $0,%eax
 call bw_append_bsz

 addq $8,%rsp
 movq %rbx,%rdi
 movl $35,%esi
 subq $8,%rsp
 movl $0,%eax
 call bw_append_byte

 addq $8,%rsp
 movq %rbx,%rdi
 movq 64(%rsp),%rax
 movq %rax,%rsi
 subq $8,%rsp
 movl $0,%eax
 call bw_append_bsz
 }}}

 I don't know why `rsp` is being changed so much. I also can't explain the assignment to `eax` before the call. (It should also be `xorl eax,eax`, I would think.)

 To my reading, the above instruction sequence could be reduced to:

 {{{
 movq %rbx,%rdi
 movq 64(%rsp),%rsi
 call bw_append_bsz

 movq %rbx,%rdi
 movl $35,%esi
 call bw_append_byte

 movq %rbx,%rdi
 movq 56(%rsp),%rsi
 call bw_append_bsz
 }}}

 To reproduce, check out `git@github.com:chadaustin/buffer-builder.git` at revision 6bd0a39c56f63ab751faf29f9784ac87d52638be

 {{{
 cabal configure --enable-benchmarks
 cabal bench
 }}}

 And then look at the `./dist/build/bench/bench-tmp/bench/Bench.dump-asm` file.

 This is specifically on OS X 64-bit with GHC 7.8.3, but I saw similar code generation on GHC 7.6 on Linux 64-bit.

--

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10062#comment:7
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler