[GHC] #16052: Core optimizations for memset on a small range

#16052: Core optimizations for memset on a small range -------------------------------------+------------------------------------- Reporter: andrewthad | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.3 Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: None/Unknown Unknown/Multiple | Test Case: | Blocked By: Blocking: | Related Tickets: Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- I've been doing some API bindings lately that require zeroing out memory before poking values into the appropriate places. Sometimes, these are small data structures. For instance, on linux, the internet socket struct `sockaddr_in` is 16 bytes. Here's an example (not involving `sockaddr_in`) of the kind of situation that arises: {{{ {-# language MagicHash #-} {-# language UnboxedTuples #-} module FillArray ( fill ) where import GHC.Exts import GHC.IO data ByteArray = ByteArray ByteArray# fill :: IO ByteArray fill = IO $ \s0 -> case newByteArray# 24# s0 of (# s1, m #) -> case setByteArray# m 0# 24# 0# s1 of s2 -> case writeWord8Array# m 4# 14## s2 of s3 -> case writeWord8Array# m 5# 15## s3 of s4 -> case unsafeFreezeByteArray# m s4 of (# s5, r #) -> (# s5, ByteArray r #) }}} This `fill` function allocates a 24-byte array, sets everything to zero, and then writes the numbers 14 and 15 to elements 4 and 5 respectively. With `-O2`, here's the relevant part of the core we get: {{{ fill1 fill1 = \ s0_a140 -> case newByteArray# 24# s0_a140 of { (# ipv_s16i, ipv1_s16j #) -> case setByteArray# ipv1_s16j 0# 24# 0# ipv_s16i of s2_a143 { __DEFAULT -> case writeWord8Array# ipv1_s16j 4# 14## s2_a143 of s3_a144 { __DEFAULT -> case writeWord8Array# ipv1_s16j 5# 15## s3_a144 of s4_a145 { __DEFAULT -> case unsafeFreezeByteArray# ipv1_s16j s4_a145 of { (# ipv2_s16p, ipv3_s16q #) -> (# ipv2_s16p, ByteArray ipv3_s16q #) } } } } } }}} And, here's the relevant assembly: {{{ fill1_info: _c1kL: addq $56,%r12 cmpq 856(%r13),%r12 ja _c1kP _c1kO: movq $stg_ARR_WORDS_info,-48(%r12) movq $24,-40(%r12) leaq -48(%r12),%rax subq $8,%rsp leaq 16(%rax),%rdi xorl %esi,%esi movl $24,%edx movq %rax,%rbx xorl %eax,%eax call memset addq $8,%rsp movb $14,20(%rbx) movb $15,21(%rbx) movq $ByteArray_con_info,-8(%r12) movq %rbx,(%r12) leaq -7(%r12),%rbx jmp *(%rbp) _c1kP: movq $56,904(%r13) movl $fill1_closure,%ebx jmp *-8(%r13) .size fill1_info, .-fill1_info }}} What a bummer that using `memset` for something as small setting three machine words (on a 64 bit platform) results in a `call` instruction getting generated. Why not simply generate three `movb` instructions for the zero initialization instead? Currently, users can work around this by translating their `setByteArray#` call to several `writeWordArray#` calls. This optimization obscures the meaning of written code and is not portable across architectures (so you have to use `CPP` to make it work on 32 bit and 64 bit). I'd like to add a cmm-to-assembly optimization to GHC that does unrolling instead so that the user can write more natural code. Specifically, here's what I'm thinking: * This only happens when the offset into the `ByteArray#` and the length of the range are constant that are multiples of the machine word size. So, `setByteArray# arr 8# 16# x` is eligible on 32-bit and 64-bit platforms. And `setByteArray# arr 4# 8# x` is eligible only on a 32-bit platform. And `setByteArray# arr 16# y x` is not eligible on any platform. * This only happens when the `call memset` instruction has a range of 32 bytes or less. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16052 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16052: Core optimizations for memset on a small range -------------------------------------+------------------------------------- Reporter: andrewthad | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by andrewthad): Weird. According to https://ghc.haskell.org/trac/ghc/wiki/MemcpyOptimizations, GHC already does this. Except that I can see that it doesn’t actually. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16052#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16052: Core optimizations for memset on a small range -------------------------------------+------------------------------------- Reporter: andrewthad | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): Indeed it sounds like the current optimisation is broken. This is a bit surprising given that this is tested in the testsuite. Have you tried raising `-fmax-inline-memset-insns`? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16052#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16052: Core optimizations for memset on a small range -------------------------------------+------------------------------------- Reporter: andrewthad | Owner: (none) Type: task | Status: new Priority: normal | Milestone: 8.9 Component: Compiler | Version: 8.6.3 Resolution: | Keywords: newcomer Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * keywords: => newcomer * milestone: => 8.9 Comment: Actually, now that I think of it, I believe the problem is that we don't know enough about the `ByteArray#`'s alignment for the inline memset logic to fire, given the current implementation. The current implementation is (found in `compiler/nativeGen/X86/CodeGen.hs`) is too strict in its condition: {{{#!hs genCCall dflags _ (PrimTarget (MO_Memset align)) _ [dst, CmmLit (CmmInt c _), CmmLit (CmmInt n _)] _ | fromInteger insns <= maxInlineMemsetInsns dflags && align .&. 3 == 0 = do }}} Another problem is that the codegen logic (`StgCmmPrim.doSetByteArrayOp`) makes too weak a claim about the alignment. It claims that the region is merely byte-aligned, even if the offset is aligned. Given that we know that the beginning of the bytearray is aligned to 16-bytes, we should be able to do better here (and the copy primops). Fixing this would be a nice project. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16052#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16052: Core optimizations for memset on a small range -------------------------------------+------------------------------------- Reporter: andrewthad | Owner: (none) Type: task | Status: new Priority: normal | Milestone: 8.9 Component: Compiler | Version: 8.6.3 Resolution: | Keywords: newcomer Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by andrewthad): I’m going to try to fix this. You’ve listed two problems: 1. Condition too strong 2. Weak alignment claim I don’t understand problem the first one. As far as I can tell, the condition is that the length of the region is less than 128 and that the memory is four-byte aligned. If anything, this condition seems too weak on 64-bit platforms (but correct on 32-bit platforms). I don’t know how the NCG normally deals with the 32 bit vs 64 bit distinction in other places. But the second problem, I agree that that seems like a problem. It’s probably difficult to add a better test for this once it’s fixed. It would need to compile starting with STG, not Cmm (which is what the current test does), and go all the way down to asm. But it might be brittle. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16052#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

As far as I can tell, the condition is that the length of the region is less than 128 and that the memory is four-byte aligned. If anything, this condition seems too weak on 64-bit platforms (but correct on 32-bit
#16052: Core optimizations for memset on a small range -------------------------------------+------------------------------------- Reporter: andrewthad | Owner: (none) Type: task | Status: new Priority: normal | Milestone: 8.9 Component: Compiler | Version: 8.6.3 Resolution: | Keywords: newcomer Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): platforms). I don’t know how the NCG normally deals with the 32 bit vs 64 bit distinction in other places. We could use sub-word-sized stores in the case of non-aligned offsets. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16052#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16052: Core optimizations for memset on a small range -------------------------------------+------------------------------------- Reporter: andrewthad | Owner: (none) Type: task | Status: new Priority: normal | Milestone: 8.9 Component: Compiler | Version: 8.6.3 Resolution: | Keywords: newcomer Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by AndreasK): * cc: AndreasK (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16052#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16052: Core optimizations for memset on a small range -------------------------------------+------------------------------------- Reporter: andrewthad | Owner: (none) Type: task | Status: new Priority: normal | Milestone: 8.9 Component: Compiler | Version: 8.6.3 Resolution: | Keywords: newcomer Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by AndreasK): I'm happy to take a look once there is something to look at.
I don’t know how the NCG normally deals with the 32 bit vs 64 bit distinction in other places.
A lot of code is in the spirit of `is32Bit then foo32 else foo64`.
It’s probably difficult to add a better test for this once it’s fixed. It would need to compile starting with STG, not Cmm (which is what the current test does), and go all the way down to asm. But it might be brittle.
Brittle in what sense? Just grepping for calls to memset should be quite stable as far as making sure we don't emit calls like this. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16052#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16052: Core optimizations for memset on a small range -------------------------------------+------------------------------------- Reporter: andrewthad | Owner: (none) Type: task | Status: new Priority: normal | Milestone: 8.9 Component: Compiler | Version: 8.6.3 Resolution: | Keywords: newcomer Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by andrewthad): I didn't realize we could have tests that just grepped for things. That seems like exactly what I would want. I'll wait to try doing this until after the switch to gitlab is complete. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16052#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16052: Core optimizations for memset on a small range -------------------------------------+------------------------------------- Reporter: andrewthad | Owner: (none) Type: task | Status: new Priority: normal | Milestone: 8.9 Component: Compiler | Version: 8.6.3 Resolution: | Keywords: newcomer Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by AndreasK): @andrewthad: Are you still interested in fixing this? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16052#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#16052: Core optimizations for memset on a small range -------------------------------------+------------------------------------- Reporter: andrewthad | Owner: (none) Type: task | Status: new Priority: normal | Milestone: 8.9 Component: Compiler | Version: 8.6.3 Resolution: | Keywords: newcomer Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: None/Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by andrewthad): Thanks for following up. I won't be trying to do this anytime soon, so if anyone else has the desire to do this, go for it. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/16052#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC