[GHC] #13569: Optimise code-gen for field-updates of large products

#13569: Optimise code-gen for field-updates of large products -------------------------------------+------------------------------------- Reporter: hvr | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.0.1 (CodeGen) | Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: Other Unknown/Multiple | Test Case: | Blocked By: Blocking: | Related Tickets: Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- Here's something we talked about last year which I wanted to look into but then got distracted (but I'm writing this one up now so it doesn't get lost again): Consider large field updates such as {{{#!hs module BigRec where data T = C { fld01 :: !Int , fld02 :: !Int , fld03 :: Int , fld04 :: Int , fld05 :: Int , fld06 :: Int , fld07 :: Int , fld08 :: Int , fld09 :: Int , fld10 :: Int , fld11 :: Int , fld12 :: Int , fld13 :: Int , fld14 :: Int , fld15 :: Int , fld16 :: Int , fld17 :: Int , fld18 :: Int , fld19 :: Int } inc01 x = x { fld01 = (fld01 x) + 1 } inc0102 x = x { fld01 = (fld01 x) + 1, fld02 = (fld02 x) + 1 } }}} These get translated into long chains of assignments, which in turn result in respectively long chains of register<->memory move instructions being generated by the NCG {{{#!c BigRec.inc0102_entry() // [R2] { [(c1em, block_c1em_info: const 0; const 31;), (c1ep, BigRec.inc0102_info: const 4294967301; const 0; const 15;)] } {offset c1ep: if ((Sp + -8) < SpLim) goto c1ez; else goto c1eA; c1ez: // nop R1 = BigRec.inc0102_closure; call (I64[BaseReg - 8])(R2, R1) args: 8, res: 0, upd: 8; c1eA: I64[Sp - 8] = block_c1em_info; R1 = R2; Sp = Sp - 8; if (R1 & 7 != 0) goto c1em; else goto c1en; c1en: call (I64[R1])(R1) returns to c1em, args: 8, res: 8, upd: 8; c1em: Hp = Hp + 160; if (Hp > I64[BaseReg + 856]) goto c1eD; else goto c1eC; c1eD: I64[BaseReg + 904] = 160; // nop call stg_gc_unpt_r1(R1) returns to c1em, args: 8, res: 8, upd: 8; c1eC: _s15F::P64 = P64[R1 + 7]; _s15G::P64 = P64[R1 + 15]; _s15H::P64 = P64[R1 + 23]; _s15I::P64 = P64[R1 + 31]; _s15J::P64 = P64[R1 + 39]; _s15K::P64 = P64[R1 + 47]; _s15L::P64 = P64[R1 + 55]; _s15M::P64 = P64[R1 + 63]; _s15N::P64 = P64[R1 + 71]; _s15O::P64 = P64[R1 + 79]; _s15P::P64 = P64[R1 + 87]; _s15Q::P64 = P64[R1 + 95]; _s15R::P64 = P64[R1 + 103]; _s15S::P64 = P64[R1 + 111]; _s15T::P64 = P64[R1 + 119]; _s15U::P64 = P64[R1 + 127]; _s15V::P64 = P64[R1 + 135]; _s15X::I64 = I64[R1 + 151] + 1; _s15W::I64 = I64[R1 + 143] + 1; I64[Hp - 152] = BigRec.C_con_info; P64[Hp - 144] = _s15F::P64; P64[Hp - 136] = _s15G::P64; P64[Hp - 128] = _s15H::P64; P64[Hp - 120] = _s15I::P64; P64[Hp - 112] = _s15J::P64; P64[Hp - 104] = _s15K::P64; P64[Hp - 96] = _s15L::P64; P64[Hp - 88] = _s15M::P64; P64[Hp - 80] = _s15N::P64; P64[Hp - 72] = _s15O::P64; P64[Hp - 64] = _s15P::P64; P64[Hp - 56] = _s15Q::P64; P64[Hp - 48] = _s15R::P64; P64[Hp - 40] = _s15S::P64; P64[Hp - 32] = _s15T::P64; P64[Hp - 24] = _s15U::P64; P64[Hp - 16] = _s15V::P64; I64[Hp - 8] = _s15W::I64; I64[Hp] = _s15X::I64; R1 = Hp - 151; }}} This happens in a few places inside GHC's source-tree (where we have large-ish products, such as DynFlags where we update single fields), which is where I noticed this while looking at generated ASM output. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13569 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13569: Optimise code-gen for field-updates of large products -------------------------------------+------------------------------------- Reporter: hvr | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.0.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Other | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): OK. What do you suggest instead? `memcpy`? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13569#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13569: Optimise code-gen for field-updates of large products -------------------------------------+------------------------------------- Reporter: hvr | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.0.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Other | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by osa1): * cc: osa1 (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13569#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13569: Optimise code-gen for field-updates of large products -------------------------------------+------------------------------------- Reporter: hvr | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.0.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Other | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): Yes, I believe we discussed this around six months ago and the suggestion was to emit a `memcpy` instead. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13569#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13569: Optimise code-gen for field-updates of large products -------------------------------------+------------------------------------- Reporter: hvr | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.0.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Other | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by nomeata): Is memcpy faster than a local loop? Also, would you memcpy the fields before, then set the field, and the fields after? Or would you rather memcpy the whole thing in one go, and then set the single field? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13569#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13569: Optimise code-gen for field-updates of large products -------------------------------------+------------------------------------- Reporter: hvr | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.0.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Other | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): It's best not to view `memcpy` as a "normal" function. C compilers are generally quite eager to inline and unroll small `memcpy`s and will choose from among multiple implementation depending upon the known alignment and size of the copy. Moreover, they also generally have optimization passes which look for code that functionally resembles `memcpy` and replaces it with the builtin. My understanding is that the advantage of `memcpy` is generally felt when you have blocks of multiples of the wordsize as this allows you to leverage the target's SIMD instructions, using its full memory bandwidth. Consequently, the best case here would be that you memcpy the whole heap object and then do whatever updates are necessary. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13569#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13569: Optimise code-gen for field-updates of large products -------------------------------------+------------------------------------- Reporter: hvr | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.0.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Other | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by nomeata): * cc: nomeata (added) Comment: So basically `memcpy` always, even if the object is very small, and leave the rest to the compiler? But we only run an optimizing compiler with the LLVM backend, don’t we? So this should not be done in the NGC? (Did trac stop auto-subscribing commentors?) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13569#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13569: Optimise code-gen for field-updates of large products -------------------------------------+------------------------------------- Reporter: hvr | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.0.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Other | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by nomeata): The comments in http://stackoverflow.com/q/43481900/946226 also suggest that using `memcpy` always might be an option. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13569#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#13569: Optimise code-gen for field-updates of large products -------------------------------------+------------------------------------- Reporter: hvr | Owner: (none) Type: task | Status: new Priority: normal | Milestone: Component: Compiler | Version: 8.0.1 (CodeGen) | Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: | Unknown/Multiple Type of failure: Other | Test Case: Blocked By: | Blocking: Related Tickets: | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): Even our NCG (at least on X86, I haven't checked the others) will unroll small `memcpy`s, so I think this should always be a sensible thing to do. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/13569#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC