Re: Performance of small allocations via prim ops

7 Apr 2023


      ...
We are emitting a more efficient code when the size of the array is
smaller. And the threshold is governed by a compiler flag:
It would be good if this was documented.  Perhaps in the Haddock for
`newByteArray#`?  Or where?

S

On Fri, 7 Apr 2023 at 07:07, Harendra Kumar 
wrote:
...
Little bit of grepping in the code gave me this:
emitPrimOp cfg primop =
  let max_inl_alloc_size = fromIntegral (stgToCmmMaxInlAllocSize cfg)
  in case primop of
  NewByteArrayOp_Char -> \case
    [(CmmLit (CmmInt n w))]
      | asUnsigned w n <= max_inl_alloc_size     --
<------------------------------- see this line
      -> opIntoRegs  $ \ [res] -> doNewByteArrayOp res (fromInteger n)
    _ -> PrimopCmmEmit_External
We are emitting a more efficient code when the size of the array is
smaller. And the threshold is governed by a compiler flag:
, make_ord_flag defGhcFlag "fmax-inline-alloc-size"
      (intSuffix (\n d -> d { maxInlineAllocSize = n }))
This means allocation of smaller arrays is extremely efficient and we can
control it using `-fmax-inline-alloc-size`, the default is 128. That's a
new thing I learnt today.
Given this new finding, my original question now applies only to the case
when the array size is bigger than this configurable threshold, which is a
little less motivating. And Ben says that the call is not expensive, so we
can leave it there.
-harendra
On Fri, 7 Apr 2023 at 11:08, Harendra Kumar 
wrote:
...
Ah, some other optimization seems to be kicking in here. When I increase
the size of the array to > 128 then I see a call to stg_newByteArray# being
emitted:
{offset
       c1kb: // global
           if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto
c1kd;
       c1kc: // global
           R1 = Main.main1_closure;
           call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
       c1kd: // global
           I64[Sp - 8] = c1k9;
           R1 = 129;
           Sp = Sp - 8;
           call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8,
upd: 8;
-harendra
On Fri, 7 Apr 2023 at 10:49, Harendra Kumar 
wrote:
...
Thanks Ben and Carter.
I compiled the following to Cmm:
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
import GHC.IO
import GHC.Exts
data M = M (MutableByteArray# RealWorld)
main = do
     _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M
arr #))
     return ()
It produced the following Cmm:
{offset
       c1k3: // global
           Hp = Hp + 24;
           if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6;
       c1k7: // global
           HpAlloc = 24;
           R1 = Main.main1_closure;
           call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
       c1k6: // global
           I64[Hp - 16] = stg_ARR_WORDS_info;
           I64[Hp - 8] = 1;
           R1 = GHC.Tuple.()_closure+1;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
     }
It seems to be as good as it gets. There is absolutely no scope for
improvement in this.
-harendra
On Fri, 7 Apr 2023 at 03:32, Ben Gamari  wrote:
...
Harendra Kumar  writes:
...
I was looking at the RTS code for allocating small objects via prim
ops
e.g. newByteArray# . The code looks like:
stg_newByteArrayzh ( W_ n )
{
    MAYBE_GC_N(stg_newByteArrayzh, n);
payload_words = ROUNDUP_BYTES_TO_WDS(n);
    words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
    ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
We are making a foreign call here (ccall). I am wondering how much
overhead
a ccall adds? I guess it may have to save and restore registers.
Would it
be better to do the fast path case of allocating small objects from
the
nursery using cmm code like in stg_gc_noregs?
GHC's operational model is designed in such a way that foreign calls are
fairly cheap (e.g. we don't need to switch stacks, which can be quite
costly). Judging by the assembler produced for newByteArray# in one
random x86-64 tree that I have lying around, it's only a couple of
data-movement instructions, an %eax clear, and a stack pop:
36:       48 89 ce                mov    %rcx,%rsi
      39:       48 89 c7                mov    %rax,%rdi
      3c:       31 c0                   xor    %eax,%eax
      3e:       e8 00 00 00 00          call   43

      43:       48 83 c4 08             add    $0x8,%rsp
The data movement operations in particular are quite cheap on most
microarchitectures where GHC would run due to register renaming. I doubt
that this overhead would be noticable in anything but a synthetic
benchmark. However, it never hurts to measure.
Cheers,
- Ben
_______________________________________________
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Re: Performance of small allocations via prim ops

Simon Peyton Jones