A little bit of grepping in the code gave me this:

emitPrimOp cfg primop =
  let max_inl_alloc_size = fromIntegral (stgToCmmMaxInlAllocSize cfg)
  in case primop of
  NewByteArrayOp_Char -> \case
    [(CmmLit (CmmInt n w))]
      | asUnsigned w n <= max_inl_alloc_size     -- <------------------------------- see this line
      -> opIntoRegs  $ \ [res] -> doNewByteArrayOp res (fromInteger n)
    _ -> PrimopCmmEmit_External

We emit more efficient code when the size of the array is small enough, and the threshold is governed by a compiler flag:

  , make_ord_flag defGhcFlag "fmax-inline-alloc-size"
      (intSuffix (\n d -> d { maxInlineAllocSize = n }))

This means allocation of smaller arrays is extremely efficient, and we can control the threshold using `-fmax-inline-alloc-size`; the default is 128. That's a new thing I learnt today.
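
For example, if I am reading this right, the threshold can be raised per module (just a sketch, I haven't benchmarked it; since the check above is `<=`, a 256-byte allocation should still take the inline path with -fmax-inline-alloc-size=256):

{-# OPTIONS_GHC -fmax-inline-alloc-size=256 #-}
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import GHC.IO
import GHC.Exts

data M = M (MutableByteArray# RealWorld)

-- With the raised threshold, a 256-byte array should still be allocated
-- inline from the nursery rather than via the stg_newByteArray# call.
main :: IO ()
main = do
     _ <- IO (\s -> case newByteArray# 256# s of (# s1, arr #) -> (# s1, M arr #))
     return ()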

Given this new finding, my original question now applies only to the case where the array size is bigger than this configurable threshold, which is a little less motivating. And since Ben says the call is not expensive, we can leave it there.

-harendra

On Fri, 7 Apr 2023 at 11:08, Harendra Kumar <harendra.kumar@gmail.com> wrote:
Ah, some other optimization seems to be kicking in here. When I increase the size of the array beyond 128 bytes, I see a call to stg_newByteArray# being emitted:

     {offset
       c1kb: // global
           if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto c1kd;
       c1kc: // global
           R1 = Main.main1_closure;
           call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
       c1kd: // global
           I64[Sp - 8] = c1k9;
           R1 = 129;
           Sp = Sp - 8;
           call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8, upd: 8;
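
(For reference, this came from essentially the same program as in my mail below, with only the allocation size changed; roughly the following, assuming 129 bytes to match the R1 = 129 above:)

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import GHC.IO
import GHC.Exts

data M = M (MutableByteArray# RealWorld)

-- 129 bytes is the first size past 128 at which the out-of-line
-- stg_newByteArray# call shows up in the generated Cmm.
main :: IO ()
main = do
     _ <- IO (\s -> case newByteArray# 129# s of (# s1, arr #) -> (# s1, M arr #))
     return ()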

-harendra

On Fri, 7 Apr 2023 at 10:49, Harendra Kumar <harendra.kumar@gmail.com> wrote:
Thanks Ben and Carter.

I compiled the following to Cmm:

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import GHC.IO
import GHC.Exts

data M = M (MutableByteArray# RealWorld)

main = do
     _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M arr #))
     return ()

It produced the following Cmm:

     {offset
       c1k3: // global
           Hp = Hp + 24;
           if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6;
       c1k7: // global
           HpAlloc = 24;
           R1 = Main.main1_closure;
           call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
       c1k6: // global
           I64[Hp - 16] = stg_ARR_WORDS_info;
           I64[Hp - 8] = 1;
           R1 = GHC.Tuple.()_closure+1;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
     }

It seems to be as good as it gets; there is no scope for improvement here.

-harendra

On Fri, 7 Apr 2023 at 03:32, Ben Gamari <ben@smart-cactus.org> wrote:
Harendra Kumar <harendra.kumar@gmail.com> writes:

> I was looking at the RTS code for allocating small objects via prim ops
> e.g. newByteArray# . The code looks like:
>
> stg_newByteArrayzh ( W_ n )
> {
>     MAYBE_GC_N(stg_newByteArrayzh, n);
>
>     payload_words = ROUNDUP_BYTES_TO_WDS(n);
>     words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
>     ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
>
> We are making a foreign call here (ccall). I am wondering how much overhead
> a ccall adds? I guess it may have to save and restore registers. Would it
> be better to do the fast path case of allocating small objects from the
> nursery using cmm code like in stg_gc_noregs?
>
GHC's operational model is designed in such a way that foreign calls are
fairly cheap (e.g. we don't need to switch stacks, which can be quite
costly). Judging by the assembler produced for newByteArray# in one
random x86-64 tree that I have lying around, it's only a couple of
data-movement instructions, an %eax clear, and a stack pop:

      36:       48 89 ce                mov    %rcx,%rsi
      39:       48 89 c7                mov    %rax,%rdi
      3c:       31 c0                   xor    %eax,%eax
      3e:       e8 00 00 00 00          call   43 <stg_newByteArrayzh+0x43>
      43:       48 83 c4 08             add    $0x8,%rsp

The data movement operations in particular are quite cheap, thanks to
register renaming, on most microarchitectures where GHC would run. I doubt
that this overhead would be noticeable in anything but a synthetic
benchmark. However, it never hurts to measure.
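
For instance, something along these lines with criterion would do (purely a
sketch; the sizes and names here are arbitrary):

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import Criterion.Main
import GHC.IO
import GHC.Exts

-- Allocate an n-byte MutableByteArray# and discard it.
allocBytes :: Int -> IO ()
allocBytes (I# n) =
    IO (\s -> case newByteArray# n s of (# s1, _ #) -> (# s1, () #))

main :: IO ()
main = defaultMain
    [ bench "newByteArray# 64"  $ whnfIO (allocBytes 64)   -- inline allocation
    , bench "newByteArray# 256" $ whnfIO (allocBytes 256)  -- out-of-line call
    ]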

Cheers,

- Ben