Re: Performance of small allocations via prim ops

6 Apr 2023

      Harendra Kumar  writes:
...
I was looking at the RTS code for allocating small objects via prim ops
e.g. newByteArray# . The code looks like:
stg_newByteArrayzh ( W_ n )
{
    MAYBE_GC_N(stg_newByteArrayzh, n);
payload_words = ROUNDUP_BYTES_TO_WDS(n);
    words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
    ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
We are making a foreign call here (ccall). I am wondering how much overhead
a ccall adds? I guess it may have to save and restore registers. Would it
be better to do the fast path case of allocating small objects from the
nursery using cmm code like in stg_gc_noregs?
GHC's operational model is designed in such a way that foreign calls are
fairly cheap (e.g. we don't need to switch stacks, which can be quite
costly). Judging by the assembler produced for newByteArray# in one
random x86-64 tree that I have lying around, it's only a couple of
data-movement instructions, an %eax clear, and a stack pop:

      36:       48 89 ce                mov    %rcx,%rsi
      39:       48 89 c7                mov    %rax,%rdi
      3c:       31 c0                   xor    %eax,%eax
      3e:       e8 00 00 00 00          call   43 
      43:       48 83 c4 08             add    $0x8,%rsp

The data movement operations in particular are quite cheap on most
microarchitectures where GHC would run due to register renaming. I doubt
that this overhead would be noticable in anything but a synthetic
benchmark. However, it never hurts to measure.

Cheers,

- Ben

Re: Performance of small allocations via prim ops

Ben Gamari