
Ah, some other optimization seems to be kicking in here. When I increase
the size of the array to > 128 then I see a call to stg_newByteArray# being
emitted:
{offset
c1kb: // global
if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto c1kd;
c1kc: // global
R1 = Main.main1_closure;
call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
c1kd: // global
I64[Sp - 8] = c1k9;
R1 = 129;
Sp = Sp - 8;
call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8,
upd: 8;
-harendra
On Fri, 7 Apr 2023 at 10:49, Harendra Kumar
Thanks Ben and Carter.
I compiled the following to Cmm:
{-# LANGUAGE MagicHash #-} {-# LANGUAGE UnboxedTuples #-}
import GHC.IO import GHC.Exts
data M = M (MutableByteArray# RealWorld)
main = do _ <- IO (\s -> case newByteArray# 1# s of (# s1, arr #) -> (# s1, M arr #)) return ()
It produced the following Cmm:
{offset c1k3: // global Hp = Hp + 24; if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6; c1k7: // global HpAlloc = 24; R1 = Main.main1_closure; call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8; c1k6: // global I64[Hp - 16] = stg_ARR_WORDS_info; I64[Hp - 8] = 1; R1 = GHC.Tuple.()_closure+1; call (P64[Sp])(R1) args: 8, res: 0, upd: 8; }
It seems to be as good as it gets. There is absolutely no scope for improvement in this.
-harendra
On Fri, 7 Apr 2023 at 03:32, Ben Gamari
wrote: Harendra Kumar
writes: I was looking at the RTS code for allocating small objects via prim ops e.g. newByteArray# . The code looks like:
stg_newByteArrayzh ( W_ n ) { MAYBE_GC_N(stg_newByteArrayzh, n);
payload_words = ROUNDUP_BYTES_TO_WDS(n); words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words; ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
We are making a foreign call here (ccall). I am wondering how much overhead a ccall adds? I guess it may have to save and restore registers. Would it be better to do the fast path case of allocating small objects from the nursery using cmm code like in stg_gc_noregs?
GHC's operational model is designed in such a way that foreign calls are fairly cheap (e.g. we don't need to switch stacks, which can be quite costly). Judging by the assembler produced for newByteArray# in one random x86-64 tree that I have lying around, it's only a couple of data-movement instructions, an %eax clear, and a stack pop:
36: 48 89 ce mov %rcx,%rsi 39: 48 89 c7 mov %rax,%rdi 3c: 31 c0 xor %eax,%eax 3e: e8 00 00 00 00 call 43
43: 48 83 c4 08 add $0x8,%rsp The data movement operations in particular are quite cheap on most microarchitectures where GHC would run due to register renaming. I doubt that this overhead would be noticable in anything but a synthetic benchmark. However, it never hurts to measure.
Cheers,
- Ben