Performance of small allocations via prim ops

I was looking at the RTS code for allocating small objects via prim ops, e.g. newByteArray#. The code looks like:

stg_newByteArrayzh ( W_ n )
{
    MAYBE_GC_N(stg_newByteArrayzh, n);

    payload_words = ROUNDUP_BYTES_TO_WDS(n);
    words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
    ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
    ...

We are making a foreign call here (ccall). I am wondering how much overhead a ccall adds? I guess it may have to save and restore registers. Would it be better to do the fast path of allocating small objects from the nursery in Cmm code, as in stg_gc_noregs?

-harendra

That sounds like a worthy experiment!

I guess that would look like an inline, macro'd-up fast path that checks whether it can get the job done, falling back to the general code?

Last I checked, the overhead for this sort of C call was on the order of 10 nanoseconds or less, so it seems very unlikely to be a bottleneck. But do you have any natural or artificial benchmark programs that would showcase this?

For this sort of code, the extra branching for that optimization could easily have a larger performance impact than the known function call on modern hardware. (Though take my intuitions about these things with a grain of salt.)
On Tue, Apr 4, 2023 at 9:50 PM Harendra Kumar wrote:
> [...]

_______________________________________________
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

On Fri, 7 Apr 2023 at 02:18, Carter Schonwald wrote:
> Last I checked, the overhead for this sort of C call was on the order of
> 10 nanoseconds or less [...] but do you have any natural or artificial
> benchmark programs that would showcase this?

I converted my example code into a loop and ran it a million times with a 1-byte array size (which becomes 8 bytes after alignment). So roughly 3 words are allocated per array, including the header and the length field. It took 5 ms with the statically-known-size optimization, which inlines the allocation completely, and 10 ms with an unknown size (from a program argument), which makes a call to newByteArray#. That works out to roughly 5 ns more per allocation. It does not sound like a big deal.

-harendra
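For reference, here is a minimal sketch of this kind of loop. This is my reconstruction, not the original benchmark; the iteration count and array size are illustrative, and the timing itself is meant to be done externally (e.g. with +RTS -s or criterion):

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import GHC.Exts (Int (..), newByteArray#)
import GHC.IO (IO (..))

-- Allocate one MutableByteArray# of the given size and throw it away.
allocOnce :: Int -> IO ()
allocOnce (I# n) = IO $ \s ->
    case newByteArray# n s of
        (# s', _arr #) -> (# s', () #)

main :: IO ()
main = do
    -- One million 1-byte allocations; divide the measured run time
    -- by the loop count to get a per-allocation figure.
    mapM_ (\_ -> allocOnce 1) [1 .. 1000000 :: Int]
    putStrLn "done: 1000000 allocations"
```

Compiling this once with a literal size (as above) and once with the size read from a program argument is what distinguishes the inlined-allocation path from the out-of-line newByteArray# call.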

Great, fast experimentation!

I will admit I'm pleased that my dated intuition is still correct, but more importantly we have more current data!

Thanks for the exploration and for sharing what you found!
On Fri, Apr 7, 2023 at 7:35 AM Harendra Kumar wrote:
> [...]
GHC's operational model is designed in such a way that foreign calls are fairly cheap (e.g. we don't need to switch stacks, which can be quite costly). Judging by the assembler produced for newByteArray# in one random x86-64 tree that I have lying around, it's only a couple of data-movement instructions, an %eax clear, and a stack pop:

  36: 48 89 ce             mov    %rcx,%rsi
  39: 48 89 c7             mov    %rax,%rdi
  3c: 31 c0                xor    %eax,%eax
  3e: e8 00 00 00 00       call   43
  43: 48 83 c4 08          add    $0x8,%rsp

The data-movement operations in particular are quite cheap on most microarchitectures where GHC would run, due to register renaming. I doubt that this overhead would be noticeable in anything but a synthetic benchmark. However, it never hurts to measure.

Cheers,

- Ben
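As a side note (this example is mine, not from the thread): an `unsafe` foreign import from Haskell uses essentially the same cheap calling sequence, so a rough per-call ballpark can be had by timing a tight loop. The choice of labs and the loop count are arbitrary, and the checksum assumes a 64-bit Int:

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CLong)
import System.CPUTime (getCPUTime)

-- 'unsafe' calls skip the safe-call bookkeeping, much like the RTS's
-- internal ccall to allocateMightFail.
foreign import ccall unsafe "stdlib.h labs"
    c_labs :: CLong -> CLong

-- Sum |(-1)| .. |(-n)| through the foreign function so the calls
-- cannot be discarded.
loop :: Int -> Int -> Int
loop 0 acc = acc
loop k acc = loop (k - 1) (acc + fromIntegral (c_labs (fromIntegral (negate k))))

main :: IO ()
main = do
    t0 <- getCPUTime                 -- CPU time in picoseconds
    let total = loop 1000000 0
    print total                      -- forces the loop; prints 500000500000
    t1 <- getCPUTime
    putStrLn ("roughly " ++ show ((t1 - t0) `div` (1000000 * 1000)) ++ " ns per call")
```

The reported figure includes the loop overhead itself, so treat it as an upper bound rather than a precise measurement of the call sequence alone.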

Thanks Ben and Carter.
I compiled the following to Cmm:
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import GHC.IO
import GHC.Exts

data M = M (MutableByteArray# RealWorld)

main :: IO ()
main = do
    _ <- IO (\s -> case newByteArray# 1# s of
                       (# s1, arr #) -> (# s1, M arr #))
    return ()
It produced the following Cmm:
{offset
  c1k3: // global
      Hp = Hp + 24;
      if (Hp > HpLim) (likely: False) goto c1k7; else goto c1k6;
  c1k7: // global
      HpAlloc = 24;
      R1 = Main.main1_closure;
      call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
  c1k6: // global
      I64[Hp - 16] = stg_ARR_WORDS_info;
      I64[Hp - 8] = 1;
      R1 = GHC.Tuple.()_closure+1;
      call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
}
It seems to be as good as it gets. There is absolutely no scope for
improvement in this.
-harendra
On Fri, 7 Apr 2023 at 03:32, Ben Gamari wrote:
> [...]

Ah, some other optimization seems to be kicking in here. When I increase the size of the array to more than 128 bytes, I see a call to stg_newByteArray# being emitted:
{offset
  c1kb: // global
      if ((Sp + -8) < SpLim) (likely: False) goto c1kc; else goto c1kd;
  c1kc: // global
      R1 = Main.main1_closure;
      call (stg_gc_fun)(R1) args: 8, res: 0, upd: 8;
  c1kd: // global
      I64[Sp - 8] = c1k9;
      R1 = 129;
      Sp = Sp - 8;
      call stg_newByteArray#(R1) returns to c1k9, args: 8, res: 8, upd: 8;
}
-harendra
On Fri, 7 Apr 2023 at 10:49, Harendra Kumar wrote:
> [...]

A little bit of grepping in the code gave me this:

emitPrimOp cfg primop =
  let max_inl_alloc_size = fromIntegral (stgToCmmMaxInlAllocSize cfg)
  in case primop of
  NewByteArrayOp_Char -> \case
    [(CmmLit (CmmInt n w))]
      | asUnsigned w n <= max_inl_alloc_size   -- <-- see this line
      -> opIntoRegs  $ \ [res] -> doNewByteArrayOp res (fromInteger n)
    _ -> PrimopCmmEmit_External
We emit more efficient code when the size of the array is small, and the threshold is governed by a compiler flag:

  , make_ord_flag defGhcFlag "fmax-inline-alloc-size"
      (intSuffix (\n d -> d { maxInlineAllocSize = n }))

This means allocation of smaller arrays is extremely efficient, and we can control the threshold using `-fmax-inline-alloc-size`; the default is 128. That's a new thing I learnt today.
Given this new finding, my original question now applies only to the case
when the array size is bigger than this configurable threshold, which is a
little less motivating. And Ben says that the call is not expensive, so we
can leave it there.
-harendra
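To make the threshold concrete, here is a small sketch (mine, not from the thread) that can be compiled with -O and -ddump-cmm, with or without -fmax-inline-alloc-size=N, to see which allocation gets inlined. It assumes -O so that allocBytes inlines and the literal sizes reach the primop:

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import GHC.Exts (Int (..), newByteArray#, sizeofMutableByteArray#)
import GHC.IO (IO (..))

-- Allocate n bytes and return the recorded size, just to keep the
-- array alive and observable.
allocBytes :: Int -> IO Int
allocBytes (I# n) = IO $ \s ->
    case newByteArray# n s of
        (# s', arr #) -> (# s', I# (sizeofMutableByteArray# arr) #)

main :: IO ()
main = do
    a <- allocBytes 16    -- literal <= 128: allocation inlined into the Cmm
    b <- allocBytes 1000  -- above the default threshold: out-of-line call
    print (a, b)
```

Comparing the -ddump-cmm output of the two call sites should show the Hp-bump sequence for the first and a call to stg_newByteArray# for the second, matching the Cmm dumps earlier in the thread.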
On Fri, 7 Apr 2023 at 11:08, Harendra Kumar wrote:
> [...]

> We are emitting more efficient code when the size of the array is
> smaller. And the threshold is governed by a compiler flag.

It would be good if this were documented. Perhaps in the Haddock for `newByteArray#`? Or where?

S
On Fri, 7 Apr 2023 at 07:07, Harendra Kumar wrote:
> [...]

On Fri, 7 Apr 2023 at 12:57, Simon Peyton Jones wrote:
> It would be good if this were documented. Perhaps in the Haddock for
> `newByteArray#`? Or where?

The flag is documented in the GHC user's guide, but the behaviour would be more discoverable if the Haddock for `newByteArray#` mentioned it.

-harendra

I am confused by this flag. It allows arrays of statically known size <= n to be allocated inline from the current nursery block. But looking at the code in allocateMightFail, as I read it, any array up to LARGE_OBJECT_THRESHOLD is allocated from the current nursery block anyway. So why have this option? Why not fix the threshold at LARGE_OBJECT_THRESHOLD? Maybe I am missing something.

-harendra
On Fri, 7 Apr 2023 at 15:45, Harendra Kumar wrote:
> [...]

Harendra Kumar writes:
> I am confused by this flag. [...] So why have this option? Why not fix
> this to LARGE_OBJECT_THRESHOLD?

In principle we could do so. The motivation for making this a flag isn't immediately clear from the commit implementing this optimisation (1eece45692fb5d1a5f4ec60c1537f8068237e9c1).

One complication is that currently GHC has no way to know the value of LARGE_OBJECT_THRESHOLD (which is a runtime-system macro). Typically, to handle this sort of thing, we use utils/deriveConstants to generate a Haskell binding mirroring the value of the C declaration. However, as GHC becomes runtime-retargetable we may need to revisit this design.

Cheers,

- Ben

> One complication is that currently GHC has no way to know the value of
> LARGE_OBJECT_THRESHOLD (which is a runtime system macro). Typically to
> handle this sort of thing we use utils/deriveConstants to generate a
> Haskell binding mirroring the value of the C declaration.

Since https://gitlab.haskell.org/ghc/ghc/-/commit/085983e63bfe6af23f8b85fbfcca8db4... (2021-03) we don't do this any more: we only read constants from the header file provided by the RTS unit. Adding one more constant for LARGE_OBJECT_THRESHOLD shouldn't be an issue.

Cheers,
Sylvain
participants (5)

- Ben Gamari
- Carter Schonwald
- Harendra Kumar
- Simon Peyton Jones
- Sylvain Henry