
That sounds like a worthy experiment!
I guess that would look like an inline, macro’d-up fast path that checks
whether it can get the job done and falls back to the general code otherwise?
Last I checked, the overhead for this sort of C call was on the order of
10 nanoseconds or less, which seems very unlikely to be a bottleneck, but
do you have any natural or artificial benchmark programs that would
showcase this?
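
For concreteness, the sort of artificial benchmark I have in mind is just a
tight loop of small newByteArray# allocations, so that the per-allocation
cost (ccall included) dominates the runtime. A rough, untested sketch (the
64-byte size and the iteration count are arbitrary picks of mine):

{-# LANGUAGE BangPatterns, MagicHash, UnboxedTuples #-}
module Main (main) where

import GHC.Clock (getMonotonicTime)
import GHC.Exts
import GHC.IO (IO (..))

-- Allocate one n-byte MutableByteArray#, then write and read one Int
-- so the allocation cannot be thrown away as dead code.
allocOne :: Int -> IO Int
allocOne (I# n) = IO $ \s0 ->
  case newByteArray# n s0 of
    (# s1, marr #) ->
      case writeIntArray# marr 0# 1# s1 of
        s2 -> case readIntArray# marr 0# s2 of
          (# s3, i #) -> (# s3, I# i #)

-- Allocate n small arrays, accumulating a checksum to keep things live.
loop :: Int -> Int -> Int -> IO Int
loop !i !acc n
  | i >= n = pure acc
  | otherwise = do
      x <- allocOne 64              -- a "small" object: 64-byte payload
      loop (i + 1) (acc + x) n

main :: IO ()
main = do
  let iters = 10000000 :: Int
  t0 <- getMonotonicTime
  s  <- loop 0 0 iters
  t1 <- getMonotonicTime
  putStrLn ("checksum: " ++ show s)
  putStrLn ("ns per allocation: "
            ++ show ((t1 - t0) * 1e9 / fromIntegral iters))

Running it with +RTS -s would help separate mutator time from GC time, and
comparing the ns-per-allocation figure before and after any fast-path change
would show whether the ccall is actually visible above the noise.
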
For this sort of code, extra branching for that optimization could easily
have a larger performance impact than the known function call on modern
hardware. (Though take my intuitions about these things with a grain of
salt.)
On Tue, Apr 4, 2023 at 9:50 PM Harendra Kumar wrote:
I was looking at the RTS code for allocating small objects via primops, e.g. newByteArray#. The code looks like:
stg_newByteArrayzh ( W_ n )
{
    MAYBE_GC_N(stg_newByteArrayzh, n);

    payload_words = ROUNDUP_BYTES_TO_WDS(n);
    words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
    ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);
We are making a foreign call here (ccall). I am wondering how much overhead a ccall adds; I guess it may have to save and restore registers. Would it be better to handle the fast path of allocating small objects from the nursery in Cmm code, like in stg_gc_noregs?
-harendra