That sounds like a worthy experiment! 

I guess that would look like an inline, macro-generated fast path that checks whether it can get the job done, falling back to the general code otherwise?

Last I checked, the overhead for this sort of C call was on the order of 10 nanoseconds or less, which seems very unlikely to be a bottleneck. But do you have any natural or artificial benchmark programs that would showcase this?

For this sort of code, the extra branching for that optimization could easily have a larger performance impact than the known function call on modern hardware. (Though take my intuitions about these things with a grain of salt.)

On Tue, Apr 4, 2023 at 9:50 PM Harendra Kumar <harendra.kumar@gmail.com> wrote:
I was looking at the RTS code for allocating small objects via prim ops, e.g. newByteArray#. The code looks like:

stg_newByteArrayzh ( W_ n )
{
    MAYBE_GC_N(stg_newByteArrayzh, n);

    payload_words = ROUNDUP_BYTES_TO_WDS(n);
    words = BYTES_TO_WDS(SIZEOF_StgArrBytes) + payload_words;
    ("ptr" p) = ccall allocateMightFail(MyCapability() "ptr", words);

We are making a foreign call here (ccall). I am wondering how much overhead a ccall adds? I guess it may have to save and restore registers. Would it be better to do the fast-path case of allocating small objects from the nursery using Cmm code, like in stg_gc_noregs?

-harendra
_______________________________________________
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs