Hi Simon,

Thanks - I already did this for alloca/malloc, I'll add the others from 
your patch.

Thank you.

We go to quite a lot of trouble to avoid locking in the common cases and 
fast paths - most of our data structures are CPU-local.  Where in 
particular have you encountered locking that could be reduced?

The pinned_object_block is CPU-local, usually no locking is required. 
Only when the block is full do we have to get a new block from the block 
allocator, and that requires a lock, but it's a rare case.

OK, the code I have checked out from the repository contains this in "rts/sm/Storage.h":

extern bdescr * pinned_object_block;

And in "rts/sm/Storage.c":

bdescr *pinned_object_block;

My C might be rusty, but I see no way for pinned_object_block to be CPU-local: this is an ordinary global variable, shared by all threads. If it really is CPU-local, what makes it so?
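
To show what I would have expected to see, here is a minimal sketch in plain C (not RTS code). The names block, shared_block, thread_block, per_cpu_block, MAX_CPUS and slot_for_cpu are all made up for illustration, and I am assuming a C11 compiler for _Thread_local:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for bdescr; only here to make the sketch compile. */
typedef struct block { size_t used; } block;

/* What the declaration in Storage.c amounts to: a single object
   that every thread in the process sees and can modify. */
block *shared_block;

/* One way to get a private copy per OS thread (C11 storage class). */
_Thread_local block *thread_block;

/* Another way: one slot per CPU, indexed by a CPU number.
   MAX_CPUS is an arbitrary illustrative limit. */
#define MAX_CPUS 4
block *per_cpu_block[MAX_CPUS];

/* Return the slot that belongs to the given CPU. Distinct CPUs get
   distinct slots, so they never touch each other's pointer. */
block **slot_for_cpu(unsigned cpu)
{
    assert(cpu < MAX_CPUS);
    return &per_cpu_block[cpu];
}
```

With either of the last two forms, each CPU (or thread) updates only its own pointer. With the first form, all of them update the same one.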

As for locking, here is one example:

StgPtr
allocatePinned( lnat n )
{
    StgPtr p;
    bdescr *bd = pinned_object_block;

    // If the request is for a large object, then allocate()
    // will give us a pinned object anyway.
    if (n >= LARGE_OBJECT_THRESHOLD/sizeof(W_)) {
        p = allocate(n);
        Bdescr(p)->flags |= BF_PINNED;
        return p;
    }

    ACQUIRE_SM_LOCK; // [RTVD: here we acquire the lock]

    TICK_ALLOC_HEAP_NOCTR(n);
    CCS_ALLOC(CCCS,n);

    // If we don't have a block of pinned objects yet, or the current
    // one isn't large enough to hold the new object, allocate a new one.
    if (bd == NULL || (bd->free + n) > (bd->start + BLOCK_SIZE_W)) {
        pinned_object_block = bd = allocBlock();
        dbl_link_onto(bd, &g0s0->large_objects);
        g0s0->n_large_blocks++;
        bd->gen_no = 0;
        bd->step   = g0s0;
        bd->flags  = BF_PINNED | BF_LARGE;
        bd->free   = bd->start;
        alloc_blocks++;
    }

    p = bd->free;
    bd->free += n;
    RELEASE_SM_LOCK; // [RTVD: here we release the lock]
    return p;
}

Of course, TICK_ALLOC_HEAP_NOCTR and CCS_ALLOC may require synchronization if they use shared state (which is, again, probably unnecessary). However, if no profiling is going on and "pinned_object_block" is TSO-local, isn't it possible to remove locking from this code completely? The only case where locking is necessary is when a fresh block has to be allocated, and that can be done within the "allocBlock" function (or, more precisely, by using "allocBlock_lock").
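
A minimal sketch of what I have in mind, in standalone C rather than actual RTS code. Everything in it is an assumption made for illustration: the block type, the per-CPU array pinned_block, the constants, the spinlock standing in for the SM lock, and alloc_block_locked standing in for allocBlock_lock. It also leaves out the bookkeeping on g0s0 (dbl_link_onto, n_large_blocks, alloc_blocks), which touches shared state and would have to stay under the lock in the slow path:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_WORDS 512   /* hypothetical block size, in words */
#define MAX_CPUS    4     /* hypothetical number of CPUs */

typedef struct block {
    size_t *start;        /* first word of the block */
    size_t *free;         /* next unallocated word */
} block;

static atomic_flag sm_lock = ATOMIC_FLAG_INIT; /* stand-in for the SM lock */
static unsigned long lock_count;               /* written only under sm_lock */
static block *pinned_block[MAX_CPUS];          /* current block of each CPU */

/* Stand-in for allocBlock_lock(): the only place that takes the lock. */
static block *alloc_block_locked(void)
{
    while (atomic_flag_test_and_set(&sm_lock)) { /* spin until acquired */ }
    lock_count++;
    block *bd = malloc(sizeof *bd);
    size_t *mem = bd ? malloc(BLOCK_WORDS * sizeof *mem) : NULL;
    atomic_flag_clear(&sm_lock);
    if (mem == NULL) abort();
    bd->start = bd->free = mem;
    return bd;
}

/* Allocate n words from the block owned by `cpu`. Assumes that only
   the thread currently running on that CPU calls it with this index. */
size_t *allocate_pinned(unsigned cpu, size_t n)
{
    assert(cpu < MAX_CPUS && n <= BLOCK_WORDS);
    block *bd = pinned_block[cpu];

    /* Slow path: no block yet, or not enough room left in it.
       The full block is simply dropped here; a real RTS would
       keep it on a list for the garbage collector. */
    if (bd == NULL || bd->free + n > bd->start + BLOCK_WORDS) {
        pinned_block[cpu] = bd = alloc_block_locked();
    }

    /* Fast path: bump the CPU-local pointer, no lock involved. */
    size_t *p = bd->free;
    bd->free += n;
    return p;
}

/* How many times the lock has been taken so far. */
unsigned long locks_taken(void) { return lock_count; }
```

In this sketch two small allocations on the same CPU take the lock once, for the initial block, and the second allocation is just a pointer bump.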

The ACQUIRE_SM_LOCK/RELEASE_SM_LOCK pair is present in other places too, but I have not yet analysed whether it is really necessary there. For example, things like newCAF and newDynCAF are wrapped in it.

With kind regards,
Denys Rtveliashvili