We discussed trying to come up with a primitive

eagerlyBlackhole# :: a -> a
-- meaning
eagerlyBlackhole# a = runRW# $ \s ->
  case noDuplicate# s of _ -> a

that would guarantee that the thunk is entered by only one thread. There are important situations where that check is redundant. Consider this code:

(m >>= \a -> strictToLazyST (f a)) >>= o

The implementation of >>= needs to wrap the execution of its first argument in eagerlyBlackhole#. In this case, however, there is no risk that two threads will execute m, because it is forced within the same outer eagerlyBlackhole# thunk in which it was created. Is there a cheap way to detect this situation at run time, when executing m, and so avoid the synchronization delay? This probably doesn't arise much for unsafePerformIO, but lazy ST may produce a lot of nested no-duplicate thunks.
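To make the nesting concrete, here is a minimal sketch. It is not the real Control.Monad.ST.Lazy code: the Lazy type, bindL, pureL, tick and nested are hypothetical names invented for illustration, and eagerlyBlackhole is an ordinary function standing in for the proposed primitive, written with the existing noDuplicate# and runRW# from GHC.Exts.

```haskell
{-# LANGUAGE MagicHash #-}
module Main where

import GHC.Exts (noDuplicate#, runRW#)

-- Stand-in for the proposed primitive, following the definition above.
eagerlyBlackhole :: a -> a
eagerlyBlackhole a = runRW# (\s -> case noDuplicate# s of _ -> a)
{-# NOINLINE eagerlyBlackhole #-}

-- Hypothetical toy lazy state-passing monad (an assumption for
-- illustration, not the actual lazy ST representation).
newtype Lazy a = Lazy { runLazy :: Int -> (a, Int) }

pureL :: a -> Lazy a
pureL x = Lazy $ \s -> (x, s)

-- Bind guards the execution of its first argument with eagerlyBlackhole.
bindL :: Lazy a -> (a -> Lazy b) -> Lazy b
bindL m k = Lazy $ \s ->
  let res     = eagerlyBlackhole (runLazy m s)  -- guarded thunk
      (a, s') = res
  in runLazy (k a) s'

-- Returns the current counter and increments it.
tick :: Lazy Int
tick = Lazy $ \s -> (s, s + 1)

-- Same shape as (m >>= \a -> f a) >>= o: the guarded thunk for the
-- inner bind is allocated and forced only while the outer bind's
-- guarded thunk is being evaluated.
nested :: Lazy Int
nested = (tick `bindL` \a -> pureL (a + 10)) `bindL` \b -> pureL (b * 2)

main :: IO ()
main = print (runLazy nested 0)
```

In this sketch the inner res thunk never escapes the evaluation of the outer one, so the inner noDuplicate# call is the redundant check the question is about.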

I looked into fixing things up with RULES for *> and the like, but I ran into trouble with eta expansion. In any case, there are limits to what RULES can do to help >>=, and the whole thing is rather complicated.