If the computation in your C call takes more than 400 nanoseconds, the overhead of the safe FFI convention is less onerous by comparison, and you should use the safe convention when applicable.
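To make the two conventions concrete, here is a minimal sketch that imports the same C function twice, once `unsafe` and once `safe`. The choice of `sin` from `math.h` and the names `c_sin_unsafe` / `c_sin_safe` are just assumptions for illustration (`sin` itself is cheap enough that you would normally leave it `unsafe`); the only point is where the keyword goes.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Main where

import Foreign.C.Types (CDouble (..))

-- unsafe: lowest per-call overhead, but the capability making the call
-- cannot run other Haskell threads until the C function returns.
foreign import ccall unsafe "math.h sin"
  c_sin_unsafe :: CDouble -> CDouble

-- safe: more bookkeeping per call, but the runtime can keep scheduling
-- other Haskell threads while the C function is running.
foreign import ccall safe "math.h sin"
  c_sin_safe :: CDouble -> CDouble

main :: IO ()
main = do
  print (c_sin_unsafe 0)
  print (c_sin_safe 0)
```

Switching a long-running call over is then a one-word change in the import, with no change at the call sites.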
An alternative is to use forkOn to set up a worker thread on each of several GHC capabilities, and have them work in parallel on different chunks of the list. Bear in mind that this way you're pausing ALL the capabilities (which, again, is why you should only use the unsafe FFI convention for computations that take <= 10 microseconds if you want things to be well behaved).
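A rough sketch of the forkOn approach, under some assumptions: `c_sqrt` stands in for whatever short C call you actually have, and `chunk` and `parMapChunks` are hypothetical helper names, not library functions. Each worker is pinned to a capability, forces its own chunk so the C calls really happen on that worker, and hands the result back through an MVar.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Main where

import Control.Concurrent (forkOn, getNumCapabilities)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)
import Control.Monad (forM)
import Foreign.C.Types (CDouble (..))

-- Stand-in for a short C computation (assumption for this sketch).
foreign import ccall unsafe "math.h sqrt"
  c_sqrt :: CDouble -> CDouble

-- Split a list into consecutive pieces of at most n elements.
chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk n xs = let (h, t) = splitAt n xs in h : chunk n t

-- One worker per chunk, each pinned to a capability with forkOn.
parMapChunks :: [CDouble] -> IO [CDouble]
parMapChunks xs = do
  caps <- getNumCapabilities
  let size = max 1 ((length xs + caps - 1) `div` caps)
  vars <- forM (zip [0 ..] (chunk size xs)) $ \(i, c) -> do
    var <- newEmptyMVar
    _ <- forkOn i $ do
      let ys = map c_sqrt c
      -- Force every element here so the C calls run on this worker,
      -- not lazily on whichever thread later reads the result.
      mapM_ evaluate ys
      putMVar var ys
    pure var
  -- Collect results in chunk order.
  concat <$> mapM takeMVar vars

main :: IO ()
main = parMapChunks [1, 4, 9, 16] >>= print
```

To actually get more than one capability, build with `-threaded` and run with `+RTS -N`. The sketch does no exception handling: if a worker died before its `putMVar`, the matching `takeMVar` would block, so real code would want something sturdier there.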