I guess I should explain what that flag does...
The GHC RTS maintains capabilities, the number of capabilities is specified by the `+RTS -N` option. Each capability is a virtual machine that executes Haskell code, and maintains its own runqueue of threads to process.
A capability will perform a context switch at the next heap block allocation (every 4k of allocation) after the timer expires. The timer defaults to 20ms, and can be set by the -C flag. Capabilities perform context switches in other circumstances as well, such as when a thread yields or blocks.
My guess is that either the context switching logic changed in ghc-7.8, or possibly your code used to trigger a switch via some other mechanism (stack overflow or something maybe?), but is optimized differently now so instead it needs to wait for the timer to expire.
The problem we had was that a time-sensitive thread was getting scheduled on the same capability as a long-running non-yielding thread, so the time-sensitive thread had to wait for a context switch timeout (even though there were free cores available!). I expect even with -N4 you'll still see occasional delays (perhaps <5% of calls).
We've solved our problem with judicious use of `forkOn`, but that won't help at N1.
We did see this behavior in 7.6, but it's definitely worse in 7.8.
Incidentally, has there been any interest in a work-stealing scheduler? There was a discussion from about 2 years ago, in which Simon Marlow noted it might be tricky, but it would definitely help in situations like this.
John L.