
systems that don't use an existing user-space thread library (such as Concurrent Haskell or libthread [1]) emulate user-space threads by keeping a pool of processes and re-using them (e.g., IIUC Apache does this).
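For concreteness, here's a minimal sketch of that pooling pattern in C (hypothetical; Apache's real prefork MPM is far more elaborate). The point is that the workers are forked once up front and reused, rather than created per request:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_WORKERS 4

/* Each worker would normally loop, pulling jobs (e.g., accept()ed
   connections) until told to shut down; here it just reports in. */
static void worker_loop(int id) {
    printf("worker %d (pid %ld) ready for reuse\n", id, (long)getpid());
    _exit(0);
}

int main(void) {
    /* Fork the whole pool up front; the creation cost is paid once. */
    for (int i = 0; i < NUM_WORKERS; i++) {
        pid_t pid = fork();
        if (pid == -1) { perror("fork"); exit(1); }
        if (pid == 0)
            worker_loop(i);
    }
    /* Reap workers as they exit. */
    for (int i = 0; i < NUM_WORKERS; i++)
        wait(NULL);
    return 0;
}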
Your response seems to be yet another argument that processes are too expensive to be used the same way as threads. In my mind, pooling vs. fresh creation is relevant to the process-vs.-thread question only in its performance aspects; the fact that people use thread pools means they consider even thread creation too expensive. The central aspect is the default: share-everything, or share-nothing. The latter is much easier to reason about and encourages writing systems with less shared-memory contention.
This is similar to the Plan 9 conception of processes: you have a generic rfork() call that takes flags saying what to share with your parent (namespace, environment, heap, etc.), so the only difference between a thread and a process is the set of flags passed to rfork(). Under the covers, I believe Linux is similar with its clone() call (see the sketch at the end of this post).

The fast-context-switching part seems orthogonal to me. Why does getting the OS involved in context switches kill performance? Is it that the GHC RTS can switch faster because it knows more about the code it's running (i.e., the OS obviously couldn't switch on memory allocations like that)? Or is jumping up to kernel space somehow expensive by nature? And why does the OS need so many more kilobytes to keep track of a thread than the RTS does? I don't really know much about either OSes or language runtimes, so this is interesting to me.
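To make the clone() point concrete, here's a minimal Linux sketch (assuming glibc's clone() wrapper; the flag set is illustrative, not what any particular thread library actually uses), where the only difference between spawning something "thread-like" and something "process-like" is the flags argument:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static int worker(void *arg) {
    printf("child pid %ld running\n", (long)getpid());
    return 0;
}

int main(void) {
    /* clone() needs an explicit stack for the child; the stack grows
       downward, so we pass a pointer to its top. */
    char *stack = malloc(STACK_SIZE);
    if (!stack) { perror("malloc"); exit(1); }

    /* Thread-like: share the address space, fd table, and fs info.
       Process-like (fork-like) would be just SIGCHLD: share nothing. */
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD;

    pid_t pid = clone(worker, stack + STACK_SIZE, flags, NULL);
    if (pid == -1) { perror("clone"); exit(1); }

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}

Drop CLONE_VM and friends and you get an ordinary fork()-style process; either way it's the same kernel primitive underneath.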