
Sorry for the length of this. There are three sections: the first explains why I don't like "nonconcurrent" being the default, the second is about bound threads, and the third is about implementing concurrent reentrant on top of state threads.
no, state-threads, a la NSPR, state-threads.sf.net, or any other of a bunch of implementations.
Ah. I was thinking of old-style GHC or Hugs only, where there is one C stack and only the Haskell state is per-Haskell-thread. My bad. So now that I know of an implementation method where they don't cause the same problems they used to cause in GHC, I am no longer opposed to the existence of nonconcurrent reentrant imports.

To me, "nonconcurrent" is still nothing but a hint to the implementation for improving performance; if an implementation doesn't support concurrent reentrancy at all, that is a limitation of the implementation. I think that this is a real problem for libraries; library writers will have to choose whether they preclude their library from being used in multithreaded programs or whether they want to sacrifice portability (unless they spend the time messing around with cpp or something like it).

Some foreign calls are known never to take much time; those can be annotated as nonconcurrent. For calls that might take nontrivial amounts of time, the question of whether they should be concurrent or not *cannot be decided locally*; it depends on what other code is running in the same program.

Maybe the default should be "as concurrent as the implementation supports", with an optional "nonconcurrent" annotation for performance, and an optional "concurrent" annotation to ensure an error/warning when the implementation does not support it. Of course, implementations would be free to provide a flag *as a non-standard extension* that changes the behaviour of unannotated calls.

==== Bound Threads ====

In GHC, there is a small additional cost for each switch to and from a bound thread, but no additional cost for actual foreign call-outs. For jhc, I think you could implement a similar system where there are multiple OS threads, one of which runs multiple state threads; this would have you end up in pretty much the same situation as GHC, with the added bonus of being able to implement foreign import nonconcurrent reentrant for greater performance.
If you don't want to spend the time to implement that, then you could go with a possibly simpler implementation involving inter-thread messages for every foreign call from a bound thread, which would of course be slow (that's the method I'd have recommended to Hugs).

If the per-call cost is an issue, we could have an annotation that can be used whenever the programmer knows that a foreign function does not access thread-local storage. This annotation, the act of calling a foreign import from a forkIO'ed (=non-bound) thread, and the act of calling a foreign import from a Haskell implementation that does not support bound threads, all place this proof obligation on the programmer. Therefore I'd want it to be an explicit annotation, not the default.
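To make that proof obligation concrete, here is a minimal C sketch of why the calling OS thread matters. All names are hypothetical; `lib_make_current` and `lib_current` stand in for a foreign library that keeps per-OS-thread state, and the sketch assumes a C11 compiler with pthreads.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a C library that keeps per-OS-thread state. */
static _Thread_local int current_context = 0;

void lib_make_current(int ctx) { current_context = ctx; }
int  lib_current(void)         { return current_context; }

/* Runs on a fresh OS thread and records what the library reports there. */
static void *worker(void *out) {
    *(int *)out = lib_current();
    return NULL;
}

/* Returns what lib_current() reports when called from another OS thread. */
int lib_current_on_other_thread(void) {
    int seen = -1;
    pthread_t t;
    int rc = pthread_create(&t, NULL, worker, &seen);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, NULL);
    return seen;
}
```

After `lib_make_current(7)` on one thread, `lib_current()` on that thread yields 7, while `lib_current_on_other_thread()` yields the initial value 0. A foreign call that silently moves to a different OS thread would see the second result, which is what a bound thread is meant to rule out.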
"if an implementation supports haskell code running on multiple OS threads, it must support the bound threads proposal. if it does not, then all 'nonconcurrent' foreign calls must be made on the one true OS thread"
*) "Haskell code running on multiple OS threads" is irrelevant. Only the FFI allows you to observe which OS thread you are running in. This should be worded in terms of what kind of concurrent FFI calls are supported, or whether call-in from arbitrary OS threads is supported.

*) Note though that this makes it *impossible* to make a concurrent call to one of Apple's GUI libraries (both Carbon and Cocoa insist on being called from the OS thread that runs the C main function). So good-bye to calculating things in the background while a GUI is waiting for user input.

We could also say that a modified form of the bound threads proposal is actually mandatory; the implementation you have in mind would support it with the following exceptions:

a) Foreign calls from forkIO'ed threads can read and write (a.k.a. interfere with) the thread local state of the "main" OS thread; people are not supposed to call functions that use thread local state from forkIO'ed threads anyway.
b) Concurrent foreign imports might not see the appropriate thread local state.
c) Call-ins from OS threads other than the main thread are not allowed, therefore there is no forkOS and no runInBoundThread. (Or, alternatively, call-ins from other OS threads create unbound threads instead).

==== On the implementability of "concurrent reentrant" ====
It might not be absolutely easy to implement "concurrent reentrant", but it's no harder than concurrent non-reentrant calls.
it is much much harder. you have to deal with your haskell run-time being called into from an _alternate OS thread_ meaning you have to deal with the os threading primitives and locking and mutexi and in general pay a lot of the cost you would for a fully OS threaded implementation.
I don't follow your claim. The generated code for a foreign export will have to

a) check a thread-local flag/the current thread id to see whether we are being called from a non-concurrent reentrant import or from "elsewhere". Checking a piece of thread-local state is FAST.
b) If we are "elsewhere", send an interthread message to the runtime thread.

The runtime thread will need to periodically check whether an interthread message has arrived, and if there is no work, block waiting for it. The fast path of checking whether something has been posted to the message queue is fast indeed - you just have to check a global flag.

So no locking and mutexes -- sorry, I don't buy "mutexi" ;-) -- in your regular code. What is so hard or so inefficient about this? Remember, for concurrent non-reentrant, you will have to deal with inter-OS-thread messaging, too.

About how fast thread-local state really is:

__thread attribute on Linux: ~2 memory load instructions.
__declspec(thread) in MSVC++ on Windows: about the same.
pthread_getspecific on Mac OS X/x86 and Mac OS X/G5: ~10 instructions
pthread_getspecific on Linux and TlsGetValue on Windows: ~10-20 instructions
pthread_getspecific on Mac OS X/G4: a system call :-(.

Also, to just check whether you can use the fast-path call-in, you could optimise things by just checking whether the stack pointer is in the expected range for the runtime OS thread (fast case), or not (slow case).

All in all, I can't see a good excuse to not implement foreign import concurrent reentrant when you've already implemented concurrent nonreentrant.

Cheers,

Wolfgang
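The call-in scheme from steps a) and b) can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not code from GHC or jhc: every name is hypothetical, the "exported function" is a stand-in that doubles its argument, the mailbox holds a single request, and the runtime busy-polls where a real one would block when idle.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* All names below are hypothetical; nothing here is GHC's or jhc's code. */

/* True only on the runtime OS thread; other threads see the default, false. */
static _Thread_local bool on_runtime_thread = false;

/* Global flag the runtime thread polls to learn that a message was posted. */
static atomic_bool message_pending = false;

/* One-slot mailbox, used only on the slow path. */
static pthread_mutex_t callers_lock = PTHREAD_MUTEX_INITIALIZER; /* one caller at a time */
static pthread_mutex_t mailbox_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  reply_cond   = PTHREAD_COND_INITIALIZER;
static int  mailbox_arg, mailbox_result;
static bool reply_ready = false;

/* Counters touched only by the runtime thread. */
static int fast_calls = 0, slow_calls = 0;

/* Stand-in for the body of an exported Haskell function. */
static int exported_body(int x) { return 2 * x; }

void become_runtime_thread(void) { on_runtime_thread = true; }

/* Stub that C code calls for the foreign export. */
int exported_stub(int x) {
    if (on_runtime_thread) {            /* a) thread-local check: fast path */
        fast_calls++;
        return exported_body(x);
    }
    /* b) called from "elsewhere": post a message and wait for the reply */
    pthread_mutex_lock(&callers_lock);
    pthread_mutex_lock(&mailbox_lock);
    mailbox_arg = x;
    reply_ready = false;
    atomic_store(&message_pending, true);
    while (!reply_ready)
        pthread_cond_wait(&reply_cond, &mailbox_lock);
    int result = mailbox_result;
    pthread_mutex_unlock(&mailbox_lock);
    pthread_mutex_unlock(&callers_lock);
    return result;
}

/* Called periodically by the runtime thread; the common case is one load. */
void runtime_poll(void) {
    if (!atomic_load(&message_pending))
        return;
    pthread_mutex_lock(&mailbox_lock);
    mailbox_result = exported_body(mailbox_arg);
    slow_calls++;
    reply_ready = true;
    atomic_store(&message_pending, false);
    pthread_cond_signal(&reply_cond);
    pthread_mutex_unlock(&mailbox_lock);
}

static void *foreign_caller(void *p) {
    int *slot = p;
    *slot = exported_stub(*slot);       /* call-in from another OS thread */
    return NULL;
}

/* Must run on the runtime thread: spawns a foreign OS thread that calls in,
   and polls (a stand-in for the scheduler loop) until it has been served. */
int call_in_from_other_thread(int x) {
    int slot = x;
    int served_before = slow_calls;
    pthread_t t;
    int rc = pthread_create(&t, NULL, foreign_caller, &slot);
    assert(rc == 0);
    (void)rc;
    while (slow_calls == served_before)
        runtime_poll();
    pthread_join(t, NULL);
    return slot;
}
```

After `become_runtime_thread()`, a direct `exported_stub(4)` takes the fast path and never touches a lock, while `call_in_from_other_thread(21)` goes through the mailbox. The mutexes appear only on the slow path; the runtime's regular polling is a single atomic load.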