
Hi everyone,

Before adding non-concurrent, reentrant calls to the language standard, please take some time to think about what that means. If you have forkIO'ed multiple threads, things start to interact in strange ways. I think this is a can of worms we don't want to open. (Or open again. It's still open in GHC's non-threaded RTS, and the worms are crawling all over the place there).

So I'm going to ask a few questions about the semantics of non-concurrent reentrant calls, and if people can provide answers that don't scare me, I'll concede that they have a place in the language standard.

1.) Assume thread A and B are running. Thread A makes a non-concurrent, reentrant call to Foreign Lands. The foreign function calls a foreign-exported Haskell function 'foo'. While 'foo' is executing, does thread B resume running?

2.) Assume the same situation as in 1, and assume that the answer to 1 is yes. While 'foo' is running, (Haskell) thread B makes a non-concurrent, reentrant foreign call. The foreign function calls back to the foreign-exported Haskell function 'bar'. Because the answer to 1 was yes, 'foo' will resume executing concurrently with 'bar'. If 'foo' finishes executing before 'bar' does, what will happen?

3.) Same situation as in 1. When 'foo' is called, it forks (using forkIO) a Haskell thread C. How many threads are running now?

4.) Should there be any guarantee about (Haskell) threads not making any progress while another (Haskell) thread is executing a non-concurrent call?

Two more questions, not related to semantics:

5.) Assume that Haskell Programmer A writes a Haskell library that uses some foreign code with callbacks, like for example, the GLU Tesselator (comes with OpenGL), or, as a toy example, the C Standard Library's qsort function. Should Programmer A specify "concurrent reentrant" on his foreign import? Programmer B will say "please don't", as he wants to use a Haskell implementation which doesn't support "concurrent reentrant". Programmer C will say "please do", as he wants his application's GUI to stay responsive while the library code is executing. So what should the poor library programmer A do?

6.) Why do people consider it too hard to do interthread messaging for handling a "foreign export" from arbitrary OS threads, when they already agree to spend the same effort on interthread messaging for handling a "foreign import concurrent"? Are there any problems that I am not aware of?

If I am wrong and we should indeed provide non-concurrent reentrant calls in a concurrent Haskell system, then I think there should be satisfying answers to the above questions...

Cheers,

Wolfgang
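(For concreteness, here is roughly what the "toy example" from question 5 looks like in plain FFI syntax: qsort with a Haskell comparison callback, i.e. a reentrant foreign call. This is only a sketch in today's notation; the "concurrent reentrant" annotations under discussion are not shown, and the helper names are made up.)

    {-# LANGUAGE ForeignFunctionInterface #-}
    -- Sketch of question 5's toy example: a *reentrant* foreign call,
    -- because qsort calls back into Haskell through the comparator.
    import Foreign
    import Foreign.C.Types

    type CComparator = Ptr CInt -> Ptr CInt -> IO CInt

    -- "wrapper" imports turn a Haskell function into a C function pointer.
    foreign import ccall "wrapper"
      mkComparator :: CComparator -> IO (FunPtr CComparator)

    foreign import ccall "stdlib.h qsort"
      c_qsort :: Ptr CInt -> CSize -> CSize -> FunPtr CComparator -> IO ()

    sortCInts :: [CInt] -> IO [CInt]
    sortCInts xs =
      withArrayLen xs $ \n buf -> do
        cmp <- mkComparator $ \pa pb -> do
          a <- peek pa
          b <- peek pb
          return (fromIntegral (fromEnum (compare a b) - 1))  -- -1 / 0 / 1
        c_qsort buf (fromIntegral n) (fromIntegral (sizeOf (undefined :: CInt))) cmp
        freeHaskellFunPtr cmp
        peekArray n buf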

On Fri, Mar 31, 2006 at 03:16:50PM -0500, Wolfgang Thaller wrote:
Before adding non-concurrent, reentrant calls to the language standard, please take some time to think about what that means. If you have forkIO'ed multiple threads, things start to interact in strange ways. I think this is a can of worms we don't want to open. (Or open again. It's still open in GHC's non-threaded RTS, and the worms are crawling all over the place there).
I am still digesting your message, but a quick note is that when you specify non-concurrent, you aren't saying "it can't be concurrent" but rather "I don't absolutely need it to be", so GHC would still treat all reentrant calls as concurrent and that is a-okay by the spec.

John

--
John Meacham - john at repetae.net

On Fri, Mar 31, 2006 at 03:16:50PM -0500, Wolfgang Thaller wrote:
So I'm going to ask a few questions about the semantics of non-concurrent reentrant calls, and if people can provide answers that don't scare me, I'll concede that they have a place in the language standard.
first of all, a quick note, for GHC, the answers will be "the same thing it does now with -threaded". but I will try to answer with what a simple cooperative system would do.
1.) Assume thread A and B are running. Thread A makes a non-concurrent, reentrant call to Foreign Lands. The foreign function calls a foreign-exported Haskell function 'foo'. While 'foo' is executing, does thread B resume running?
if 'foo' blocks on an MVar, a read, a write, etc... then yes.
2.) Assume the same situation as in 1, and assume that the answer to 1 is yes. While 'foo' is running, (Haskell) thread B makes a non-concurrent, reentrant foreign call. The foreign function calls back to the foreign-exported Haskell function 'bar'. Because the answer to 1 was yes, 'foo' will resume executing concurrently with 'bar'. If 'foo' finishes executing before 'bar' does, what will happen?
I am confused, why would anything in particular need to happen at all? the threads are completely independent. The non-concurrent calls could just be haskell code that happens to not contain any pre-emption points for all it cares. in particular, in jhc, non-concurrent foreign imports and exports are just C function calls. no boilerplate at all in either direction. calling an imported foreign function is no different than calling one written in haskell, so the fact that threads A and B are calling foreign functions doesn't really change anything.
3.) Same situation as in 1. When 'foo' is called, it forks (using forkIO) a Haskell thread C. How many threads are running now?
3 potentially runnable.
4.) Should there be any guarantee about (Haskell) threads not making any progress while another (Haskell) thread is executing a non-concurrent call?
I don't understand why we would need that at all.
Two more questions, not related to semantics:
5.) Assume that Haskell Programmer A writes a Haskell library that uses some foreign code with callbacks, like for example, the GLU Tesselator (comes with OpenGL), or, as a toy example, the C Standard Library's qsort function. Should Programmer A specify "concurrent reentrant" on his foreign import? Programmer B will say "please don't", as he wants to use a Haskell implementation which doesn't support "concurrent reentrant". Programmer C will say "please do", as he wants his application's GUI to stay responsive while the library code is executing. So what should the poor library programmer A do?
He should say just 'reentrant', since concurrent isn't needed for correctness because the tessellation routines are basic calculations and will return. However, on a system like GHC that actually can run code concurrently and actually would have issues enforcing a 'non-concurrent' guarantee, it would run concurrently anyway. It would be hard not to on an implementation that supports true OS threads, actually. everyone wins. in the absolute worst case there are always #ifdefs but I doubt they will be needed.
6.) Why do people consider it too hard to do interthread messaging for handling a "foreign export" from arbitrary OS threads, when they already agree to spend the same effort on interthread messaging for handling a "foreign import concurrent"? Are there any problems that I am not aware of?
it is not that it is hard (well, it is sort of), it is just absurdly inefficient and you would have no choice but to pay that price for _every_ foreign export, even when not needed, which it mostly won't be. the cost of a foreign export should be a simple 'call' instruction (potentially) when an implementation supports that. the cost of a foreign import concurrent nonreentrant is only paid when actually using such a function, and quite cheap. on linux at least, a single futex, a cached pthread and it gets rolled into the main event loop. so a couple system calls max overhead.

John

--
John Meacham - john at repetae.net

John Meacham wrote:
first of all, a quick note, for GHC, the answers will be "the same thing it does now with -threaded". but I will try to answer with what a simple cooperative system would do.
Sure. Unless someone dares answer "yes" to question 4, GHC will stay as it is.
2.) Assume the same situation as in 1, and assume that the answer to 1 is yes. While 'foo' is running, (Haskell) thread B makes a non-concurrent, reentrant foreign call. The foreign function calls back to the foreign-exported Haskell function 'bar'. Because the answer to 1 was yes, 'foo' will resume executing concurrently with 'bar'. If 'foo' finishes executing before 'bar' does, what will happen?
I am confused, why would anything in particular need to happen at all?
the threads are completely independent. The non-concurrent calls could just be haskell code that happens to not contain any pre-emption points for all it cares. in particular, in jhc, non-concurrent foreign imports and exports are just C function calls. no boilerplate at all in either direction. calling an imported foreign function is no different than calling one written in haskell, so the fact that threads A and B are calling foreign functions doesn't really change anything.
In an implementation which runs more than one Haskell thread inside one OS thread, like ghc without -threaded or hugs, the threads are NOT completely independent, because they share one C stack. So while bar executes, stack frames for both foreign functions will be on the stack, and it will be impossible to return from foo before bar and the foreign function that called it completes. I think this kind of semantics is seriously scary and has no place as default behaviour in the language definition.

If you implement concurrency by using the pthreads library, you need to either make sure that only one thread mutates the heap at a time, or deal with SMP. In either case, concurrent foreign calls would be trivial.
4.) Should there be any guarantee about (Haskell) threads not making any progress while another (Haskell) thread is executing a non-concurrent call?
I don't understand why we would need that at all.
Good. Neither do I, but in the discussions about this issue that we had three years ago several people seemed to argue for that.
5.) [...] So what should the poor library programmer A do?
He should say just 'reentrant', since concurrent isn't needed for correctness because the tessellation routines are basic calculations and will return.
Let's say they will return after a few minutes. So having them block the GUI is a show-stopper for programmer C. And if programmer C happens to use a Haskell implementation that supports "concurrent reentrant" but also a more efficient "non-concurrent reentrant", he will not be able to use the library.
everyone wins. in the absolute worst case there are always #ifdefs but I doubt they will be needed.
Except for programmer C on some haskell implementations. I don't buy it yet :-).
6.) Why do people consider it too hard to do interthread messaging for handling a "foreign export" from arbitrary OS threads, when they already agree to spend the same effort on interthread messaging for handling a "foreign import concurrent"? Are there any problems that I am not aware of?
it is not that it is hard (well, it is sort of), it is just absurdly inefficient and you would have no choice but to pay that price for _every_ foreign export, even when not needed, which it mostly won't be. the cost of a foreign export should be a simple 'call' instruction (potentially) when an implementation supports that.
As we seem to agree that the performance issue is non-existent for implementations that use one OS thread for every haskell thread, and that we don't want to change how GHC works, the following refers to a system like hugs where all Haskell code and the entire runtime system always runs in a single OS thread.

It might not be absolutely easy to implement "concurrent reentrant", but it's no harder than concurrent non-reentrant calls. If a haskell implementation has a hacker on its team who is able to do concurrent non-reentrant calls, then this is no problem either. As for the efficiency argument: if it is sufficiently slow, then that is an argument for including "nonconcurrent reentrant" as an option. It is not an argument for making it the default, or for leaving out "concurrent reentrant".
the cost of a foreign import concurrent nonreentrant is only paid when actually using such a function, and quite cheap. on linux at least, a single futex, a cached pthread and it gets rolled into the main event loop. so a couple system calls max overhead.
Sure. But what gives you the idea that the cost of a foreign export or a foreign import concurrent reentrant would be paid when you are not using them?

If we include nonconcurrent reentrant foreign imports in the system, or if we just optimise foreign imports for the single-threaded case, all that the foreign export would have to do is to check a flag (NO system calls involved). If the callback is from a foreign import concurrent reentrant or if it is from an entirely Haskell-free C thread, then we will have to do an inter-thread RPC to the runtime thread. Unavoidable. For Hugs, I guess that overhead would be absorbed in its general slowness. For Yhc, it might be an issue.

A related performance sink are all foreign imports if such an implementation supports bound threads (which are, after all, needed to use some libraries, like OpenGL and Carbon/Cocoa, from a multi-threaded program). If the foreign function needs to be executed in a dedicated thread, then even a nonconcurrent nonreentrant call would involve inter-thread messaging (in this hypothetical hugs+bound threads). We should consider adding a "nothreadlocal" attribute to foreign imports - when it is known that the foreign function does not access thread-local state, we can use the traditional, more efficient implementation for "foreign import nonconcurrent nonreentrant".

Cheers,

Wolfgang

On Fri, Mar 31, 2006 at 06:41:18PM -0500, Wolfgang Thaller wrote:
I am confused, why would anything in particular need to happen at all?
the threads are completely independent. The non-concurrent calls could just be haskell code that happens to not contain any pre-emption points for all it cares. in particular, in jhc, non-concurrent foreign imports and exports are just C function calls. no boilerplate at all in either direction. calling an imported foreign function is no different than calling one written in haskell, so the fact that threads A and B are calling foreign functions doesn't really change anything.
In an implementation which runs more than one Haskell thread inside one OS thread, like ghc without -threaded or hugs, the threads are NOT completely independent, because they share one C stack. So while bar executes, stack frames for both foreign functions will be on the stack, and it will be impossible to return from foo before bar and the foreign function that called it completes. I think this kind of semantics is seriously scary and has no place as default behaviour in the language definition.
no, state-threads, a la NSPR, state-threads.sf.net, or any other of a bunch of implementations. each thread has its own stack, you 'longjmp' between them. it can almost practically be done in portable C except the mallocing of the new stack, but there are many free libraries (less a library, more a header file with some #ifdefs) for all processors out there that support that, or at least that gcc supports in the first place. this would be by far the easiest way to add concurrency to any haskell compiler; other than the addition of the 'create a new stack' and 'longjmp' primitives, it can be implemented 100% in the standard libraries with haskell code. that is why I am confident in saying it probably won't rule out future implementations we have not thought of yet, since it is mostly pure haskell anyway.
If you implement concurrency by using the pthreads library, you need to either make sure that only one thread mutates the heap at a time, or deal with SMP. In either case, concurrent foreign calls would be trivial.
indeed. but pthreads has its own tradeoffs. there is certainly room for both types of haskell implementations.
4.) Should there be any guarantee about (Haskell) threads not making any progress while another (Haskell) thread is executing a non-concurrent call?
I don't understand why we would need that at all.
Good. Neither do I, but in the discussions about this issue that we had three years ago several people seemed to argue for that.
wacky. I can't think of a reason, it would be quite tricky to pull off with a fully pthreaded implementation anyway.
5.) [...] So what should the poor library programmer A do?
He should say just 'reentrant' since concurrent isn't needed for correctness because the tessalation routines are basic calculations and will return.
Let's say they will return after a few minutes. So having them block the GUI is a show-stopper for programmer C. And if programmer C happens to use a Haskell implementation that supports "concurrent reentrant" but also a more efficient "non-concurrent reentrant", he will not be able to use the library.
well, I think he has a choice to make there about what is more important to him. I admit, it has to be a judgement call at some point, as eventually performance problems become correctness ones. but perhaps this is an argument for a concurrent-hint flag: "make this concurrent and reentrant if possible, but it's gonna be reentrant anyway no matter what". I mean, one could bend the rules and say cooperative systems do implement "concurrent reentrant" with just an incredibly crappy scheduling algorithm, but I think I'd rather have it fail outright than "pretend". but a 'concurrent-hint' flag could be useful, as a library writer may not know the preference of his user.

a completely different solution would be just to foreign import the routine twice, with each convention, and have some way for the user of a library to choose which one they want, perhaps with a flag. of course, both might not be available with all implementations. in any case, I don't think it is a showstopper.
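(To make the "import it twice" idea concrete, a rough sketch. The concurrent/nonconcurrent annotations under discussion don't exist yet, so this uses GHC's current safe/unsafe spellings as the nearest stand-in; the C function "tessellate" and the Mesh type are invented for illustration.)

    {-# LANGUAGE ForeignFunctionInterface, EmptyDataDecls #-}
    import Foreign
    import Foreign.C.Types

    data Mesh   -- opaque; stands in for whatever the C library works on

    -- Same C entry point imported under two conventions; "safe" roughly
    -- plays the role of the proposed "concurrent reentrant" here.
    foreign import ccall safe   "tessellate"
      tessellateConcurrent :: Ptr Mesh -> IO CInt
    foreign import ccall unsafe "tessellate"
      tessellateFast       :: Ptr Mesh -> IO CInt

    -- The library exposes one entry point and lets its user pick.
    tessellate :: Bool -> Ptr Mesh -> IO CInt
    tessellate wantConcurrent p
      | wantConcurrent = tessellateConcurrent p
      | otherwise      = tessellateFast p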
everyone wins. in the absolute worst case there are always #ifdefs but I doubt they will be needed.
Except for programmer C on some haskell implementations. I don't buy it yet :-).
Well, certain implementations will always have their own extensions that people might rely on. I just don't want the language standard itself to rule out valid and useful implementation methods. Haskell with IO multiplexing is a very powerful platform indeed and this proposal lets us keep it in the language proper, and that is very nice from an implementor's and a library writer's point of view. often concurrent haskell is just a nicer way to express things, and being able to use that expressivity in portable libraries is something I look forward to.
6.) Why do people consider it too hard to do interthread messaging for handling a "foreign export" from arbitrary OS threads, when they already agree to spend the same effort on interthread messaging for handling a "foreign import concurrent"? Are there any problems that I am not aware of?
it is not that it is hard (well, it is sort of), it is just absurdly inefficient and you would have no choice but to pay that price for _every_ foreign export, even when not needed, which it mostly won't be. the cost of a foreign export should be a simple 'call' instruction (potentially) when an implementation supports that.
As we seem to agree that the performance issue is non-existent for implementations that use one OS thread for every haskell thread, and that we don't want to change how GHC works, the following refers to a system like hugs where all Haskell code and the entire runtime system always runs in a single OS thread.
It might not be absolutely easy to implement "concurrent reentrant", but it's no harder than concurrent non-reentrant calls. If a haskell implementation has a hacker on its team who is able to do concurrent non-reentrant calls, then this is no problem either. As for the efficiency argument: if it is sufficiently slow, then that is an argument for including "nonconcurrent reentrant" as an option. It is not an argument for making it the default, or for leaving out "concurrent reentrant".
it is much much harder. you have to deal with your haskell run-time being called into from an _alternate OS thread_ meaning you have to deal with the os threading primitives and locking and mutexi and in general pay a lot of the cost you would for a fully OS threaded implementation. foreign concurrent nonreentrant imports can be implemented in pure haskell with just some FFI calls. concurrent reentrant requires changes to the run-time model and imposes a cost to _every_ foreign export.
the cost of a foreign import concurrent nonreentrant is only paid when actually using such a function, and quite cheap. on linux at least, a single futex, a cached pthread and it gets rolled into the main event loop. so a couple system calls max overhead.
Sure. But what gives you the idea that the cost of a foreign export or a foreign import concurrent reentrant would be paid when you are not using them? If we include nonconcurrent reentrant foreign imports in the system, or if we just optimise foreign imports for the single-threaded case, all that the foreign export would have to do is to check a flag (NO system calls involved). If the callback is from a foreign import concurrent reentrant or if it is from an entirely Haskell-free C thread, then we will have to do an inter-thread RPC to the runtime thread. Unavoidable.
the flag will have to be thread-local, as even in cooperative systems multiple foreign calls can be running at once, and it would be a run-time check on every haskell-C transition. in any case, it is a run-time cost. Nothing precludes implementations from providing a 'concurrent reentrant' mode if they really want to (and document it properly); in fact, it is required of implementations that support OS threads in the current proposal. but if you are cooperative anyway it would be a lot of work for little gain. if you write your program such that it will work on a cooperative system, then you already deal with the fact things won't be running at the same time, and if you write your code expecting things to run at the same time, then you are guaranteed 'concurrent reentrant' anyway (assuming the implementation follows all of the OS threading option)
For Hugs, I guess that overhead would be absorbed in its general slowness. For Yhc, it might be an issue.
for jhc it would be a very big issue. jhc compiles to straight C, as in foreign calls are literally just C calls with no more overhead at all. it would probably be so for any other implementations that compile to 'straight c' (or c--), for a suitable definition of straight.
A related performance sink are all foreign imports if such an implementation supports bound threads (which are, after all, needed to use some libraries, like OpenGL and Carbon/Cocoa, from a multi-threaded program). If the foreign function needs to be executed in a dedicated thread, then even a nonconcurrent nonreentrant call would involve inter-thread messaging (in this hypothetical hugs+bound threads). We should consider adding a "nothreadlocal" attribute to foreign imports - when it is known that the foreign function does not access thread-local state, we can use the traditional, more efficient implementation for "foreign import nonconcurrent nonreentrant".
Yeah, I left mention of bound threads out of the basic standard because I don't want to say anything about OS threads in the required standard. but perhaps we should say something on the issue. "if an implementation supports haskell code running on multiple OS threads, it must support the bound threads proposal. if it does not, then all 'nonconcurrent' foreign calls must be made on the one true OS thread" would be good enough for me. it does mean 'concurrent' foreign calls on a non-OS-threaded implementation cannot use TLS (well, the same TLS as each other, that is), but that seems reasonable, as the only way to concurrentize an arbitrary C function we know nothing about is to threadize it. so I don't think there actually is a solution; you just can't run things concurrently that also need to be on the same thread :)

John

--
John Meacham - john at repetae.net

Hello John, Saturday, April 1, 2006, 4:53:00 AM, you wrote:
In an implementation which runs more than one Haskell thread inside one OS thread, like ghc without -threaded or hugs, the threads are NOT completely independent, because they share one C stack. So while
no, state-threads, a la NSPR, state-threads.sf.net, or any other of a bunch of implementations.
each thread has its own stack, you 'longjmp' between them. it can almost practically be done in portable C except the mallocing of the new stack,
new stacks can be allocated by alloca() calls. all these alloca-allocated stack segments can be used as a pool of stacks assigned to the forked threads. although i haven't tried this, my own library also used a processor-specific method. i planned to suggest this for the ghc-via-C compilation we discussed in February. it's great that the libs already exist. why don't you implement this yourself, at least in a cooperative (yield-driven) manner?

--
Best regards,
Bulat                          mailto:Bulat.Ziganshin@gmail.com

On Sat, Apr 01, 2006 at 02:30:30PM +0400, Bulat Ziganshin wrote:
new stacks can be allocated by alloca() calls. all these alloca-allocated stack segments can be used as a pool of stacks assigned to the forked threads. although i haven't tried this, my own library also used a processor-specific method.
so you alloca new big areas and then use 'longjmp' to jump back and forth within the same stack, simulating many stacks? that is a neat trick. it will confuse the hell out of the Boehm garbage collector but I don't want to rely on that much longer anyway :) however, it would be a good thing to fall back to if no processor-specific stack creation routine is available. this minimal threads library http://www.cs.uiowa.edu/%7Ejones/opsys/threads/ uses an interesting trick where it probes the setjmp structure to find the SP reasonably portably on any stack-based architecture. pretty clever.

John

--
John Meacham - john at repetae.net

Hello John, Monday, April 3, 2006, 12:53:05 PM, you wrote:
new stacks can be allocated by alloca() calls. all these alloca-allocated stack segments can be used as a pool of stacks assigned to the forked threads. although i haven't tried this, my own library also used a processor-specific method.
so you alloca new big areas and then use 'longjmp' to jump back and forth within the same stack simulating many stacks?
yes
that is a neat trick. it will confuse the hell out of the Boehm garbage collector but I don't want to rely on that much longer anyway :)
setjmp/longjmp is not compatible with C++ exception handling (because stack unwinding will be confused :) ). maybe this GC is also incompatible with setjmp/longjmp anyway? it will be cool to extend jhc with multi-threading.

--
Best regards,
Bulat                          mailto:Bulat.Ziganshin@gmail.com

Sorry for the length of this. There are three sections: the first is about how I don't like for "nonconcurrent" to be the default, the second is about bound threads and the third is about implementing concurrent reentrant on top of state threads.
no, state-threads, a la NSPR, state-threads.sf.net, or any other of a bunch of implementations.
Ah. I was thinking of old-style GHC or hugs only, where there is one C stack and only the Haskell state is per-haskell-thread. My bad. So now that I know of an implementation method where they don't cause the same problems they used to cause in GHC, I am no longer opposed to the existence of nonconcurrent reentrant imports.

To me, "nonconcurrent" is still nothing but a hint to the implementation for improving performance; if an implementation doesn't support concurrent reentrancy at all, that is a limitation of the implementation. I think that this is a real problem for libraries; library writers will have to choose whether they preclude their library from being used in multithreaded programs or whether they want to sacrifice portability (unless they spend the time messing around with cpp or something like it).

Some foreign calls are known never to take much time; those can be annotated as nonconcurrent. For calls that might take nontrivial amounts of time, the question whether they should be concurrent or not *cannot be decided locally*; it depends on what other code is running in the same program.

Maybe the default should be "as concurrent as the implementation supports", with an optional "nonconcurrent" annotation for performance, and an optional "concurrent" annotation to ensure an error/warning when the implementation does not support it. Of course, implementations would be free to provide a flag *as a non-standard extension* that changes the behaviour of unannotated calls.

==== Bound Threads ====

In GHC, there is a small additional cost for each switch to and from a bound thread, but no additional cost for actual foreign call-outs. For jhc, I think you could implement a similar system where there are multiple OS threads, one of which runs multiple state threads; this would have you end up in pretty much the same situation as GHC, with the added bonus of being able to implement foreign import nonconcurrent reentrant for greater performance. If you don't want to spend the time to implement that, then you could go with a possibly simpler implementation involving inter-thread messages for every foreign call from a bound thread, which would of course be slow (that's the method I'd have recommended to hugs).

If the per-call cost is an issue, we could have an annotation that can be used whenever the programmer knows that a foreign function does not access thread-local storage. This annotation, the act of calling a foreign import from a forkIO'ed (=non-bound) thread, and the act of calling a foreign import from a Haskell implementation that does not support bound threads, all place this proof obligation on the programmer. Therefore I'd want it to be an explicit annotation, not the default.
"if an implementation supports haskell code running on multiple OS threads, it must support the bound threads proposal. if it does not, then all 'nonconcurrent' foreign calls must be made on the one true OS thread"
*) "Haskell code running on multiple OS threads" is irrelevant. Only the FFI allows you to observe which OS thread you are running in. This should be worded in terms of what kind of concurrent FFI calls are supported, or whether call-in from arbitrary OS threads is supported. *) Note though that this makes it *impossible* to make a concurrent call to one of Apple's GUI libraries (both Carbon and Cocoa insist on being called from the OS thread that runs the C main function). So good-bye to calculating things in the background while a GUI is waiting for user input. We could also say that a modified form of the bound threads proposal is actually mandatory; the implementation you have in mind would support it with the following exceptions: a) Foreign calls from forkIO'ed threads can read and write (a.k.a. interfere with) the thread local state of the "main" OS thread; people are not supposed to call functions that use thread local state from forkIO'ed threads anyway. b) Concurrent foreign imports might not see the appropriate thread local state. c) Call-ins from OS threads other than the main thread are not allowed, therefore there is no forkOS and no runInBoundThread. (Or, alternatively, call-ins from other OS threads create unbound threads instead). ==== On the implementability of "concurrent reentrant" ====
It might not be absolutely easy to implement "concurrent reentrant", but it's no harder than concurrent non-reentrant calls.
it is much much harder. you have to deal with your haskell run-time being called into from an _alternate OS thread_ meaning you have to deal with the os threading primitives and locking and mutexi and in general pay a lot of the cost you would for a fully OS threaded implementation.
I don't follow your claim. The generated code for a foreign export will have to

a) check a thread-local flag/the current thread id to see whether we are being called from a non-concurrent reentrant import or from "elsewhere". Checking a piece of thread-local state is FAST.

b) If we are "elsewhere", send an interthread message to the runtime thread.

The runtime thread will need to periodically check whether an interthread message has arrived, and if there is no work, block waiting for it. The fast path of checking whether something has been posted to the message queue is fast indeed - you just have to check a global flag. So no locking and mutexes -- sorry, I don't buy "mutexi" ;-) -- in your regular code. What is so hard or so inefficient about this? Remember, for concurrent non-reentrant, you will have to deal with inter-OS-thread messaging, too.

About how fast thread-local state really is:
__thread attribute on Linux: ~ 2 memory load instructions.
__declspec(thread) in MSVC++ on Windows: about the same.
pthread_getspecific on Mac OS X/x86 and Mac OS X/G5: ~10 instructions
pthread_getspecific on Linux and TlsGetValue on Windows: ~10-20 instructions
pthread_getspecific on Mac OS X/G4: a system call :-(.

Also, to just check whether you can use the fast-path call-in, you could optimise things by just checking whether the stack pointer is in the expected range for the runtime OS thread (fast case), or not (slow case).

All in all, I can't see a good excuse to not implement foreign import concurrent reentrant when you've already implemented concurrent nonreentrant.

Cheers,

Wolfgang

On Mon, Apr 03, 2006 at 02:00:33PM -0400, Wolfgang Thaller wrote:
Sorry for the length of this. There are three sections: the first is about how I don't like for "nonconcurrent" to be the default, the second is about bound threads and the third is about implementing concurrent reentrant on top of state threads.
no, state-threads, a la NSPR, state-threads.sf.net, or any other of a bunch of implementations.
Ah. I was thinking of old-style GHC or hugs only, where there is one C stack and only the Haskell state is per-haskell-thread. My bad. So now that I know of an implementation method where they don't cause the same problems they used to cause in GHC, I am no longer opposed to the existence of nonconcurrent reentrant imports.
Any particular reason hugs and GHC didn't use the state-threads approach, out of curiosity? did it conflict with the push-enter model? (jhc uses the eval-apply model so I am more familiar with that)
To me, "nonconcurrent" is still nothing but a hint to the implementation for improving performance; if an implementation doesn't support concurrent reentrancy at all, that is a limitation of the implementation.
It also implies that a function call will run on the same OS thread as the OS thread the current haskell thread is running on. however, we need not codify this behavior as some future implementations might not follow it or have a well defined meaning for 'OS thread the current haskell thread is running on' (GHC already doesn't when bound threads aren't used, I am led to believe?)
Maybe the default should be "as concurrent as the implementation supports", with an optional "nonconcurrent" annotation for performance, and an optional "concurrent" annotation to ensure an error/warning when the implementation does not support it. Of course, implementations would be free to provide a flag *as a non-standard extension* that changes the behaviour of unannotated calls.
A concurrent hint would be okay. I have a preference for including an explicit annotation in that case. In fact, I'd support it if all properties were explicitly annotated and we didn't allow a blank specification. My only real 'must-have' is that the 4 modes all can be explicitly and unambiguously specified. I have opinions on the syntax/hints but that is more flexible.

Another nice minor thing would be if haskell implementations were required to ignore annotations starting with 'x-' for implementation specific hints.

In jhc there is no such thing as compiler primitives*, there are only FFI imports and a couple primitive imports that don't translate to code (seq, unsafeCoerce). this means that things like 'log' and 'sin' and every basic operation goes through the FFI mechanism so it needs to be _fast_ _fast_. A neat side effect is that jhc's implementation of the prelude is mostly portable to different compilers.

* almost true; for historical reasons (which I hope to fix) there are a few built-in numeric operators.
==== Bound Threads ====
In GHC, there is a small additional cost for each switch to and from a bound thread, but no additional cost for actual foreign call-outs. For jhc, I think you could implement a similar system where there are multiple OS threads, one of which runs multiple state threads; this would have you end up in pretty much the same situation as GHC, with the added bonus of being able to implement foreign import nonconcurrent reentrant for greater performance. If you don't want to spend the time to implement that, then you could go with a possibly simpler implementation involving inter-thread messages for every foreign call from a bound thread, which would of course be slow (that's the method I'd have recommended to hugs).
I am not quite sure whether you are saying something different from what I plan for jhc or not. my current thinking for jhc is: the single one true OS thread for all haskell threads in an EDSM loop (epoll based). the EDSM loop has its own stack (and is, in fact, just another haskell thread as the scheduler is implemented in haskell); each haskell thread has its own stack. non-concurrent calls are just C 'call's. nothing more.

concurrent nonreentrant calls are desugared to haskell code that
    FFI calls 'socketpair(2)'
    pokes arguments into a structure
    FFI calls 'pthread_create'
        pthread_create is passed a function that unpacks the args, calls the
        foreign function, stores the result and writes one byte to one end
        of the socketpair.
    calls 'Concurrent.IO.threadWaitRead' on the other end of the socket pair.
    peeks the return value

(in practice, socketpair will be a much faster 'futex' on linux and OS threads may or may not be cached) very low overhead, probably the minimal possible for an arbitrary unknown FFI call.

An alternate mode I'd like to experiment with one day is the complete opposite side of the spectrum: one OS thread per haskell thread, no guarantees about duplicated work between threads. However, the extra overhead will likely make this mode uncommon except in certain specialized circumstances.
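(A rough Haskell rendering of that desugaring, under the assumption of a small C helper -- here called run_in_worker -- that does the pthread_create, the actual foreign call and the write to the socketpair. The helper, the CallArgs block and all the names are invented for illustration.)

    {-# LANGUAGE ForeignFunctionInterface, EmptyDataDecls #-}
    import Control.Concurrent (threadWaitRead)
    import Foreign
    import Foreign.C.Types
    import System.Posix.Types (Fd(..))

    data CallArgs   -- opaque argument/result block filled in by generated code

    -- Hypothetical C helper: spawns (or reuses) an OS thread, starts the real
    -- foreign call, and returns the read end of the socketpair to wait on.
    foreign import ccall unsafe "run_in_worker"
      c_runInWorker :: Ptr CallArgs -> IO CInt

    concurrentCall :: Ptr CallArgs -> IO ()
    concurrentCall args = do
      fd <- c_runInWorker args   -- kick off the call on another OS thread
      threadWaitRead (Fd fd)     -- other Haskell threads keep running meanwhile
      -- generated code then peeks the result out of the CallArgs block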
If the per-call cost is an issue, we could have an annotation that can be used whenever the programmer knows that a foreign function does not access thread-local storage. This annotation, the act of calling a foreign import from a forkIO'ed (=non-bound) thread, and the act of calling a foreign import from a Haskell implementation that does not support bound threads, all place this proof obligation on the programmer. Therefore I'd want it to be an explicit annotation, not the default.
well, the cost of bound threads is not the cost of the call itself, it is that they must be serialized. foreign concurrent calls can run concurrently with haskell code and each other, but 'bound' foreign calls must wait for their dedicated thread to become available. I think there needs to be an annotation as to which functions require boundness so that all foreign calls aren't suddenly serialized just because they are in a 'forkOS'ed thread. in any case, the cost would end up on the foreign call in a cooperative system most likely, as it will have to find and wait for the specific thread rather than just using the current one or creating a new one.
We could also say that a modified form of the bound threads proposal is actually mandatory; the implementation you have in mind would support it with the following exceptions:
a) Foreign calls from forkIO'ed threads can read and write (a.k.a. interfere with) the thread local state of the "main" OS thread; people are not supposed to call functions that use thread local state from forkIO'ed threads anyway.
b) Concurrent foreign imports might not see the appropriate thread local state.
c) Call-ins from OS threads other than the main thread are not allowed, therefore there is no forkOS and no runInBoundThread. (Or, alternatively, call-ins from other OS threads create unbound threads instead).
I'll have to think about these rules some. We will need to say something about bound threads.
==== On the implementability of "concurrent reentrant" ====
It might not be absolutely easy to implement "concurrent reentrant", but it's no harder than concurrent non-reentrant calls.
it is much much harder. you have to deal with your haskell run-time being called into from an _alternate OS thread_ meaning you have to deal with the os threading primitives and locking and mutexi and in general pay a lot of the cost you would for a fully OS threaded implementation.
I don't follow your claim. The generated code for a foreign export will have to a) check a thread-local flag/the current thread id to see whether we are being called from a non-concurrent reentrant import or from "elsewhere". Checking a piece of thread-local state is FAST. b) If we are "elsewhere", send an interthread message to the runtime thread. The runtime thread will need to periodically check whether an interthread message has arrived, and if there is no work, block waiting for it. The fast path of checking whether something has been posted to the message queue is fast indeed - you just have to check a global flag.

I'd integrate it into the EDSM loop somehow (futex maybe) as I have a moral aversion to periodic checking of anything.
the main thing is that it is a cost paid by every foreign export. perhaps a flag on exports saying "this will only be called nonconcurrently", though; perhaps that can be an x-flag if other compilers can't take advantage of it.

In my survey of when 'reentrant concurrent' was needed, I looked at all the standard libraries and didn't find anywhere it was actually needed. Are there some compelling examples of when it is really needed in a setting that doesn't have OS threads to begin with? (I am not asserting they don't exist, I just want to see some example uses of this feature to get a better handle on the implementation cost)
Remember, for concurrent non-reentrant, you will have to deal with inter-OS-thread messaging, too.
just a wait on a pipe, which is exactly what the EDSM loop does anyway, so nothing thread-specific at all other than the call to pthread_create.
About how fast thread-local state really is:
__thread attribute on Linux: ~ 2 memory load instructions.
__declspec(thread) in MSVC++ on Windows: about the same.
pthread_getspecific on Mac OS X/x86 and Mac OS X/G5: ~10 instructions
pthread_getspecific on Linux and TlsGetValue on Windows: ~10-20 instructions
pthread_getspecific on Mac OS X/G4: a system call :-(.
how prevalent is support for __thread BTW? is it required by any standards or is it an ELFism?
Also, to just check whether you can use the fast-path call-in, you could optimise things by just checking whether the stack pointer is in the expected range for the runtime OS thread (fast case), or not (slow case).

neat trick.
All in all, I can't see a good excuse to not implement foreign import concurrent reentrant when you've already implemented concurrent nonreentrant.
I am beginning to agree, perhaps. I still definitely want them to be declared separately in the FFI declaration though. I will happily trade some FFI verbosity for greater future-proofness when designing a language standard that is meant to last.

John

--
John Meacham - john at repetae.net

John Meacham wrote (... but I've reordered things):
My only real 'must-have' is that the 4 modes all can be explicitly and unambiguously specified. I have opinions on the syntax/hints but that is more flexible.
I basically agree (the syntax discussion will take place in the years after the semantics discussion), but... I want programmers to have a way of saying "this function might spend a lot of time in foreign lands". These calls should be concurrent on all implementations that support it (because some separately developed/maintained piece of Haskell code might expect to run a computation in the background), but implementations that don't support it shouldn't flag an error, because that would encourage library writers to specify nonconcurrent when they can't prove that it's safe, or make code needlessly nonportable.

Another way to look at it: You cannot decide whether the call actually has to be done concurrently by just looking at the call site - you'd need to look at the entire program, and asking people (especially library writers) to state and guarantee global properties of a program that might not even be finished yet is a Bad Thing. Therefore, the concurrency annotation on the foreign import can only be a hint on whether the foreign function is guaranteed to return quickly or not; the actual requirement for the call to be "concurrent" is hidden in the other code that expects to run at the same time.

Therefore, it would be wrong for an implementation that doesn't support concurrent calls (reentrant or nonreentrant, I don't care) to flag an error; the foreign import declaration correctly refuses to give a guarantee that the function will return quickly. The error is in the code that expects to run concurrently with a foreign import on an implementation that doesn't support that (but of course, a compiler can't detect such an error).
Another nice minor thing would be if haskell implementations were required to ignore annotations starting with 'x-' for implementation specific hints.
Sounds good. Syntax discussion postponed again ('x-' looks so mime-typish. Could we add a meaningless 'application/' to the front? Just kidding).
In my survey of when 'reentrant concurrent' was needed, I looked at all the standard libraries and didn't find anywhere it was actually needed. Are there some compelling examples of when it is really needed in a setting that doesn't have OS threads to begin with? (I am not asserting they don't exist, I just want to see some example uses of this feature to get a better handle on the implementation cost)
In my experience, reentrant calls are rare in non-GUI code, but they become quite common in GUI code (OK, in some GUI libraries, there is only one, called something like RunMainEventLoop, but then it runs almost all of the time and is absolutely crucial). And with most GUI libraries, the GUI's main event loop will refuse to cooperate well with a Haskell implementation's scheduler, so it will need to be called as a "concurrent" foreign import if your application is to do any background processing while waiting for events.

Other libraries that rely on callbacks would include the GLU Tesselator that I already mentioned, as well as several packages for solving optimisation problems. For those, concurrency would probably only become an issue when they are used with a GUI (even if it's only to display a progress bar). Another reason why you don't see them in Haskell standard library code might be that everyone prefers Data.List.sort to foreign import ccall qsort.
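(For what it's worth, the GUI pattern described above looks roughly like this. RunMainEventLoop and the "safe" spelling of a concurrent call are stand-ins, not the actual names of any particular toolkit's entry point or of the annotation being standardised.)

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Control.Concurrent
    import Control.Monad (forever)

    -- Hypothetical toolkit entry point: runs until the user quits the app.
    -- Imported "safe" so that it may run concurrently with Haskell threads.
    foreign import ccall safe "RunMainEventLoop"
      c_runMainEventLoop :: IO ()

    main :: IO ()
    main = do
      -- background processing keeps going while the event loop sits in C land,
      -- but only if the import above is a concurrent one
      _ <- forkIO $ forever $ do
             putStrLn "recomputing something expensive..."
             threadDelay 1000000
      c_runMainEventLoop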
Any particular reason hugs and GHC didn't use the state-threads approach, out of curiosity? did it conflict with the push-enter model? (jhc uses the eval-apply model so I am more familiar with that)
It was before my time. I guess it's because GHC uses a separate heap-allocated Haskell thread, so it made sense not to bother to allocate a separate C stack for every one of them. Don't know about Hugs.
It also implies that a function call will run on the same OS thread as the OS thread the current haskell thread is running on.
This shouldn't be put into a standard, as the bound threads proposal already gives a different guarantee about that, and both guarantees taken together probably guarantee too much - taken together, they probably mean every Haskell thread has to be an OS thread. It might be an implementation-specific guarantee, unless the bound threads become a part of the standard in their entirety.
'OS thread the current haskell thread is running on' (GHC already doesn't when bound threads aren't used, I am led to believe?)
There should be no such thing as the 'OS thread the current haskell thread is running on' in any standard; OS thread identity is only observed through the FFI.
this means that things like 'log' and 'sin' and every basic operation goes through the FFI mechanism so it needs to be _fast_ _fast_. A neat side effect is that jhc's implementation of the prelude is mostly portable to different compilers.
I, too, want foreign import nonconcurrent nonreentrant to compile to a plain call without any extras. GHC achieves that goal, even in the presence of bound threads; I'm optimistic about jhc + state threads, too.

==== Bound Threads / Implementation methods ====
I am not quite sure whether you are saying something different from what I plan for jhc or not, my current thinking for jhc is, [...] An alternate mode I'd like to experiment with one day is the complete opposite side of the spectrum:
one OS thread per haskell thread, no guarantees about duplicated work between threads.
Let me add a third one: One OS thread per haskell thread, locks & condition variables used to make them behave just like state threads. Creating threads and switching from one thread to another would be slower than with state threads by about the time it takes to do a trip to the kernel and back. Nonconcurrent foreign imports are just plain calls, concurrent foreign imports have to release the lock (and maybe signal another thread) and then re-acquire it afterwards. Foreign exports need to check whether the current OS thread owns the lock (TLS access), and wait for the lock if it doesn't.

And a fourth one: Your "single true OS thread" runs all *unbound* Haskell threads and your scheduler (on separate stacks). Bound threads run in their own OS threads, and the scheduler will use OS thread primitives (foreign imported nonconcurrent) to run them and to wait for their time slice to finish (if you're doing preemption) or for them to block. It's like "if target thread is bound, use OS thread primitives, else use State Thread primitives to pass control to it". Foreign import nonconcurrent is a plain call. Foreign import concurrent from a bound thread releases the lock before calling (and waits for it afterwards). From an unbound thread you could implement it the way you planned to. Call-ins would need that one stack pointer range check, and wait for a lock if it fails.
well, the cost of bound threads is not the cost of the call itself, it is that they must be serialized. foreign concurrent calls can run concurrently with haskell code and each other, but 'bound' foreign calls must wait for their dedicated thread to become available. I think there needs to be an annotation as to which functions require boundness so that all foreign calls aren't suddenly serialized just because they are in a 'forkOS'ed thread.
I think you're mistaken here. For every given OS thread, *at most one* Haskell thread will ever be bound to it. So when you make a foreign call, you can be sure that the OS thread that is supposed to execute it is available *right now*, because the only Haskell thread that has the right to cause code to be executed in the bound OS thread has obviously finished any previous foreign calls. No need to wait. Ever.

In the bound threads proposal, the only basic method to create a bound thread is a call to a foreign export (or to main). The resulting thread of Haskell execution is bound to the OS thread that made the call. The library function forkOS is just a call to pthread_create or its Windows equivalent.

It would be possible to add an annotation to "foreign export" that states that "the Haskell thread that results from a call-in to this function does not need to be bound" (if that would improve performance), but back when "bound threads" were born we decided against it to keep things simple; instead, we added "runInUnboundThread :: IO a -> IO a" to keep the people who were concerned about performance happy, and I have yet to see it used by anyone.
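(A tiny usage sketch of what this buys you, using the bound-threads API as it exists in GHC's Control.Concurrent; the guiStep action is a made-up stand-in for a call into a library with OS-thread-local state.)

    import Control.Concurrent

    -- stand-in for a foreign call that relies on OS-thread-local state
    guiStep :: Int -> IO ()
    guiStep n = putStrLn ("handling GUI event " ++ show n)

    main :: IO ()
    main = do
      done <- newEmptyMVar
      -- forkOS creates a *bound* Haskell thread: every foreign call it makes
      -- runs on the same freshly created OS thread, and no other Haskell
      -- thread is bound to that OS thread, so the call never has to wait.
      _ <- forkOS $ do
             mapM_ guiStep [1..3]
             putMVar done ()
      -- forkIO threads are unbound; fine for foreign calls without
      -- thread-local state, and cheaper to create and schedule.
      _ <- forkIO (putStrLn "background work")
      takeMVar done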
==== On the implementability of "concurrent reentrant" ====

[...] b) The runtime thread will need to periodically check whether an interthread message has arrived, and if there is no work, block waiting for it. The fast path of checking whether something has been posted to the message queue is fast indeed - you just have to check a global flag.
I'd integrate it into the EDSM loop somehow (futex maybe) as I have a moral aversion to periodic checking of anything.
Yes, I was thinking of checking the flag just as a way to avoid calling epoll if a foreign call is already waiting. Basically, you'll want to write a message to a pipe so that it gets picked up by the EDSM loop. After all, a pipe is nothing but a traditional unix-style message queue :-).
the main thing is that it is a cost paid by every foreign export. perhaps a flag on exports saying "this will only be called nonconcurrently", though; perhaps that can be an x-flag if other compilers can't take advantage of it.
Yes, such an x-flag would be entirely reasonable if those few instructions for a stack pointer range test become significant. Those functions will, of course, be very rare.
how prevalent is support for __thread BTW? is it required by any standards or is it an ELFism?
It is an ELFism by birth; __declspec(thread) is the equivalent MSVCism, and I don't know of any other equivalent features (but then my horizon doesn't extend much beyond the Mac/Linux/Windows triad). I hope that Apple and the mingw32 team soon implement it, too, but that's just an unfounded hope.

That's all for now,

Cheers,

Wolfgang

Hello John, Tuesday, April 4, 2006, 5:55:19 AM, you wrote:
In my survey of when 'reentrant concurrent' was needed, I looked at all the standard libraries and didn't find anywhere it was actually needed. Are there some compelling examples of when it is really needed in a setting that doesn't have OS threads to begin with?
may be i can help? :) i'm not sure that i understood you properly, but my own compression program uses several compression algorithms, written in C, in parallel - i.e. output from the first algorithm is piped to the second and so on. if this is not enough - all these algorithms are executed via one function, which receives the algorithm name and input/output routines, so i run in parallel, for example:

    compress "words" read_func1 write_func1
and
    compress "ppm" read_func2 write_func2

--
Best regards,
Bulat                          mailto:Bulat.Ziganshin@gmail.com
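(A rough Haskell rendering of that scenario, to make the "concurrent reentrant" requirement concrete. Everything here is invented for illustration: the C entry point compress(name, read_cb, write_cb), the callback types, and the do-nothing callbacks. The point is that the two stages can only overlap if the call into C is a concurrent one -- spelled "safe" in GHC -- while the callbacks make it reentrant.)

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Control.Concurrent
    import Foreign
    import Foreign.C

    type ReadFn  = Ptr Word8 -> CInt -> IO CInt   -- fill the buffer, return byte count
    type WriteFn = Ptr Word8 -> CInt -> IO CInt   -- consume the buffer

    foreign import ccall "wrapper" mkReadFn  :: ReadFn  -> IO (FunPtr ReadFn)
    foreign import ccall "wrapper" mkWriteFn :: WriteFn -> IO (FunPtr WriteFn)

    -- hypothetical C entry point: compress(algorithm_name, read_cb, write_cb)
    foreign import ccall safe "compress"
      c_compress :: CString -> FunPtr ReadFn -> FunPtr WriteFn -> IO CInt

    runStage :: String -> ReadFn -> WriteFn -> IO ()
    runStage algo rd wr =
      withCString algo $ \calgo -> do
        rp <- mkReadFn rd
        wp <- mkWriteFn wr
        _  <- c_compress calgo rp wp
        freeHaskellFunPtr rp
        freeHaskellFunPtr wp

    main :: IO ()
    main = do
      done <- newEmptyMVar
      -- two compressor instances in parallel; in a real pipeline the second
      -- one's read callback would feed on the first one's output
      _ <- forkIO (runStage "words" (\_ _ -> return 0) (\_ n -> return n) >> putMVar done ())
      runStage "ppm" (\_ _ -> return 0) (\_ n -> return n)
      takeMVar done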

On 03-Apr-2006, John Meacham wrote:
On Mon, Apr 03, 2006 at 02:00:33PM -0400, Wolfgang Thaller wrote:
About how fast thread-local state really is:
__thread attribute on Linux: ~ 2 memory load instructions.
__declspec(thread) in MSVC++ on Windows: about the same.
pthread_getspecific on Mac OS X/x86 and Mac OS X/G5: ~10 instructions
pthread_getspecific on Linux and TlsGetValue on Windows: ~10-20 instructions
pthread_getspecific on Mac OS X/G4: a system call :-(.
how prevalent is support for __thread BTW? is it required by any standards or is it an ELFism?
It's a recent innovation that standards have not yet had time to catch up with. But __thread is the way of the future, IMHO. GNU C/C++ supports __thread for at least the following architectures: IA-32, x86-64, IA-64, SPARC (32- and 64-bit), Alpha, S390 (31- and 64-bit), SuperHitachi (SH), HP/PA 64-bit. It's also supported by Borland C++ builder, Sun Studio C/C++, and Visual C++. (VC++ does use a slightly different syntax, but you can use macros to work around that.)

--
Fergus J. Henderson          |  "I have always known that the pursuit
Galois Connections, Inc.     |   of excellence is a lethal habit"
Phone: +1 503 626 6616       |       -- the last words of T. S. Garp.

I think the following kinds of foreign calls wrt. concurrency are
sensible:
1. Other Haskell threads might get paused (but don't have to).
Examples: sqrt, qsort (we assume that qsort never needs a long time
between calls to the comparison function, so there is no need to
allow other threads, and it's more important to avoid context
switching overhead).
2. Other Haskell threads should run if possible, but this is not
strictly necessary if the implementation doesn't support that.
Examples: stat, computing md5 of a byte array (the call doesn't
block for an arbitrarily long time, so pausing other threads is
acceptable, but with slow hardware or on a multiprocessor it might
be preferable to allow for more concurrency).
3. Other Haskell threads must run.
Examples: wait, blocking read (a blocking call; not running Haskell
threads might lead to deadlocks).
2 is the same as 1 on some implementations, and the same as 3 on others.
3 is possible only in implementations which make use of OS threads.
A foreign call annotated with 3 would be an error in some implementations.
Old GHC, before bound threads, can't support 3. In GHC with bound threads
2 is equivalent to 3. In the SMP version even 1 allows some other threads
(but multiple threads doing calls of kind 1 might stop other threads).
1 or 2 are reasonable defaults.
Variant 1 has two subvariants: allowing callbacks or not. I don't know
whether it makes sense to differentiate this for other variants, i.e.
whether disallowing callbacks allows to generate faster code. Anyway,
if 2 is chosen as the default, I wish to be able to specify variant 1
sans callbacks with a single keyword, like 'unsafe' today, because
it's quite common.
I'm not sure whether 3 should be provided at all. Perhaps when
wrapping a foreign function it's generally not known whether full
concurrency is essential or not, so it's always better to block other
threads than to reject code.
Some functions need to use a different implementation when variant 2
blocks other threads. For example, instead of calling a blocking
function, the program might call some variant of it with a timeout,
and after the timeout other threads are given a chance to run.
This way the waiting thread uses its timeslices and wakes up the
processor every couple of milliseconds, but at least it works.
In my implementation of my language Kogut I simply don't block
the timer signal in this case.
So I propose to not provide 3, but instead provide a constant which
a program can use to discover whether 2 blocks other threads or not.
Using that constant it can either choose an alternative strategy,
or abort if it knows that 3 would be essential.
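(Something like the following, where the constant's name and the helper function are invented for illustration; the real name and where it would live are exactly what the standard would have to decide.)

    import Control.Concurrent (threadDelay)

    -- Would be implementation-defined: True when calls of kind 2 really do
    -- let other Haskell threads run, False when they block them.
    foreignCallsRunConcurrently :: Bool
    foreignCallsRunConcurrently = False

    -- Either make the single blocking foreign call, or fall back to polling
    -- with a timeout so other threads get a chance to run in between.
    waitReady :: IO () -> IO Bool -> IO ()
    waitReady blockingWait poll
      | foreignCallsRunConcurrently = blockingWait
      | otherwise                   = loop
      where
        loop = do
          ready <- poll
          if ready
            then return ()
            else threadDelay 5000 >> loop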
Wolfgang Thaller wrote:
1.) Assume thread A and B are running. Thread A makes a non-concurrent, reentrant call to Foreign Lands. The foreign function calls a foreign-exported Haskell function 'foo'. While 'foo' is executing, does thread B resume running?
Yes, when the scheduler chooses it.
2.) Assume the same situation as in 1, and assume that the answer to 1 is yes. While 'foo' is running, (Haskell) thread B makes a non-concurrent, reentrant foreign call. The foreign function calls back to the foreign-exported Haskell function 'bar'. Because the answer to 1 was yes, 'foo' will resume executing concurrently with 'bar'. If 'foo' finishes executing before 'bar' does, what will happen?
There are sensible implementations where the foreign code of thread A after calling 'foo' continues running, and is running alone, while thread B is paused until thread A either calls another Haskell function or returns to Haskell. Bound threads in GHC work like this I think. And there are sensible implementations where thread A is paused trying to return from 'foo', until 'bar' returns. This might lead to a deadlock if 'bar' will wait for something only thread A can do, but it's unavoidable if the implementation doesn't use OS threads at all. I think these implementations coincide with those which are unable to provide variant 3, so providing the constant I mentioned allows to distinguish these cases too.
3.) Same situation as in 1. When 'foo' is called, it forks (using forkIO) a Haskell thread C. How many threads are running now?
Three.
4.) Should there be any guarantee about (Haskell) threads not making any progress while another (Haskell) thread is executing a non-concurrent call?
No. In an implementation which runs every Haskell thread on its own OS thread, with a concurrent runtime, all foreign calls are actually concurrent and the modifiers have no effect.
5.) Assume that Haskell Programmer A writes a Haskell library that uses some foreign code with callbacks, like for example, the GLU Tesselator (comes with OpenGL), or, as a toy example, the C Standard Library's qsort function. Should Programmer A specify "concurrent reentrant" on his foreign import?
If the call can take a long time before entering Haskell, then it should be annotated with 2. Otherwise with 1. Sometimes it's impossible to tell beforehand whether the call will take a long time, e.g. getaddrinfo might return a cached answer immediately or wait for nameservers to respond. It's an unavoidable pity. Such a call can be annotated with 2, and the overhead will be wasted sometimes.

--
   __("<         Marcin Kowalczyk
   \__/        qrczak@knm.org.pl
    ^^      http://qrnik.knm.org.pl/~qrczak/
Participants (5): Bulat Ziganshin, Fergus Henderson, John Meacham, Marcin 'Qrczak' Kowalczyk, Wolfgang Thaller