Re: [Haskell-cafe] Re: Bound threads

Marcin Kowalczyk wrote:
Indeed, my brain is melting, but I did it :-)
Congratulations. How about we found a "Bound-thread-induced brain melt victims' support group"?
[...] I have added some optimizations:
I think we had thought of most of these optimizations, but things were already very complex, so I kept putting them off until we decided to do IO differently, at which point they were no longer necessary. Besides simplicity, one of the main reasons for moving our select() call from the run-time system to the libraries was to avoid the performance hit of calling select() every time through the scheduler loop rather than only once per IO operation.

Imagine having one or more (unbound) threads that spend most of their time waiting for IO, and a bunch of (also unbound) threads that do some computation. If select() is part of the scheduler loop, you will get a select() call whenever a thread-switch between the computation threads happens. If, on the other hand, the select() call happens in a separate OS thread, you will get extra inter-OS-thread messaging once the select wakes up, but that happens far less often than a thread-switch between the computation threads.
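For illustration, the workload in question looks roughly like this in GHC (a minimal sketch; threadWaitRead is the library-level wait that registers with the separate I/O machinery, and the pipe is only there to give the waiter a descriptor):

import Control.Concurrent
import System.Posix.IO (createPipe, fdWrite)

main :: IO ()
main = do
  (rd, wr) <- createPipe
  done <- newEmptyMVar
  -- an unbound thread that spends its time waiting for I/O;
  -- it parks in the library, not in the scheduler loop
  _ <- forkIO $ do
    threadWaitRead rd
    putMVar done ()
  -- unbound computation threads; switching between them involves
  -- no select() call
  mapM_ (\n -> forkIO (print (sum [1 .. n :: Integer]))) [100000, 200000]
  threadDelay 100000   -- let the computation threads run for a while
  _ <- fdWrite wr "x"  -- now the waiter becomes runnable
  takeMVar done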
Bound threads introduced problems. They can be partially solved: e.g. the worker pool, the wakeup pipe, and the epoll descriptor are correctly recreated in the child process. But there is simply no way to return from callbacks, because the corresponding C contexts no longer exist. So I settled on the following:
All threads except the thread performing the fork become unbound. [...]
What happens when fork is called from an unbound thread? Does it become bound in the child process? When does the child process terminate? Does the thread that called fork gain "main thread" status, so that the process will exit as soon as that thread exits?

For GHC, we side-stepped all those issues by only providing a simplified version of forkProcess:

System.Posix.forkProcess :: IO () -> IO System.Posix.Types.ProcessID

In the child process, only the IO action given as a parameter will run, and once it returns, the child process will terminate. This covers most use cases of fork and is, to the best of my knowledge, the most general version that can be implemented with the same semantics for both GHC's threaded RTS and GHC's non-threaded RTS.
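For completeness, a minimal usage sketch (in today's GHC the function lives in System.Posix.Process):

import System.Posix.Process (forkProcess, getProcessStatus)

main :: IO ()
main = do
  -- only the given action runs in the child; the child process
  -- terminates as soon as the action returns
  pid <- forkProcess (putStrLn "child: doing some work, then exiting")
  -- parent: block until the child terminates, then show its status
  status <- getProcessStatus True False pid
  print status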
I measured the speed of some syscalls on my system, to see what is worth optimizing:
- pthread_mutex_lock + unlock (NPTL) 0.1 us
pthread_mutex_* is not necessarily a syscall. When there is no contention, NPTL can do it entirely in user space, without a context switch; the kernel only gets involved when a thread actually has to be suspended waiting for a lock. The situation is the same on Mac OS X, but Microsoft's multithreading API is rumored to be a lot slower (kernel calls for every lock/unlock - I haven't checked this, though).
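(As a rough check, the figure quoted above can be reproduced from Haskell by calling NPTL directly through the FFI. A sketch only: the 64-byte buffer over-allocates the platform-dependent pthread_mutex_t, error handling is elided, and you may need to link with -lpthread.)

{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign (allocaBytes, Ptr, nullPtr)
import Foreign.C.Types (CInt(..))
import System.CPUTime (getCPUTime)
import Control.Monad (replicateM_)

foreign import ccall unsafe "pthread.h pthread_mutex_init"
  c_mutex_init :: Ptr () -> Ptr () -> IO CInt
foreign import ccall unsafe "pthread.h pthread_mutex_lock"
  c_mutex_lock :: Ptr () -> IO CInt
foreign import ccall unsafe "pthread.h pthread_mutex_unlock"
  c_mutex_unlock :: Ptr () -> IO CInt

main :: IO ()
main = allocaBytes 64 $ \m -> do   -- room for a pthread_mutex_t
  _ <- c_mutex_init m nullPtr
  let iters = 1000000 :: Int
  t0 <- getCPUTime
  replicateM_ iters (c_mutex_lock m >> c_mutex_unlock m)
  t1 <- getCPUTime
  -- getCPUTime is in picoseconds; print microseconds per lock+unlock pair
  print (fromIntegral (t1 - t0) / 1e6 / fromIntegral iters :: Double)

Cheers,

Wolfgang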

Wolfgang Thaller wrote:
Indeed, my brain is melting, but I did it :-)
Congratulations. How about we found a "Bound-thread-induced brain melt victims' support group"?
The melt was entertaining :-)
Besides simplicity, one of the main reasons for moving our select() call from the run-time system to the libraries was to avoid the performance hit of having to call select() every time through the scheduler loop rather than only once per IO operation.
I use epoll when available. It's Linux-specific and allows registering and unregistering descriptors separately from waiting. This not only saves time in the process setting up the descriptor array, but also kernel time spent scanning the array and hooking into the files. I've heard the BSD kqueue mechanism has similar properties.

I unregister descriptors from epoll "lazily": only when epoll has reported that data is available but no thread was in fact waiting for it. This saves repeated registration when a thread alternates between I/O and computation.

When the scheduler determines that it has no thread to wake up immediately, it performs a GC before going to wait, provided the program has already done roughly at least half of the work leading up to the next normal GC.
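For illustration, the register-separately-from-waiting pattern looks like this through the FFI (a minimal sketch, not the actual implementation; it assumes Linux/x86-64, where struct epoll_event is packed - 12 bytes, payload at offset 4 - and elides error handling):

{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign
import Foreign.C.Types

foreign import ccall unsafe "sys/epoll.h epoll_create"
  c_epoll_create :: CInt -> IO CInt
foreign import ccall unsafe "sys/epoll.h epoll_ctl"
  c_epoll_ctl :: CInt -> CInt -> CInt -> Ptr Word8 -> IO CInt
foreign import ccall safe "sys/epoll.h epoll_wait"
  c_epoll_wait :: CInt -> Ptr Word8 -> CInt -> CInt -> IO CInt

epollCtlAdd :: CInt
epollCtlAdd = 1

eventSize :: Int
eventSize = 12   -- packed struct epoll_event on x86-64

newEpoll :: IO CInt
newEpoll = c_epoll_create 1   -- size hint, ignored by modern kernels

-- Register once; the descriptor then stays hooked across many waits.
registerFd :: CInt -> CInt -> IO ()
registerFd ep fd = allocaBytes eventSize $ \ev -> do
  pokeByteOff ev 0 (1 :: Word32)                -- events = EPOLLIN
  pokeByteOff ev 4 (fromIntegral fd :: Word64)  -- data.u64 = fd
  _ <- c_epoll_ctl ep epollCtlAdd fd ev
  return ()

-- One wait; returns the descriptors that became ready. Unregistration
-- (EPOLL_CTL_DEL) happens lazily, only when a returned fd turns out to
-- have no waiter - so a thread alternating between I/O and computation
-- pays epoll_ctl once, not once per wait.
waitReady :: CInt -> Int -> Int -> IO [CInt]
waitReady ep maxEv timeoutMs =
  allocaBytes (maxEv * eventSize) $ \evs -> do
    n <- c_epoll_wait ep evs (fromIntegral maxEv) (fromIntegral timeoutMs)
    mapM (\i -> fmap fromIntegral
                     (peekByteOff evs (i * eventSize + 4) :: IO Word64))
         [0 .. fromIntegral n - 1]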
Imagine having one or more (unbound) threads that spend most of their time waiting for IO, and a bunch of (also unbound) threads that do some computation. If select() is part of the scheduler loop, you will get a select() call whenever a thread-switch between the computation threads happens.
Actually, only when the next thread in the "running and I/O" queue is an I/O thread, not in every scheduler iteration - or, more precisely, once per consecutive span of I/O threads in this queue. epoll_wait takes 0.2 us here, poll takes 1 us, select takes 0.6 us (1 descriptor in each case). I wonder why poll is slower than select.

I was thinking about integration with the gtk/glib event loop. There are two ways: either we ask glib to poll using a function supplied by us, or we perform polling using glib functions instead of raw epoll / poll / select. The first choice seems better, because otherwise callbacks registered with glib would be started from the scheduler, and this will not work - it's not even clear on behalf of which thread they should run.

In this case we must provide a function with the interface of poll(). Without additional support in the runtime (other than making file objects which don't close their underlying file, but this is easy), the function can be implemented by starting a thread for each descriptor, collecting the results, and cancelling the threads when some descriptor is ready or when the timeout expires (see the sketch below).

Let's assume that real poll is used by our scheduler and that no other thread does I/O at the moment, and see what really happens:
- the threads are created at the end of the run queue
- other threads in the program execute their time slices
- each of the newly created threads is marked as waiting for I/O
- other threads in the program execute again (ugh)
- the scheduler looks at the first I/O thread and makes a poll()
- all threads whose I/O is ready are woken up
- the next running thread is chosen (perhaps one of the threads woken up in the previous step)
- it notifies the manager thread that the glib-poll-emulation is ready
- when execution reaches the manager, it kills the other threads and reports the result to glib

It seems that, other than a bunch of context switches, there is not much work besides the required minimum. (It gets worse with epoll, which is suitable for a mostly unchanging set of watched descriptors.) With the GHC implementation, I think each thread which adds a descriptor will wake up the service thread through a pipe, and later they will wake it up again to unregister the files when they become cancelled.
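Here is what such an emulation could look like in Haskell terms (a sketch with a hypothetical pollEmulation; a real replacement would have to report all ready descriptors and their events through a poll()-shaped interface, not just the first one):

import Control.Concurrent
import Control.Concurrent.MVar
import System.Posix.Types (Fd)

-- One thread per descriptor; the first descriptor to become ready
-- (or the timeout) wins, and all remaining threads are cancelled.
pollEmulation :: [Fd] -> Int -> IO (Maybe Fd)   -- timeout in microseconds
pollEmulation fds timeoutUs = do
  result <- newEmptyMVar
  workers <- mapM (\fd -> forkIO $ do
                      threadWaitRead fd            -- park until fd is readable
                      _ <- tryPutMVar result (Just fd)
                      return ())
                  fds
  timer <- forkIO $ do
    threadDelay timeoutUs
    _ <- tryPutMVar result Nothing                 -- timeout expired
    return ()
  r <- takeMVar result
  mapM_ killThread (timer : workers)               -- cancel the losers
  return r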
All threads except the thread performing the fork become unbound. [...]
What happens when fork is called from an unbound thread? Does it become bound in the child process?
No. But in ForkProcess and ForkProcessKillThreads this thread plays the role of the main thread: it receives Unix signals; it receives internal asynchronous signals like heap overflow and deadlock; when it terminates, the process terminates; and if it terminates with an unhandled exception, a handler which normally prints the stack trace is called. AtExit handlers are not run, though. I don't know what should be done; this got quite hairy.

Actually, the above termination semantics currently applies only when fork is called with an I/O action as an argument. It can also be called without one, like C fork(), and in this case the behavior is ugly: if the new main thread in the child process was not the main thread before, its termination is not special and there will be a deadlock. This should be changed somehow.

In any case, when the previous main thread terminates (it is cancelled cleanly by ForkProcess), it checks whether it is still the main thread. If not, it disappears like normal threads do, except that its OS thread will wait on a condition variable forever.

--
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/