
Greetings,

Currently GHC supports two kinds of threads: those pinned to a specific capability (bound threads) and those it can migrate between any capabilities (unbound threads). For the purpose of achieving lower latency in Haskell applications it would be nice to have something in between: threads GHC can migrate, but only within a certain subset of capabilities.

I'm developing a program that contains several kinds of threads: those that do little work and are sensitive to latency, and those that can spend more CPU time and are less latency sensitive. I looked into several cases of increased latency in the sensitive threads (using the GHC eventlog), and in all cases the sensitive threads were waiting for non-sensitive threads to finish working. I was able to reduce worst-case latency by a factor of 10 by pinning all the threads in the program to specific capabilities, but manually distributing threads (60+ of them) between capabilities (on several different machines with different numbers of cores available) seems very fragile. World-stopping GC is still a problem, but at least in my case it is much less frequent.

It would be nice to be able to allow the GHC runtime to migrate a thread between a subset of capabilities, using an interface similar to this one:

    -- | Creates a thread that is allowed to migrate between capabilities
    -- according to the following rule: GHC is allowed to run this thread
    -- on the Nth capability if bit (N `mod` size_of_word) of the mask is set.
    forkOn' :: Int -> IO () -> IO ThreadId
    forkOn' mask act = undefined

This should allow defining up to 64 (32) distinct groups, and allow the user to break their threads down into a larger number of potentially intersecting groups by specifying things like: capability 0 does latency-sensitive things, caps 1..5 less sensitive things, caps 6..7 bulk things.

Anything obvious I'm missing? Any recommendations on how to implement this?
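For comparison, a user-space approximation is already expressible with the existing `forkOn`: pick one capability out of the mask at fork time. The sketch below (the helper names `allowedCaps` and `forkOnMask` are made up for illustration) pins the thread to a single capability from the set, which is exactly the manual-distribution workaround described above; it does not let the RTS migrate the thread within the set, which is the point of the proposal.

```haskell
import Control.Concurrent (ThreadId, forkOn, getNumCapabilities)
import Data.Bits (finiteBitSize, testBit)

-- Capabilities permitted by the mask, given the current capability count:
-- capability N is allowed if bit (N `mod` word_size) of the mask is set.
allowedCaps :: Int -> Int -> [Int]
allowedCaps mask n =
  [c | c <- [0 .. n - 1], testBit mask (c `mod` finiteBitSize mask)]

-- Fork onto the lowest-numbered allowed capability (capability 0 if the
-- mask matches nothing).  A real forkOn' would instead let the scheduler
-- migrate the thread freely among all of allowedCaps.
forkOnMask :: Int -> IO () -> IO ThreadId
forkOnMask mask act = do
  n <- getNumCapabilities
  forkOn (head (allowedCaps mask n ++ [0])) act
```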

On 10 September 2017 at 04:03, Michael Baikov wrote:
> Greetings
>
> Currently GHC supports two kinds of threads - pinned to a specific capability (bound threads) and those it can migrate between any capabilities (unbound threads). For purposes of achieving lower latency in Haskell applications it would be nice to have something in between - threads GHC can migrate but within a certain subset of capabilities only.
That's not correct actually: a bound thread is associated with a particular OS thread, but it can migrate between capabilities just like unbound threads.
> I'm developing a program that contains several kinds of threads - those that do little work and sensitive to latency and those that can spend more CPU time and less latency sensitive. [...]
If you have a fixed set of threads you might just want to use -N<threads> -qn<cores>, and then pin every thread to a different capability. This gives you 1:1 scheduling at the GHC level, delegating the scheduling job to the OS. You will also want to use nursery chunks with something like -n2m, so you don't waste too much nursery space on the idle capabilities.

Even if your set of threads isn't fixed, you might be able to use a hybrid scheme with -N<large> -qn<cores>: pin the high-priority threads each on their own capability, while putting all the low-priority threads on a single capability, or a few separate ones.

> It would be nice to be able to allow GHC runtime to migrate a thread between a subset of capabilities using interface similar to this one: [...]

We could do this, but it would add some complexity to the scheduler and load balancer (which has already been quite hard to get right; I fixed a handful of bugs there recently). I'd be happy to review a patch if you want to try it though.

Cheers
Simon
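Concretely, the hybrid scheme might be set up like this (run with e.g. `+RTS -N9 -qn8 -n2m -RTS`; the capability split, thread counts, and the worker bodies are illustrative placeholders, not the real application threads):

```haskell
import Control.Concurrent (forkOn, getNumCapabilities, threadDelay)
import Control.Monad (forM_)

main :: IO ()
main = do
  n <- getNumCapabilities
  -- One high-priority, latency-sensitive thread per reserved capability.
  forM_ [0 .. n - 2] $ \cap -> forkOn cap (sensitiveWorker cap)
  -- All low-priority bulk threads share the last capability; with -qn
  -- restricting the GC threads too, the OS rather than the GHC load
  -- balancer decides who runs when.
  forM_ [1 .. 60 :: Int] $ \i -> forkOn (n - 1) (bulkWorker i)
  threadDelay 100000  -- give the workers a moment before exiting
  putStrLn "workers started"

-- Placeholder workers standing in for the real application threads.
sensitiveWorker :: Int -> IO ()
sensitiveWorker _ = threadDelay 1000

bulkWorker :: Int -> IO ()
bulkWorker _ = threadDelay 1000
```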
_______________________________________________
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

> If you have a fixed set of threads you might just want to use -N<threads> -qn<cores>, and then pin every thread to a different capability. [...]
> Even if your set of threads isn't fixed you might be able to use a hybrid scheme with -N<large> -qn<cores> [...]

There are about 80 threads right now, and some of them are very short-lived. Most of them are low priority and require lots of CPU, which means having to distribute them manually over several capabilities - a process I'd like to avoid.

> We could do this, but it would add some complexity to the scheduler and load balancer (which has already been quite hard to get right, I fixed a handful of bugs there recently). I'd be happy to review a patch if you want to try it though.

I guess I'll start by studying the scheduler and load balancer in more detail. Thank you for your input, Simon!

Hi,
Here is a simple diagram of forkIO, forkOn and forkOS:
https://takenobu-hs.github.io/downloads/haskell_ghc_illustrated.pdf#page=69
Regards,
Takenobu
2017-09-11 21:54 GMT+09:00 Michael Baikov wrote:
> [earlier discussion quoted in full - snipped]

Hey Michael, greetings!

Here's a little side issue that may also be of interest to you in case you've got HyperThreading on:

https://ghc.haskell.org/trac/ghc/ticket/10229

Niklas

Hi Niklas,

This indeed looks interesting, and I think I saw behavior similar to this one. At the moment I'm working through the ghc-events code to get myself a better understanding of what is going on in the thread scheduler, and to get a tool that can handle the event stream incrementally; once I'm done with that, I'll see what can be done about that ticket.
On Sun, Oct 1, 2017 at 7:51 AM, Niklas Hambüchen wrote:
> Here's a little side issue that may also be of interest to you in case you've got HyperThreading on:
> https://ghc.haskell.org/trac/ghc/ticket/10229

Note that the (AFAIK unreleased) version of ghc-events on the master branch of the upstream repo can parse event streams incrementally, if that's what you meant.
--
Mathieu Boespflug
Founder at http://tweag.io.
On 1 October 2017 at 03:49, Michael Baikov wrote:
> [quoted text snipped]

On Sun, Oct 1, 2017 at 8:09 PM, Boespflug, Mathieu wrote:
> Note that the (AFAIK unreleased) version of ghc-events on the master branch of the upstream repo can parse event streams incrementally, if that's what you meant.

It can, but it has some problems. For one, the only thing the incremental parser can do is print the output; for everything else the old parser is used, and the output of the incremental parser is partially out of order due to the way event blocks are stored. Anyway, I already have my own version that does proper incremental parsing, provides an interface to a streaming library, and collects some info that wasn't available in the original version. Now it's mostly about shuffling stuff around, cleaning up, and testing.
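For reference, the pattern such incremental parsers build on is binary's incremental interface: `runGetIncremental` yields a `Decoder` that you push input chunks into as they arrive. A minimal, self-contained driver in that style (the `feedChunks` helper is a sketch, not ghc-events' actual API) looks like this:

```haskell
import qualified Data.ByteString as BS
import Data.Binary.Get (Decoder (..), Get, getWord32be, runGetIncremental)

-- Feed a list of chunks to an incremental parser, returning the result
-- or the parse error.  A real event-log consumer would pull chunks from
-- a file handle or socket instead of a list.
feedChunks :: Get a -> [BS.ByteString] -> Either String a
feedChunks g = go (runGetIncremental g)
  where
    go (Done _ _ x) _     = Right x
    go (Fail _ _ e) _     = Left e
    go (Partial k) []     = go (k Nothing) []    -- signal end of input
    go (Partial k) (c:cs) = go (k (Just c)) cs   -- push the next chunk

-- Example: a big-endian Word32 split across two chunks still parses.
main :: IO ()
main = print (feedChunks getWord32be [BS.pack [0, 0], BS.pack [1, 0]])
```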

You might want to have a look at https://github.com/mboes/ghc-events/tree/streaming. Similar to what you mention, it uses the "streaming" package for the incremental parsing. I ran into an issue with binary that I wasn't able to track down: even medium-sized buffers make the parsing slower (significantly so) rather than faster (my suspicion: something somewhere that should be constant time is actually linear).
--
Mathieu Boespflug
Founder at http://tweag.io.
On 1 October 2017 at 14:34, Michael Baikov wrote:
> [quoted text snipped]

Hmmm... I'll take a look, but from what I see it uses the same code as ghc-events for decoding, and all the streaming is done in a short single commit, so it must suffer from the same bug. It is not a single stream of events; it's several of them, one per capability, mixed together due to caching done by the RTS, so you need to decode several streams at once and merge the results.
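Once each per-capability stream is decoded in timestamp order, the merge step itself could look something like this hypothetical helper (assuming each input list is already sorted by the key, e.g. the event timestamp):

```haskell
-- K-way merge of per-capability event lists, each already sorted by the
-- key (e.g. the event timestamp), into one globally ordered stream.
-- Lazy, so it can run over streams as they are decoded.
mergeOn :: Ord k => (a -> k) -> [[a]] -> [a]
mergeOn key = foldr merge2 []
  where
    merge2 xs [] = xs
    merge2 [] ys = ys
    merge2 (x:xs) (y:ys)
      | key x <= key y = x : merge2 xs (y : ys)
      | otherwise      = y : merge2 (x : xs) ys
```

The pairwise `foldr` is O(n·k) in the number of streams; a heap-based merge would scale better for many capabilities, but this shows the idea.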
On Oct 1, 2017 20:43, "Boespflug, Mathieu" wrote:
> [quoted text snipped]

"Boespflug, Mathieu" wrote:
> I ran into an issue with binary that I wasn't able to track down: even medium sized buffers make the parsing slower (significantly so) rather than faster (my suspicion: something somewhere that should be constant time is actually linear).

Indeed, there was a rather terrible bug, potentially leading to unexpected asymptotic performance issues, present in `binary` versions prior to 0.8.4 IIRC. See https://github.com/kolmodin/binary/pull/115. Perhaps this is what you are hitting?

Cheers,
- Ben
participants (6)
- Ben Gamari
- Boespflug, Mathieu
- Michael Baikov
- Niklas Hambüchen
- Simon Marlow
- Takenobu Tani