Runtime performance degradation for multi-threaded C FFI callback

I posted this issue on StackOverflow today. A brief recap:
When C code calls back into a Haskell function through the FFI, I observe a sharp increase in total time once multi-threading is enabled in the C code, even though the total number of calls into Haskell stays the same. In my test, I called a Haskell function 5M times in two scenarios (GHC 7.0.4, RHEL5, 12-core box):
- Single-threaded C function: calls back the Haskell function 5M times. Total time: 1.32s.
- 5 threads in the C function: each thread calls back the Haskell function 1M times, so the total is still 5M. Total time: 7.79s.
I verified that pthread didn't contribute much to the overhead by having the same code call a C function instead and comparing with the single-threaded version. So almost all of the increase in overhead seems to come from the GHC runtime.
What I want to ask is whether this is a known issue for the GHC runtime. If not, I will file a bug report with the GHC team, with code to reproduce it. I don't want to file a duplicate bug report if this is already a known issue. I searched the GHC Trac using some keywords but didn't see any bugs related to it.
StackOverflow post link (has code and details on how to reproduce the issue): http://stackoverflow.com/questions/8902568/runtime-performance-degradation-f...

Hmm, this kind of sounds like GHC is assuming that it has control over all of the threads, and when this assumption fails bad things happen. (We use lightweight threads, and use the operating system threads that map to pthreads sparingly.) I'm sure Simon Marlow could give a more accurate assessment, however.
Edward
Excerpts from Sanket Agrawal's message of Tue Jan 17 23:31:38 -0500 2012:
> What I want to ask is if this is a known issue for GHC runtime?

Hello Sanket,
What happens if you run this experiment with 5 threads in the C function, and have GHC run RTS with -N7? (e.g. five C threads + seven GHC threads = 12 threads on your 12-core box.)
Edward

Hi Edward,
I was just going to get back to you about it. I did find out that the issue
was indeed one GHC thread dealing with 5 C threads for callbacks (1:5
mapping), so the C threads were blocking on the callback, waiting for the only
GHC thread to become available. I updated the code to do a 1:1 mapping: 5 GHC
threads for 5 C threads. That proved to be almost linearly scalable.
John Latos suggested the above approach two days back, but I didn't get to
test the idea until now.
It doesn't seem to matter whether the number of GHC threads is increased if
the mapping between GHC threads and C threads is not 1:1. I got the 1:1 mapping
by doing a forkIO for each C thread. Is it really possible to do a 7:5 mapping
(that is, 7 GHC threads to choose from for 5 C threads during callback)? I
can't think of a way to do it. Not that I need it; I am just curious whether
that is possible.
Thanks,
Sanket
On Fri, Jan 20, 2012 at 11:16 PM, Edward Z. Yang wrote:
> Hello Sanket,
> What happens if you run this experiment with 5 threads in the C function, and have GHC run RTS with -N7? (e.g. five C threads + seven GHC threads = 12 threads on your 12-core box.)
> Edward

On 21/01/2012 15:35, Sanket Agrawal wrote:
> Hi Edward,
> I was just going to get back to you about it. I did find out that the issue was indeed one GHC thread dealing with 5 C threads for callback (1:5 mapping) - so, the C threads were blocking on callback waiting for the only GHC thread to be available. I updated the code to do 1:1 mapping - 5 GHC threads for 5 C threads. That proved to be almost linearly scalable.
This is almost right, except that your callbacks are not waiting for a GHC *thread*, but what we call a "capability", which is roughly speaking "permission to execute Haskell code". The +RTS -N option chooses the number of capabilities. I expect that with -N1, your program is spending a lot of time just switching between the different OS threads.
It's possible that we could make the runtime more flexible here. I recently made it possible to modify the number of capabilities at runtime, so it's conceivable that the runtime could automatically add capabilities if it is being called from multiple OS threads.
> John Latos suggested the above approach two days back, but I didn't get to test the idea until now.
> It doesn't seem to matter whether number of GHC threads are increased, if the mapping between GHC threads and C threads is not 1:1. I got 1:1 mapping by doing forkIO for each C thread. Is it really possible to do 7:5 mapping (that is 7 GHC threads to choose from, for 5 C threads during callback)? I can't think of a way to do it. Not that I need it. I am just curious if that is possible.
Just think of +RTS -N7 as being 7 *locks*, not 7 threads. Then it makes perfect sense to have 7 locks available for 5 threads.
Cheers,
Simon