New subject: Runtime performance degradation for multi-threaded C FFI callback

23 Jan 2012

Hi Simon,

I'm not certain that your explanation matches what I observed.

All of my tests were done on a 4-core machine, executing with "+RTS
-N", which should be the same as "+RTS -N4" I believe.

With 1 Haskell thread (the main thread) and 4 process threads (via
pthreads), I saw a significant performance degradation compared to 5
Haskell threads (main + 4 via forkIO) and 4 process threads.  As I
understand your explanation, if C callbacks are scheduled according to
available capabilities, there should be no difference between these
situations.

I observed this with GHC 7.2.1; however, Daniel Fischer reported that,
with GHC 7.2.2, he observed different behavior (which matches your
explanation AFAICT).  Is it possible that the scheduling of callbacks
into Haskell changed between those versions?

Thanks,
John L.
...
From: Simon Marlow 
Subject: Re: Runtime performance degradation for multi-threaded C FFI
       callback
To: Sanket Agrawal 
Cc: glasgow-haskell-users 
Message-ID: <4F1D2F4D.9050709@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
On 21/01/2012 15:35, Sanket Agrawal wrote:
...
> Hi Edward,
> I was just going to get back to you about it. I did find out that the
> issue was indeed one GHC thread dealing with 5 C threads for callback
> (1:5 mapping) - so, the C threads were blocking on callback waiting for
> the only GHC thread to be available. I updated the code to do 1:1
> mapping - 5 GHC threads for 5 C threads. That proved to be almost
> linearly scalable.

This is almost right, except that your callbacks are not waiting for a
GHC *thread*, but what we call a "capability", which is roughly speaking
"permission to execute Haskell code".  The +RTS -N option chooses the
number of capabilities.

I expect that with -N1, your program is spending a lot of time just
switching between the different OS threads.

It's possible that we could make the runtime more flexible here.  I
recently made it possible to modify the number of capabilities at
runtime, so it's conceivable that the runtime could automatically add
capabilities if it is being called from multiple OS threads.
...
> John Lato suggested the above approach two days back, but I didn't get
> to test the idea until now.
> It doesn't seem to matter whether the number of GHC threads is increased,
> if the mapping between GHC threads and C threads is not 1:1. I got 1:1
> mapping by doing forkIO for each C thread. Is it really possible to do
> 7:5 mapping (that is 7 GHC threads to choose from, for 5 C threads
> during callback)? I can't think of a way to do it. Not that I need it. I
> am just curious if that is possible.

Just think of +RTS -N7 as being 7 *locks*, not 7 threads.  Then it makes
perfect sense to have 7 locks available for 5 threads.
Cheers,
       Simon
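
The lock analogy above can be modelled in plain C with pthreads. This is
only a sketch of the idea, not how the GHC runtime is implemented: it
assumes each capability behaves like one permit in a counting semaphore,
and every name in it (cap_pool, acquire_cap, release_cap, run_callbacks)
is hypothetical. It records how many threads held a permit at the same
time, which is the quantity that -N bounds in the explanation above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical model of the RTS: N capabilities as N permits. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  available;
    int  free_caps;   /* permits not currently held */
    int  running;     /* threads currently "executing Haskell" */
    int  peak;        /* highest value of running seen so far */
    long completed;   /* callbacks that have finished */
} cap_pool;

typedef struct { cap_pool *pool; int calls; } worker_arg;
typedef struct { int peak; long completed; } run_result;

/* Block until a capability is free, then take it. */
static void acquire_cap(cap_pool *p) {
    pthread_mutex_lock(&p->lock);
    while (p->free_caps == 0)
        pthread_cond_wait(&p->available, &p->lock);
    p->free_caps--;
    p->running++;
    if (p->running > p->peak) p->peak = p->running;
    pthread_mutex_unlock(&p->lock);
}

/* Hand the capability back and wake one waiting thread. */
static void release_cap(cap_pool *p) {
    pthread_mutex_lock(&p->lock);
    p->free_caps++;
    p->running--;
    p->completed++;
    pthread_cond_signal(&p->available);
    pthread_mutex_unlock(&p->lock);
}

/* One C thread making `calls` callbacks "into Haskell". */
static void *worker(void *raw) {
    worker_arg *arg = raw;
    for (int i = 0; i < arg->calls; i++) {
        acquire_cap(arg->pool);
        /* the callback body would run here */
        release_cap(arg->pool);
    }
    return NULL;
}

/* Run `threads` C threads against `capabilities` permits. */
run_result run_callbacks(int capabilities, int threads, int calls) {
    cap_pool pool;
    pthread_mutex_init(&pool.lock, NULL);
    pthread_cond_init(&pool.available, NULL);
    pool.free_caps = capabilities;
    pool.running = 0;
    pool.peak = 0;
    pool.completed = 0;

    worker_arg arg = { &pool, calls };
    pthread_t *tids = malloc((size_t)threads * sizeof *tids);
    assert(tids != NULL);
    for (int i = 0; i < threads; i++) {
        int rc = pthread_create(&tids[i], NULL, worker, &arg);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < threads; i++)
        pthread_join(tids[i], NULL);

    run_result result = { pool.peak, pool.completed };
    free(tids);
    pthread_cond_destroy(&pool.available);
    pthread_mutex_destroy(&pool.lock);
    return result;
}
```

Calling run_callbacks(1, 5, 1000) models five C threads under -N1: the
peak can never exceed 1, so the threads take turns. With
run_callbacks(7, 5, 1000) no thread ever waits for a permit, and the
peak is bounded by the five threads rather than by the seven permits.
Build with -pthread.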
