What version of GHC is this?  I vaguely remember fixing something like this.

The rule of thumb is: if you think it is a bug then report it, and we'll investigate further.

Simon, this is GHC 7.4.1. Yes, you fixed bug #4262 ("GHC's runtime never terminates worker threads"). I have filed bug report #5897, with code to reproduce it.

This bug seems to be caused by the MVar callback from the C FFI. If I remove the MVar callback, the number of workers stays constant. However, it happens only if the C FFI thread count exceeds a threshold, 6 in my case. Also, I can consistently crash the code with a segmentation fault or bus error on Mac if I increase the number of C FFI threads. The crash happens on Linux too, but not as often.

In my opinion this is a significant bug, because the MVar callback is important for coordination between GHC threads and C FFI threads. I can work around it for now by keeping the number of C FFI threads below the threshold that triggers the bug. I suspect this bug has been in GHC all along but wasn't discovered until now, because it happens only if the C FFI thread count crosses a threshold and an MVar callback is involved.
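
For readers unfamiliar with the pattern, here is a minimal sketch of an MVar callback handed across the FFI. It is not the code from the bug report. The names mkCallback, invokeCallback and signalViaCallback are made up for illustration, and the "dynamic" import stands in for the C producer so that the example needs no C source. In the real program the C threads would call the function pointer from their own OS threads, which needs the threaded runtime.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Foreign.Ptr (FunPtr, freeHaskellFunPtr)

-- Turn a Haskell IO action into a C-callable function pointer.
foreign import ccall "wrapper"
  mkCallback :: IO () -> IO (FunPtr (IO ()))

-- Stand-in for the C side (an assumption made for this sketch): call a
-- function pointer the way C code would. The call is "safe" by default,
-- which is required because it re-enters Haskell.
foreign import ccall "dynamic"
  invokeCallback :: FunPtr (IO ()) -> IO ()

-- Hypothetical helper: build a callback that fills an MVar, pass it
-- through the FFI, then block until the callback has signalled.
signalViaCallback :: IO Bool
signalViaCallback = do
  done <- newEmptyMVar
  cb <- mkCallback (putMVar done True)
  invokeCallback cb          -- the "C side" fires the callback
  result <- takeMVar done    -- the Haskell side waits on the MVar
  freeHaskellFunPtr cb       -- wrapper pointers must be freed explicitly
  pure result
```

Calling signalViaCallback from main or GHCi should return once the callback has filled the MVar.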

 

Cheers,
       Simon



On Sat, Feb 25, 2012 at 3:41 PM, Sanket Agrawal
<sanket.agrawal@gmail.com> wrote:

   On further investigation, it seems to be very specific to Mac OS
   Lion (I am running 10.7.3) - all tests were with -N3 option:

   - I can reliably crash the code with a seg fault or bus error if I
   create more than 8 threads in C FFI (each thread creates its own
   mutex, for 1-1 coordination with the Haskell timer thread). My iMac
   has 4 processors. In gdb, I can see that the crash happened
   in __psynch_cvsignal (), which seems to be related to pthread mutexes.

   - If I increase the number of C FFI threads (and hence, pthread
   mutexes) to >=7, the number of tasks starts increasing. 8 is the max
   number of FFI threads in my testing where the code runs without
   crashing. But there seems to be some kind of pthread-mutex-related
   leak. The timer thread forks 8 parallel Haskell threads, one to
   acquire the mutex of each C FFI thread. Although each function
   returns after acquiring the mutex, collecting data, and releasing
   the mutex, some of the threads seem to remain marked as active by
   the GC because of the mutex memory leak. Exactly how, I don't know.

   - If I keep the number of C FFI threads to <=6, there is no memory
   leak. The number of tasks stays steady.

   So, it seems to be a pthread library issue (and not a GHC issue).
   Something to keep in mind when developing code on Mac that involves
   mutex coordination with C FFI.
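
The fan-out described above (one Haskell thread per C producer, each acquiring, collecting and releasing) might look roughly like the sketch below. This is an illustration under assumptions, not the original code. collectAll is a made-up name, and each IO action stands in for the FFI call that takes one producer's mutex, copies its data, and releases the mutex. If those were blocking "safe" foreign calls, the threaded RTS may create extra worker OS threads to keep Haskell code running, which is one place the "Task" lines in the -s output can come from.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Run every action in its own Haskell thread and wait for all results,
-- preserving the order of the input list.
-- Caveat: if an action throws, its slot is never filled and the
-- corresponding takeMVar will not return normally.
collectAll :: [IO a] -> IO [a]
collectAll actions = do
  slots <- mapM spawn actions   -- fork one thread per producer
  mapM takeMVar slots           -- block until each thread has reported
  where
    spawn act = do
      slot <- newEmptyMVar
      _ <- forkIO (act >>= putMVar slot)
      pure slot
```

For example, collectAll (map readProducer [0..7]) would gather one result per producer, where readProducer is whatever (hypothetical) FFI wrapper reads a single C thread's data.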


   On Sat, Feb 25, 2012 at 2:59 PM, Sanket Agrawal
   <sanket.agrawal@gmail.com> wrote:

       I wrote a program that uses a timed thread to collect data from
       a C producer (using FFI). The number of threads in the C producer
       is fixed (they are created at init). One Haskell timer thread uses
       threadDelay to run itself at a timed interval. When I look at the
       RTS output after killing the program after a couple of timer
       iterations, I see the number of worker tasks increasing with time.

       For example, below is the output after 20 iterations of the timer
       event:

                              MUT time (elapsed)       GC time  (elapsed)
          Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
          Task  1 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
          ....... output until task 37 snipped as it is same as task 1 .......
          Task 38 (worker) :    0.07s    (  0.09s)       0.00s    (  0.00s)
          Task 39 (worker) :    0.07s    (  0.09s)       0.00s    (  0.00s)
          Task 40 (worker) :    0.18s    ( 10.20s)       0.00s    (  0.00s)
          Task 41 (worker) :    0.18s    ( 10.20s)       0.00s    (  0.00s)
          Task 42 (worker) :    0.18s    ( 10.20s)       0.00s    (  0.00s)
          Task 43 (worker) :    0.18s    ( 10.20s)       0.00s    (  0.00s)
          Task 44 (worker) :    0.52s    ( 10.74s)       0.00s    (  0.00s)
          Task 45 (worker) :    0.52s    ( 10.75s)       0.00s    (  0.00s)
          Task 46 (worker) :    0.52s    ( 10.75s)       0.00s    (  0.00s)
          Task 47 (bound)  :    0.00s    (  0.00s)       0.00s    (  0.00s)


       After two iterations of timer event:

                               MUT time (elapsed)       GC time  (elapsed)
          Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
          Task  1 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
          Task  2 (worker) :    0.07s    (  0.09s)       0.00s    (  0.00s)
          Task  3 (worker) :    0.07s    (  0.09s)       0.00s    (  0.00s)
          Task  4 (worker) :    0.16s    (  1.21s)       0.00s    (  0.00s)
          Task  5 (worker) :    0.16s    (  1.21s)       0.00s    (  0.00s)
          Task  6 (worker) :    0.16s    (  1.21s)       0.00s    (  0.00s)
          Task  7 (worker) :    0.16s    (  1.21s)       0.00s    (  0.00s)
          Task  8 (worker) :    0.48s    (  1.80s)       0.00s    (  0.00s)
          Task  9 (worker) :    0.48s    (  1.81s)       0.00s    (  0.00s)
          Task 10 (worker) :    0.48s    (  1.81s)       0.00s    (  0.00s)
          Task 11 (bound)  :    0.00s    (  0.00s)       0.00s    (  0.00s)


       The Haskell code has one forkIO call to kick off the C FFI; the C
       FFI creates 8 threads. Runtime options are "-N3 +RTS -s". The timer
       event is kicked off after the forkIO. It is of the form (pseudo-code):

       timerevent <other arguments> time = run
         where
           run = threadDelay time >> <do some work> >> run
           <other variables defined for run function>
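
A runnable version of that timer loop could look like the sketch below. It only mirrors the shape of the pseudo-code: the "other arguments" are left out, and countTicks is a hypothetical helper added purely so the loop can be exercised and stopped.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Monad (forever)
import Data.IORef (atomicModifyIORef', newIORef, readIORef)

-- Timer loop in the shape of the pseudo-code: sleep, do the work, repeat.
-- `interval` is in microseconds, as threadDelay expects.
timerEvent :: Int -> IO () -> IO a
timerEvent interval work = forever (threadDelay interval >> work)

-- Hypothetical demo helper: run the timer in its own thread for
-- `duration` microseconds, counting iterations, then kill the thread
-- and return the count.
countTicks :: Int -> Int -> IO Int
countTicks interval duration = do
  ticks <- newIORef (0 :: Int)
  tid <- forkIO (timerEvent interval (atomicModifyIORef' ticks (\n -> (n + 1, ()))))
  threadDelay duration
  killThread tid
  readIORef ticks
```

Note that threadDelay only guarantees a minimum delay, so the number of iterations in a given period is approximate.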

       I also wrote a simpler program using just the timer event (fork
       one timer event, and run another timer event after that), but
       didn't see any tasks in the RTS output.

       I tried searching the GHC site for documentation on the RTS
       output, but didn't find anything that could help me debug the
       above issue. I suspect that the timer event is the root cause of
       the increasing number of tasks (with all but the last 9 tasks
       idle; I guess 8 tasks belong to the C FFI and one task to the
       timerevent thread), and hence of the memory leak.

       I would appreciate pointers on how to debug it. The timerevent
       does forkIO a call to send the collected data from the C FFI to
       a db server, but disabling that fork still results in an
       increasing number of tasks. So it seems strongly correlated with
       the timer event, though I am unable to reproduce it with a
       simpler version of the timer event (which removes the MVar
       sync/callback from the C FFI).




