BlockedIndefinitelyOnMVar exception

Hi,
I have a very big and highly threaded program that generates a BlockedIndefinitelyOnMVar exception when run. I have spent a reasonable amount of time poring over the source code, as has Max Bolingbroke. Neither of us has the slightest idea why it raises the exception.
Some questions:
* Does anyone know the exact sequence of actions that causes this exception to be thrown? I couldn't find it written down.
* How confident are people that this exception does really mean that it is in a blocked state? Is there any chance the error could be raised incorrectly?
* Any debugging tips for this problem?
Thanks, Neil

Hi,
I have a very big and highly threaded program that generates a BlockedIndefinitelyOnMVar exception when run. I have spent a reasonable amount of time poring over the source code, as has Max Bolingbroke. Neither of us has the slightest idea why it raises the exception.
Some questions:
* Does anyone know the exact sequence of actions that causes this exception to be thrown? I couldn't find it written down.
* How confident are people that this exception does really mean that it is in a blocked state? Is there any chance the error could be raised incorrectly?
* Any debugging tips for this problem?
My understanding was that this error occurred when one thread was blocked, waiting on an MVar, and no other thread in the program has a reference to that MVar (this can be detected during GC). Ergo, the blocked thread will end up waiting forever because no-one can ever wake it up again.
Whenever I have had this error (or its STM equivalent) I think it was always telling the truth. I seem to remember it was often a symptom of another thread terminating unexpectedly. So if thread A is blocked on an MVar that it is expecting thread B to write to, then thread B terminating can cause this error to arise in thread A, even though the real problem is in thread B.
Do you actually use MVars in your program directly, or are they being used via a library? And do you at least know which thread is throwing this exception? It should be catchable, so you can probably wrap the arguments to your forkIO calls with a catcher that indicates which thread blew up.
Thanks, Neil.
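As an illustration of the catcher suggested above, here is a minimal sketch. The helper name `labelled` and the label string are hypothetical, and the worker simply fails with an `IOError` so the example terminates deterministically; in a real program the reported exception could be BlockedIndefinitelyOnMVar.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, catch, finally)
import System.IO (hPutStrLn, stderr)

-- Hypothetical helper: run an action and, if any exception escapes it,
-- report the label of the thread it was running in.
labelled :: String -> IO () -> IO ()
labelled label action = action `catch` report
  where
    report :: SomeException -> IO ()
    report e = hPutStrLn stderr ("thread " ++ label ++ " died: " ++ show e)

main :: IO ()
main = do
  done <- newEmptyMVar
  -- Wrap the argument to forkIO; signal completion after the report is written.
  _ <- forkIO (labelled "worker-1" (ioError (userError "boom")) `finally` putMVar done ())
  takeMVar done
```

Note that this sketch swallows the exception after logging it; rethrowing from the handler would be the choice if the thread's failure should still propagate.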

My understanding was that this error occurred when one thread was blocked, waiting on an MVar, and no other thread in the program has a reference to that MVar (this can be detected during GC). Ergo, the blocked thread will end up waiting forever because no-one can ever wake it up again.
That certainly seems a sensible rule - I'll see if that can help me debug my problem.
Do you actually use MVars in your program directly, or are they being used via a library? And do you at least know which thread is throwing this exception? It should be catchable, so you can probably wrap the arguments to your forkIO calls with a catcher that indicates which thread blew up.
I use MVars directly, use Chan/QSem, and have about 5 concurrency data types built on top of MVars - they're everywhere. I also have a thread pool structure, so tasks move between threads regularly - knowing which thread got blocked isn't very interesting.
Thanks for the information, Neil

On 26/06/10 12:28, Neil Mitchell wrote:
I have a very big and highly threaded program that generates a BlockedIndefinitelyOnMVar exception when run. I have spent a reasonable amount of time poring over the source code, as has Max Bolingbroke. Neither of us has the slightest idea why it raises the exception.
Some questions:
* Does anyone know the exact sequence of actions that causes this exception to be thrown? I couldn't find it written down.
Sure - it means the garbage collector found that the thread was blocked on an MVar that is otherwise unreachable, and hence the thread could never be awoken.
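A minimal sketch of the situation described above: the only reference to the MVar lives in the thread that is blocked on it, so once the runtime runs a collection it can conclude that the thread will never be woken. Whether and when the exception is delivered depends on the RTS performing that GC, so treat this as an illustration rather than a guaranteed trace.

```haskell
import Control.Concurrent.MVar (MVar, newEmptyMVar, takeMVar)
import Control.Exception (BlockedIndefinitelyOnMVar (..), try)

main :: IO ()
main = do
  -- Nobody else holds a reference to mv, so nobody can ever fill it.
  mv <- newEmptyMVar :: IO (MVar ())
  r <- try (takeMVar mv)
  case r of
    Left BlockedIndefinitelyOnMVar -> putStrLn "caught BlockedIndefinitelyOnMVar"
    Right ()                       -> putStrLn "unexpectedly received a value"
```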
* How confident are people that this exception does really mean that it is in a blocked state? Is there any chance the error could be raised incorrectly?
There have been one or two bugs in the past that could lead to this exception being raised incorrectly, but I'm not aware of any right now. It's not inconceivable of course.
* Any debugging tips for this problem?
I'd use the event log: compile with -debug, run with +RTS -Ds -l, and dump the event log with show-ghc-events (cabal install ghc-events). Or just dump it to stderr with +RTS -Ds, if the log isn't too large. Use GHC.Exts.traceEvent to add your own events to the trace. Cheers, Simon
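A small sketch of adding custom events to the trace. It uses `Debug.Trace.traceEventIO` from base, which is an assumption on my part as a stand-in for the `GHC.Exts.traceEvent` named above; the event strings are made up for illustration. The events only show up when the program is run with event logging enabled.

```haskell
import Control.Concurrent.MVar (modifyMVar_, newMVar, readMVar)
import Debug.Trace (traceEventIO)

main :: IO ()
main = do
  counter <- newMVar (0 :: Int)
  -- Bracket the MVar operation with user events so it can be located in the log.
  traceEventIO "before modifyMVar_ counter"
  modifyMVar_ counter (pure . (+ 1))
  traceEventIO "after modifyMVar_ counter"
  readMVar counter >>= print
```

Placing an event on either side of each blocking operation makes it easier to see, in the dumped log, which operation a thread entered and never left.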

Hi Simon, Thanks for the excellent information. I've now debugged my problem, and think I've got the last of the MVar blocking problems out.
* How confident are people that this exception does really mean that it is in a blocked state? Is there any chance the error could be raised incorrectly?
There have been one or two bugs in the past that could lead to this exception being raised incorrectly, but I'm not aware of any right now. It's not inconceivable of course.
I have no reason to think it's broken. I found at least 3 separate concurrency bugs in various parts (one added the day before, one over a year old, one of which had been introduced while trying to work around the MVar problem). My suspicion for the root cause of the problem is that Concurrent.Chan is incorrect. In the course of debugging this problem we found 2 bugs in Chan, and while I never tracked down any other bugs in Chan, I no longer trust it. By rewriting parts of the program, including avoiding Chan, the bugs disappeared. I don't think I'll be using Chan again until after someone has proven it correct.
* Any debugging tips for this problem?
I'd use the event log: compile with -debug, run with +RTS -Ds -l, and dump the event log with show-ghc-events (cabal install ghc-events). Or just dump it to stderr with +RTS -Ds, if the log isn't too large. Use GHC.Exts.traceEvent to add your own events to the trace.
The event log is fantastic! Thanks, Neil

On 01/07/2010 21:10, Neil Mitchell wrote:
Hi Simon,
Thanks for the excellent information. I've now debugged my problem, and think I've got the last of the MVar blocking problems out.
* How confident are people that this exception does really mean that it is in a blocked state? Is there any chance the error could be raised incorrectly?
There have been one or two bugs in the past that could lead to this exception being raised incorrectly, but I'm not aware of any right now. It's not inconceivable of course.
I have no reason to think it's broken. I found at least 3 separate concurrency bugs in various parts (one added the day before, one over a year old, one of which had been introduced while trying to work around the MVar problem).
My suspicion for the root cause of the problem is that Concurrent.Chan is incorrect. In the course of debugging this problem we found 2 bugs in Chan, and while I never tracked down any other bugs in Chan, I no longer trust it. By rewriting parts of the program, including avoiding Chan, the bugs disappeared. I don't think I'll be using Chan again until after someone has proven it correct.
Considering Chan is <150 lines of code and has been around for many years, that's amazing! Did you report the bugs? Is it anything to do with asynchronous exceptions?
You should have more luck with Control.Concurrent.STM.TChan, incidentally. It's much easier to get right, and when we benchmarked it, performance was about the same (all those withMVar/modifyMVars in Chan are quite expensive), plus you get to compose it easily: reading from either of 2 TChans is trivial.
Cheers, Simon
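To illustrate the composition point, here is a minimal sketch of reading from either of two TChans using `orElse`. It assumes the stm package is available; the helper name `readEither` is hypothetical.

```haskell
import Control.Concurrent.STM (STM, atomically, orElse)
import Control.Concurrent.STM.TChan (TChan, newTChanIO, readTChan, writeTChan)

-- Hypothetical helper: take an item from whichever channel has one,
-- preferring the first channel when both are non-empty.
readEither :: TChan a -> TChan b -> STM (Either a b)
readEither ca cb = (Left <$> readTChan ca) `orElse` (Right <$> readTChan cb)

main :: IO ()
main = do
  ca <- newTChanIO :: IO (TChan Int)
  cb <- newTChanIO :: IO (TChan String)
  atomically (writeTChan cb "hello")
  r <- atomically (readEither ca cb)
  print r  -- prints: Right "hello"
```

If both channels are empty, the whole transaction retries and the calling thread blocks until either one receives an item.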

Hi Simon,
My suspicion for the root cause of the problem is that Concurrent.Chan is incorrect. In the course of debugging this problem we found 2 bugs in Chan, and while I never tracked down any other bugs in Chan, I no longer trust it. By rewriting parts of the program, including avoiding Chan, the bugs disappeared. I don't think I'll be using Chan again until after someone has proven it correct.
Considering Chan is <150 lines of code and has been around for many years, that's amazing! Did you report the bugs? Is it anything to do with asynchronous exceptions?
Nothing to do with async exceptions. I found:
http://hackage.haskell.org/trac/ghc/ticket/4154
http://hackage.haskell.org/trac/ghc/ticket/3527
Of course, there's also the async exceptions bug still around:
http://hackage.haskell.org/trac/ghc/ticket/3160
However, even after having a program with no async exceptions (I never used them), and eliminating unGetChan and isEmptyChan, I still got bugs. I have no proof they came from the Chan module, and no minimal test case was ever able to recreate them, but the same program with my own Chan implementation worked. My Chan had different properties (it queues items randomly) and a subset of the Chan functions, so it still doesn't prove any issue with Chan - but I am now sceptical.
You should have more luck with Control.Concurrent.STM.TChan, incidentally. It's much easier to get right, and when we benchmarked it, performance was about the same (all those withMVar/modifyMVars in Chan are quite expensive), plus you get to compose it easily: reading from either of 2 TChans is trivial.
The performance of the Haskell code is irrelevant - the program spends all its time invoking system calls. Looking at the implementation, I am indeed much more trusting of TChan, and I'll be using that in future if there is ever a need.
Thanks, Neil

On 04/07/10 10:30, Neil Mitchell wrote:
Hi Simon,
My suspicion for the root cause of the problem is that Concurrent.Chan is incorrect. In the course of debugging this problem we found 2 bugs in Chan, and while I never tracked down any other bugs in Chan, I no longer trust it. By rewriting parts of the program, including avoiding Chan, the bugs disappeared. I don't think I'll be using Chan again until after someone has proven it correct.
Considering Chan is <150 lines of code and has been around for many years, that's amazing! Did you report the bugs? Is it anything to do with asynchronous exceptions?
Nothing to do with async exceptions. I found:
http://hackage.haskell.org/trac/ghc/ticket/4154
Yup, that's a bug. Not clear if it's fixable.
http://hackage.haskell.org/trac/ghc/ticket/3527
That too. A very similar bug in fact; if there is a fix it will probably fix both of them. The problem is that readChan holds a lock on the read end of the Chan, so neither isEmptyChan nor unGetChan can work when a reader is blocked.
Of course, there's also the async exceptions bug still around:
http://hackage.haskell.org/trac/ghc/ticket/3160
Yes, that's a bug (though not in Chan).
However, even after having a program with no async exceptions (I never used them), and eliminating unGetChan and isEmptyChan, I still got bugs. I have no proof they came from the Chan module, and no minimal test case was ever able to recreate them, but the same program with my own Chan implementation worked. My Chan had different properties (it queues items randomly) and a subset of the Chan functions, so it still doesn't prove any issue with Chan - but I am now sceptical.
It's surprising how difficult it is to get these MVar-based abstractions right. Some thorough testing of Chan is probably in order. Cheers, Simon

Yup, that's a bug. Not clear if it's fixable.
That too. A very similar bug in fact, if there is a fix it will probably fix both of them. The problem is that readChan holds a lock on the read end of the Chan, so neither isEmptyChan nor unGetChan can work when a reader is blocked.
I wrote my Chan around the abstraction:
data Chan a = Chan (MVar (Either [a] [MVar a]))
The Chan either has elements in it (Left), or has readers waiting for elements (Right). To get the fairness properties on Chan you might want to make these two lists Queues, but I think the basic principle still works. By using this abstraction my Chan was a lot simpler. With this scheme implementing isEmptyChan or unGetChan would both work nicely. My Chan was not designed for performance. (In truth I replaced the Left with IntMap a, and inserted elements with a randomly chosen key, but the basic idea is the same.)
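For concreteness, one way the abstraction above could be fleshed out is sketched below. Only the data declaration comes from the message; the function bodies are my own guess at a straightforward implementation using plain lists, not the code that was actually used.

```haskell
import Control.Concurrent.MVar
import Control.Monad (replicateM)

-- Left: buffered elements. Right: readers waiting, each on a private MVar.
data Chan a = Chan (MVar (Either [a] [MVar a]))

newChan :: IO (Chan a)
newChan = Chan <$> newMVar (Left [])

writeChan :: Chan a -> a -> IO ()
writeChan (Chan var) x = modifyMVar_ var $ \st -> case st of
  Right (w : ws) -> do
    putMVar w x                        -- hand the item to the oldest waiting reader
    pure (if null ws then Left [] else Right ws)
  Right []       -> pure (Left [x])
  Left xs        -> pure (Left (xs ++ [x]))  -- O(n) append; acceptable for a sketch

readChan :: Chan a -> IO a
readChan (Chan var) = do
  res <- modifyMVar var $ \st -> case st of
    Left (x : xs) -> pure (Left xs, Left x)
    Left []       -> do
      w <- newEmptyMVar
      pure (Right [w], Right w)
    Right ws      -> do
      w <- newEmptyMVar
      pure (Right (ws ++ [w]), Right w)
  case res of
    Left x  -> pure x
    Right w -> takeMVar w              -- block outside the lock on the shared state

main :: IO ()
main = do
  c <- newChan
  mapM_ (writeChan c) [1, 2, 3 :: Int]
  xs <- replicateM 3 (readChan c)
  print xs  -- prints: [1,2,3]
```

Because a reader never blocks while holding the outer MVar, an emptiness check or an unget would only need a short `modifyMVar` on the shared state. The sketch makes no attempt to handle asynchronous exceptions.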
own Chan implementation worked. My Chan had different properties (it queues items randomly) and a subset of the Chan functions, so it still doesn't prove any issue with Chan - but I am now sceptical.
It's surprising how difficult it is to get these MVar-based abstractions right. Some thorough testing of Chan is probably in order.
Agreed! In this project I wrote 8 different concurrency abstractions. I had bugs in most. MVar is a great building block on which to put higher-layered abstractions, but using it correctly is tricky. I found that I used MVars in four ways:
1) MVars which are always full, and are just locks around data for consistency. Created with newMVar, used with modifyMVar.
2) MVars which contain unit and are used for locking something other than data (i.e. a file on disk). Created with newMVar, used with withMVar.
3) MVars which are used to signal computation can begin, created with newEmptyMVar, given to someone who calls putMVar (), and waited on by the person who created them.
4) MVars which go in a higher-level concurrency operation - CountVars (variables which wait until they have been signaled N times), RandChan (Chan but with randomness), Pool (thread pool) etc.
Thanks, Neil
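The first three usage patterns listed above can be sketched in a few lines. The variable names are hypothetical, and the fourth pattern is omitted because it depends on the particular higher-level abstraction.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar

main :: IO ()
main = do
  -- 1) An always-full MVar acting as a lock around data.
  counter <- newMVar (0 :: Int)
  modifyMVar_ counter (pure . (+ 1))

  -- 2) An MVar holding unit, guarding something other than data.
  lock <- newMVar ()
  withMVar lock $ \() -> putStrLn "holding the lock"

  -- 3) An empty MVar used as a one-shot signal from another thread.
  done <- newEmptyMVar
  _ <- forkIO $ do
    modifyMVar_ counter (pure . (+ 1))
    putMVar done ()
  takeMVar done

  readMVar counter >>= print  -- prints: 2
```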

On 04/07/2010 21:51, Neil Mitchell wrote:
Yup, that's a bug. Not clear if it's fixable.
That too. A very similar bug in fact, if there is a fix it will probably fix both of them. The problem is that readChan holds a lock on the read end of the Chan, so neither isEmptyChan nor unGetChan can work when a reader is blocked.
I wrote my Chan around the abstraction:
data Chan a = Chan (MVar (Either [a] [MVar a]))
The Chan either has elements in it (Left), or has readers waiting for elements (Right). To get the fairness properties on Chan you might want to make these two lists Queues, but I think the basic principle still works. By using this abstraction my Chan was a lot simpler. With this scheme implementing isEmptyChan or unGetChan would both work nicely. My Chan was not designed for performance. (In truth I replaced the Left with IntMap a, and inserted elements with a randomly chosen key, but the basic idea is the same.)
I like the idea. But what happens if one of the blocked threads gets killed by a killThread (e.g. a timeout) while it is waiting? Won't we still give it an element of the Chan sometime in the future? Perhaps this doesn't happen in your scenario, but it seems to throw a spanner in the works for using this as a general-purpose implementation. The STM version doesn't have this bug, of course :-) But then, it doesn't have fairness either. Cheers, Simon

I wrote my Chan around the abstraction:
data Chan a = Chan (MVar (Either [a] [MVar a]))
The Chan either has elements in it (Left), or has readers waiting for elements (Right). To get the fairness properties on Chan you might want to make these two lists Queues, but I think the basic principle still works. By using this abstraction my Chan was a lot simpler. With this scheme implementing isEmptyChan or unGetChan would both work nicely. My Chan was not designed for performance. (In truth I replaced the Left with IntMap a, and inserted elements with a randomly chosen key, but the basic idea is the same.)
I like the idea. But what happens if one of the blocked threads gets killed by a killThread (e.g. a timeout) while it is waiting? Won't we still give it an element of the Chan sometime in the future? Perhaps this doesn't happen in your scenario, but it seems to throw a spanner in the works for using this as a general-purpose implementation.
I hadn't thought of that at all - my scenario doesn't have any threads being killed. With the thought of threads dying concurrency abstractions become significantly harder - I hadn't quite realised how hard that must make it. Thanks, Neil

participants (3)
- nccb2@kent.ac.uk
- Neil Mitchell
- Simon Marlow