Where STM is unstable at the moment, and how we can fix it

This email is inspired by the discussion here: http://hackage.haskell.org/trac/ghc/ticket/2401
As the ticket discusses, unsafeIOToSTM is, unlike unsafePerformIO or unsafeInterleaveIO, genuinely completely unsafe in that there is no way to use it such that a segfault or deadlock is not at least somewhat encouraged. The code attached to the ticket creates a deadlock solely through using it to write to stdout. But, for the same reason that unsafeIOToSTM is unstable, unsafeInterleaveIO now is very unstable as well -- conceivably, data generated from functions with lazy IO (including those in the prelude) could cause deadlocks within STM, and even segfaults.
In summary, a "validation" step is performed on all threads inside atomically blocks during garbage collection. This validation step will, on encountering invalid threads (i.e. ones which should be rolled back) immediately kill them dead and retry. This is different than the implementation described in the STM paper, where rollbacks only occur on commit. However, it does add a measure of efficiency.
The problem is that the validation code disregards exception handlers, since rollback is not an exception, and so anything embedded in STM that brackets an IO action, for example, can be rolled back without the final part of the exception even being called.
As Simon M. notes, the obvious solution would be to turn rollbacks into regular exceptions, but this would open a number of cans of worms.
A start, though not sufficient, would be for stm validation to respect blocked status -- not to block on it, obviously, but simply to refuse to rollback a transaction within it. Validation on GC is, after all, only an efficiency trick and implementation detail, and if it lets the occasional invalid transaction stand due to its blocked status, that transaction will simply be cleaned up later anyway.
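The bracketing hazard described above can be made concrete with a small sketch. It only illustrates the shape of the problem and does not reproduce the failure; the names guardedAction, riskyTransaction and the counter held are hypothetical and are not taken from the ticket. It uses only GHC.Conc, Control.Exception and Data.IORef from base.

```haskell
import Control.Exception (bracket_)
import Data.IORef (IORef, modifyIORef')
import GHC.Conc (STM, TVar, readTVar, unsafeIOToSTM, writeTVar)

-- Hypothetical resource counter: +1 on acquire, -1 on release.
-- In plain IO, bracket_ runs the release step even if the body throws.
guardedAction :: IORef Int -> IO ()
guardedAction held =
  bracket_
    (modifyIORef' held (+ 1))         -- acquire
    (modifyIORef' held (subtract 1))  -- release
    (return ())                       -- the protected work would go here

-- Embeds the bracketed IO action in a transaction. If validation
-- discards the transaction while guardedAction is still running, no
-- exception is raised, so the release step above is never executed.
riskyTransaction :: IORef Int -> TVar Int -> STM ()
riskyTransaction held tv = do
  n <- readTVar tv
  unsafeIOToSTM (guardedAction held)
  writeTVar tv (n + 1)
```

With no contention, atomically (riskyTransaction held tv) runs once and leaves the counter balanced. The concern in the message is the contended case, where the transaction can be abandoned between the acquire and release steps.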
A more thorough solution would be, as I suggest at the end of the ticket, to add a new primitive with similar semantics to block -- blockRollback, of type STM () -> STM (). Anything that took place within blockRollback could not be stopped by validation.
Finally, we could "split the difference" between block and blockRollback, by simply setting a rollbackBlocked flag on a *top level* invocation of block within STM, and thenceforth, not unsetting it until that block is exited, regardless of calls to unblock nested inside. This would effectively, without introducing a new primitive, ensure that rollback did not disrupt things terribly, and thus would be the solution that handled the lazyIO issue the best as well.
There are lots of interesting applications of STM that require the ability to extend its semantics. To do this is going to require unsafeIOToSTM, just as unsafePerformIO is used on occasion as a low level tool to create safer and better things on top of (or as unsafeCoerce is, for that matter). However, the current state of STM means that writing these extensions of STM semantics safely is 100% impossible.
I'm not sure which, if any, of the solutions that I'm presenting seem the most reasonable. However, without some sort of resolution for this issue, STM is far less powerful and useful than it can and should be.
--Sterl.

Sterling Clover wrote:
This email is inspired by the discussion here: http://hackage.haskell.org/trac/ghc/ticket/2401
As the ticket discusses, unsafeIOToSTM is, unlike unsafePerformIO or unsafeInterleaveIO, genuinely completely unsafe in that there is no way to use it such that a segfault or deadlock is not at least somewhat encouraged. The code attached to the ticket creates a deadlock solely through using it to write to stdout. But, for the same reason that unsafeIOToSTM is unstable, unsafeInterleaveIO now is very unstable as well -- conceivably, data generated from functions with lazy IO (including those in the prelude) could cause deadlocks within STM, and even segfaults.
In summary, a "validation" step is performed on all threads inside atomically blocks during garbage collection. This validation step will, on encountering invalid threads (i.e. ones which should be rolled back) immediately kill them dead and retry. This is different than the implementation described in the STM paper, where rollbacks only occur on commit. However, it does add a measure of efficiency.
It's not just an efficiency trick, in fact. The validation step is absolutely necessary for correctness. The problem is that a transaction may have seen an inconsistent view of memory, and as a result it may have gone into an infinite loop; the only way to catch and recover from this situation is to validate at regular intervals, say before a GC (this suffers from the problem that the transaction has to be allocating in order to be stopped, but that's another matter). e.g. the code might be something like

    atomically $ do
      a <- readTVar ta
      b <- readTVar tb
      if a == b then loop else return ()

Now we might know that a is never equal to b under normal conditions: all the transactions in the program satisfy the invariant. However, since we use optimistic concurrency, it might be the case that this thread sees an inconsistent view of memory in which a==b. The case would normally be caught at commit time, but this thread isn't going to commit: it goes into an infinite loop instead.
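To show how such an inconsistent view could arise, here is a minimal sketch under stated assumptions: the two variables start out different, and the only writer is a hypothetical swapBoth transaction that exchanges them, so "a is never equal to b" holds after every commit. The names swapBoth and checkInvariant are illustrative and do not come from the ticket.

```haskell
import GHC.Conc (STM, TVar, readTVar, writeTVar)

-- Hypothetical writer: exchanges the two values. If ta and tb start
-- out different, they are still different after every commit.
swapBoth :: TVar Int -> TVar Int -> STM ()
swapBoth ta tb = do
  a <- readTVar ta
  b <- readTVar tb
  writeTVar ta b
  writeTVar tb a

-- The reader from the message. If swapBoth commits between the two
-- reads, this transaction can observe a == b and enter the loop,
-- so it never reaches commit-time validation.
checkInvariant :: TVar Int -> TVar Int -> STM ()
checkInvariant ta tb = do
  a <- readTVar ta
  b <- readTVar tb
  if a == b then loop else return ()
  where
    -- Spins inside the transaction; only periodic validation can
    -- notice that the view of memory it is based on is stale.
    loop = readTVar ta >> loop
```

For instance, starting from ta = 0 and tb = 1, a reader that sees ta = 0, is overtaken by a committed swapBoth, and then reads tb = 0 has a view in which a == b, even though no committed state ever had equal values.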
As Simon M. notes, the obvious solution would be to turn rollbacks into regular exceptions, but this would open a number of cans of worms.
A start, though not sufficient, would be for stm validation to respect blocked status -- not to block on it, obviously, but simply to refuse to rollback a transaction within it.
That wouldn't be correct, because the thread might be in an infinite loop inside a block. However, it would probably work in the cases you're interested in, so I wouldn't object to a patch that implemented this workaround for the time being.
I do agree that we have a problem here, and I'll re-open the ticket (sorry for leaving it closed).
I think raising an (asynchronous) exception is the right solution. We have to make sure the exception cannot be caught by an STM catch, but I think that's do-able. However, another problem we have is that when the IO system re-raises the exception, it'll be raised as a synchronous exception rather than an asynchronous exception. I've just spent an hour or so talking this over here with Simon PJ and we have some ideas for fixing it, I'll try to write it up in a ticket later.
Cheers,
Simon
participants (2): Simon Marlow, Sterling Clover