
Benjamin Franksen wrote:
Chris Kuklewicz wrote:
Aside on utterly useful proofs: When you write concurrent programs you want an API with strong and useful guarantees so you can avoid deadlocks, starvation, and conflicting data reads/writes. Designing and using such an API is a reasoning exercise identical to creating proofs. Some systems make this possible and even easy (such as STM).
I have to -- partly -- disagree with your last statement. It is possible that I missed something important, but as I see it, STM has by construction the disadvantage that absence of starvation is quite hard (if not impossible) to prove. A memory transaction must be restarted from the beginning if it finds (when trying to commit) that another task has modified one of the TVars it depends on. This means that a long transaction may in fact /never/ commit if it is interrupted often enough by short transactions that modify one of the TVars the longer transaction depends upon.
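For illustration, the failure mode described above can be provoked deliberately. This is a hypothetical sketch, not code from the thread: a thread of short transactions keeps bumping a TVar while a deliberately slow transaction reads the same TVar, so the slow one's attempts are typically invalidated and rolled back until the bumper stops. The attempt counter uses `unsafeIOToSTM` purely as a demonstration device, since ordinary STM writes are discarded on rollback.

```haskell
import Control.Concurrent
import Control.Concurrent.STM
import Control.Monad (replicateM_)
import Data.IORef
import GHC.Conc (unsafeIOToSTM)

main :: IO ()
main = do
  counter  <- newTVarIO (0 :: Int)
  attempts <- newIORef (0 :: Int)
  -- Short transactions: bump `counter` 50 times, 1 ms apart.
  _ <- forkIO $ replicateM_ 50 $ do
         atomically $ modifyTVar' counter (+1)
         threadDelay 1000
  -- Deliberately slow transaction over the same TVar: each attempt
  -- sleeps, so while the bumper is running, `counter` has usually
  -- changed by commit time and the attempt is rolled back.
  v <- atomically $ do
         unsafeIOToSTM (modifyIORef' attempts (+1))  -- count attempts (demo only)
         x <- readTVar counter
         unsafeIOToSTM (threadDelay 5000)            -- simulate long work
         pure x
  n <- readIORef attempts
  putStrLn ("committed value " ++ show v ++ " after " ++ show n ++ " attempt(s)")
```

The slow transaction does eventually commit here only because the bumper stops after 50 bumps; if the short transactions ran forever at this rate, the slow one could starve indefinitely, which is exactly the concern raised.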
True. The STM API ensures that the program will be deadlock-free, that the mutable state stays consistent, and that progress will always be made (i.e. _some_ block will commit). The current STM implementation does not otherwise ensure fairness or avoid starvation. It is clever enough to put retrying blocks to sleep until one of the TVars they read is changed. But it does not have the single-wake-up FIFO fairness of MVars.
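The sleep-until-a-TVar-changes behaviour mentioned above can be seen in a few lines. A minimal sketch (names are illustrative): one thread blocks in `retry` on a TVar, and is woken only when another transaction writes it.

```haskell
import Control.Concurrent
import Control.Concurrent.STM

main :: IO ()
main = do
  flag <- newTVarIO False
  done <- newEmptyMVar
  _ <- forkIO $ do
    -- This transaction calls `retry`; the runtime puts the thread
    -- to sleep until some other transaction writes `flag`.
    atomically $ do
      ok <- readTVar flag
      if ok then pure () else retry
    putMVar done "woken by TVar write"
  threadDelay 10000
  atomically $ writeTVar flag True   -- wakes the sleeping thread
  takeMVar done >>= putStrLn
```

Note that when several threads are blocked on the same TVar, all of them are woken and race to revalidate; unlike `takeMVar`, there is no FIFO queue guaranteeing which one wins.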
IIRC this problem is mentioned in the original STM paper only in passing ("Starvation is surely a possibility" or some such comment). I would be very interested to hear if there has been any investigation of the consequences with regard to proving progress guarantees for programs that use STM.
The STM API ensures progress for the program as a whole (i.e. commits). If a block fails because of a conflict, that is because another block has committed. If all the threads of the program call retry then the program is stuck, but that is not STM's fault, and the runtime will kill the program in this case rather than hang.
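The "kill rather than hang" behaviour corresponds to GHC's `BlockedIndefinitelyOnSTM` exception, which the runtime raises when its deadlock detection (run at garbage-collection time) finds a transaction blocked on TVars no other thread can write. A minimal sketch; the `performGC` nudge is only there to make detection prompt and deterministic:

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM
import Control.Exception
import System.Mem (performGC)

main :: IO ()
main = do
  tv <- newTVarIO (0 :: Int)
  -- Nudge the GC shortly after we block, so the runtime's deadlock
  -- detection (which runs at GC time) fires promptly.
  _ <- forkIO (threadDelay 10000 >> performGC)
  -- The only transaction in the program retries on a TVar that no
  -- other thread will ever write: rather than hang, the runtime
  -- raises BlockedIndefinitelyOnSTM.
  r <- try $ atomically $ do
         x <- readTVar tv
         if x > 0 then pure x else retry
  case r of
    Left BlockedIndefinitelyOnSTM -> putStrLn "runtime detected a stuck transaction"
    Right v                       -> putStrLn ("committed: " ++ show v)
```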
My current take on this is that it might be possible to avoid this kind of starvation by appropriate scheduling. One idea is to assign time slices dynamically, e.g. to double a task's time slice whenever it must do a rollback instead of a commit (due to externally modified TVars), and to reset the time slice to the default on a successful commit or an exception.
With the new GHC using multiple CPUs this would not prevent a fast block from running & committing on a different CPU and causing the long block to retry.
Another possibility is to introduce dynamic task priorities and to increase priority on each unsuccessful commit.
(Unfortunately I have no proof that such measures would remove the danger of starvation, they merely seem plausible avenues for further research. I also don't know whether the current GHC runtime system already implements such heuristics.)
Cheers Ben
What you need in order to "increase fairness/priority" for a block is to run that block until it tries to commit, while preventing other blocks that would interfere from committing. The scheduler could take the runtime of a failed block into account when re-attempting it. It could lock out other blocks that might interfere with a very-long-delayed, much-re-attempted block until it has had a chance to commit. (Since the scheduler knows which TVars the problem block has needed in the past, any independent blocks can still run.) To keep the progress guarantee: if the problem block takes a very, very long time it must be timed out and another block allowed to commit. This partitioning into "fast" vs. "slow" blocks is based purely on the observations of the runtime scheduler.