The most robust way is probably to use a completely independent
supervisor program, e.g. "upstart", "systemd" or "runit". These
usually have facilities for restarting the supervised program, and a
rate limit on how often to retry within a given period of time.
These *won't* work for a program that's deadlocked because an important
thread has died, since the process itself is still running and never
exits. For that you'll need either a watchdog (external) or an
in-program mechanism for "supervised threads" which can catch any and
all exceptions and restart threads as necessary. This tends to be very
domain-specific, but you might take some inspiration from the way
supervisor hierarchies work in the actor model.
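
To make the "supervised threads" idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration under some assumptions: the class name SupervisedThread and its parameters max_restarts and period are made up for this example, the worker body is simply re-run inside the same supervising thread rather than in a fresh one, and only Exception subclasses are caught (not KeyboardInterrupt or SystemExit). A real design would depend heavily on what state the worker holds when it fails.

```python
import threading
import time


class SupervisedThread:
    """Hypothetical helper: runs `target` in a thread and re-runs it if it raises.

    Gives up once `max_restarts` restarts have happened within `period` seconds,
    loosely mirroring the rate limit that external supervisors apply.
    """

    def __init__(self, target, max_restarts=3, period=60.0):
        self.target = target
        self.max_restarts = max_restarts
        self.period = period
        self.restarts = 0       # total number of restarts performed
        self.gave_up = False    # True if the rate limit was exceeded
        self._thread = threading.Thread(target=self._supervise, daemon=True)

    def start(self):
        self._thread.start()

    def join(self, timeout=None):
        self._thread.join(timeout)

    def _supervise(self):
        recent = []  # monotonic timestamps of restarts inside the window
        while True:
            try:
                self.target()
                return  # the worker finished normally, nothing to restart
            except Exception:
                now = time.monotonic()
                # Forget restarts that fell out of the rate-limit window.
                recent = [t for t in recent if now - t < self.period]
                if len(recent) >= self.max_restarts:
                    self.gave_up = True
                    return
                recent.append(now)
                self.restarts += 1


# Usage: a worker that fails twice and then succeeds.
attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("simulated failure")


worker = SupervisedThread(flaky)
worker.start()
worker.join()
```

In the actor-model style, the decision about what to do on failure (restart, escalate, give up) would live in a separate supervisor rather than in the worker itself; the gave_up flag above is a crude stand-in for escalating to a parent.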
Hi Bardur, the "supervised threads" sounds like a good approach for me. Thanks!