Thursday, July 29, 2004

It's not as scary as it sounds.

Greg Black comments that he took a look at Joe Armstrong's thesis I linked to below. Just in case his description makes it sound intimidating, the error-handling philosophy discussion --- let it crash --- is one section of chapter 4 (i.e. about 3 pages). Of course, handling errors is only one step towards a reliable system.

In fact, the chapters of the thesis are largely approachable independently of each other. Chapters 2 and 4 (Architecture and Programming Principles) are particularly good in this regard.

In the meantime, for those who are feeling too lazy to read the actual PDF, an executive summary:

  • We don't know how to write bug-free programs
  • So every substantial program will have bugs
  • Even if we are lucky enough to miss our bugs, unexpected interactions with the outside world (including the hardware we are running on) will cause periodic failures in any long-running process
  • So make sure any faults that do occur can't interfere with the execution of your program
  • Faults are nasty, subtle, vicious creatures with thousands of non-deterministic side-effects to compensate for
  • So the only safe way to handle a bug is to terminate the buggy process
  • So don't program defensively: Just let it crash, and make sure
    1. Your runtime provides adequate logging/tracing/hot-upgrade support to detect/debug/repair the running system
    2. You run multiple levels of supervisor/watchdog, all the way from supervisor trees up to automatic, hot-failover hardware clusters (a rough sketch of the basic restart loop follows below)
Simple really ;)
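
To make the supervisor idea a bit more concrete, here is a minimal sketch in Python rather than Erlang/OTP. Everything in it (flaky_worker, the max_restarts limit, the print-based logging) is invented for illustration; the point is the shape of the pattern: the worker makes no attempt to recover in place, it just crashes, and a separate supervising process decides whether to restart it or give up and escalate.

```python
import multiprocessing
import random
import time


def flaky_worker():
    # A made-up worker: it does its job until an unexpected fault appears,
    # and then it simply crashes rather than trying to patch things up.
    while True:
        time.sleep(0.5)
        if random.random() < 0.2:
            raise RuntimeError("unexpected fault")


def supervisor(max_restarts=5):
    # Watch the worker from a separate process; restart it when it dies,
    # and escalate once it has failed too often.
    for attempt in range(1, max_restarts + 1):
        child = multiprocessing.Process(target=flaky_worker)
        child.start()
        child.join()                      # returns once the worker terminates
        if child.exitcode == 0:           # clean exit: nothing left to supervise
            return
        print(f"worker crashed (exit code {child.exitcode}); restart {attempt}")
    # The worker keeps dying: give up and let whoever supervises *us* decide,
    # much as an OTP supervisor escalates to its parent in the supervision tree.
    raise RuntimeError("worker restarted too many times; escalating")


if __name__ == "__main__":
    supervisor()
```

In a real system the supervisor would itself be supervised, which is where the "multiple levels" in the last bullet come in, and the runtime's logging/tracing/hot-upgrade support is what gives you a chance of actually diagnosing and repairing the crashes you've collected.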
