Fault-Tolerant Systems Are Faulty
Richard Cook wrote an excellent article, “How Complex Systems Fail.” The author packs a lot of ideas into four pages, divided into 18 points. Here are his first five points.
- Complex systems are intrinsically hazardous systems.
- Complex systems are heavily and successfully defended against failure.
- Catastrophe requires multiple failures – single-point failures are not enough.
- Complex systems contain changing mixtures of failures latent within them.
- Complex systems run in degraded mode.
Cook (no relation) elaborates his fifth point:
A corollary to the preceding point is that complex systems run as broken systems. The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.
Complex systems are necessarily fault-tolerant. If they weren’t fault-tolerant, they likely wouldn’t survive long enough to become complex. Unfortunately, the downside is that fault-tolerant systems are always faulty.
We want our software to be fault-tolerant, but it is very difficult to tolerate faults without encouraging and concealing them at the same time. (Think of badly formed HTML, for example.) Fault tolerance can work smoothly if you have a well-defined range of faults that your system is designed to tolerate. But misguided attempts to tolerate errors can mask problems, delaying but not preventing failure. This in turn makes the failure harder to diagnose and repair.
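To make the distinction concrete, here is a minimal Python sketch of the two approaches. It is only an illustration, not anything from Cook's article: the names `TOLERATED_FAULTS`, `call_with_retry`, and `call_masking_everything` are hypothetical, and treating timeouts and connection errors as the "well-defined range" is an assumption made for the example.

```python
import logging

logger = logging.getLogger("fault_tolerance_sketch")

# Assumption for this sketch: the well-defined range of faults
# the code is designed to tolerate.
TOLERATED_FAULTS = (TimeoutError, ConnectionError)


def call_with_retry(operation, attempts=3):
    """Retry only the declared faults; let everything else propagate."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TOLERATED_FAULTS as exc:
            # Tolerated, but logged, so the fault is not concealed.
            logger.warning("attempt %d of %d failed: %r", attempt, attempts, exc)
            if attempt == attempts:
                raise  # retries exhausted: fail loudly


def call_masking_everything(operation, default=None):
    """Misguided tolerance: swallow any error and return a default."""
    try:
        return operation()
    except Exception:
        return default  # the bug is hidden and will surface somewhere else
```

With the first function, a programming error such as a `KeyError` still fails immediately at its source, while a transient timeout is retried and logged. With the second, both kinds of error vanish into the default value, which is the masking described above.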
Some systems must be complex and fault-tolerant. But when we decide to make software fault-tolerant, especially if some realistic alternative could make it simpler, we should be aware of the consequences.