DevOps Zone is brought to you in partnership with:

My name is Konrad Garus and I solve problems for a living. I am crazy about quality of code and life, zealous learner and believer in constant refinement and improvement. I fight stubborn ignorance and “good enough” with passion. Personally I also am a husband, father, passionate reader and music fan. Konrad is a DZone MVB and is not an employee of DZone and has posted 27 posts at DZone. You can read more from them at their website. View Full User Profile

Systems that Run Forever Self-heal and Scale

  • submit to reddit

I recently saw a great presentation by Joe Armstrong called “Systems that Run Forever Self-heal and Scale”. Joe Armstrong is the inventor of Erlang and he does mention Erlang quite a lot, but the principles are very much universal and applicable with other languages and tools.

The talk is well worth watching, but here’s a few quick notes for a busy reader or my future self.

General Remarks

  • If you want to run forever, you have to have more than one instance of everything. If anything is unique, then as soon as that service or machine goes down, your system goes down. This may be due to an unplanned outage or a routine software update. Obvious but still pretty hard.

  • There are two ways to design systems: scaling up or scaling down. If you want a system for 1,000 users, you can start with a design for 10 users and expand it, or start with 1,000,000 users and scale it down. You will get a different design for your 1,000 users depending on where you start.

  • The hardest part is distributing data in a consistent, durable manner. Don’t even try to do it yourself, use known algorithms, libraries and products.

    Data is sacred, pay attention to it. Web services and such frameworks? Whatever, anyone can write those.
  • Distributing computations is much easier. They can be performed anywhere, resumed or retried after a failure, etc. There are some more suggestions and hints on how to do it.

Six Rules of a Reliable System

  1. Isolation: when one process crashes, it should not crash others. This naturally leads to better fault-tolerance, scalability, reliability, testability and comprehensibility. It all also means much easier code upgrades.

  2. Concurrency: this is pretty obvious -- you need more than one computer to make a non-stop system, and that automatically means they will operate concurrently and be distributed.

  3. Failure detection: You can’t fix it if you can’t detect it. It has to work across machine and process boundaries because the entire machine and process can’t fail. You can’t heal yourself when you have a heart attack, it has to be external force.

    This implies asynchronous communication and message-driven model.

    Interesting idea: supervision trees. Supervisors on higher levels of the tree, workers in leaves.

  4. Fault identification: when it fails, you also need to know why it failed.

  5. Live code upgrade: obviously a must-have for zero downtime. Once you start the system, never stop it.

  6. Stable storage: store things forever in multiple copies, distributed across many machines and places, etc.

    With proper stable storage you don’t need backups. Snapshots, yes, but not backups.

Others: Fail fast, fail early, let it crash. Don’t swallow errors, don’t continue unless you really know what you’re doing. Better crash and let the higher level process decide how to deal with an illegal state.

The Actor Model in Erlang

We’re used to two notions of running things concurrently: processes and threads. The difference? Processes are isolated, live in different places in memory and one can’t screw the other. Threads can.

Answer from Erlang: Actors. They are isolated processes, but they’re not the heavy operating system processes. They all live in the Erlang VM, rely on it for scheduling, etc. They’re very light and you can easily run thousands of them on a computer.


Much of this is very natural in functional programming. Perhaps that’s what makes functional programming so popular nowadays – that in this paradigm it’s so much easier to write reliable, fault-tolerant, scalable, comprehensible systems.

Published at DZone with permission of Konrad Garus, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)


Jonathan Halterman replied on Sat, 2013/08/24 - 3:14pm

In JVM land there are a few options for failure detection, some of them having been inspired by Erlang's supervision trees. Akka's  actors are one option. Sarge (a library I maintain) is another. And Netflix's Hystrix library, which implements the circuit breaker pattern, is another.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.