Fault Tolerance 101 (Joe Armstrong)



Joe Armstrong is always informative, entertaining, and edifying on the subject of fault tolerance. The man invented Erlang. How can we create fault-tolerant systems when our systems are unspecified? It’s more or less the same talk he has given for years, with roots in his PhD. The ideas are big and important, and they are worth learning.

People have claimed that even asking this question is a mistake (a common one), but I will ask it nonetheless: how can we bring these principles to Clojure to build more fault-tolerant systems?


I don’t get it. Why is it a mistake to ask this question?

Armstrong is not advocating for voodoo. He’s advocating for a way of thinking and the resulting architectural choices. Why is this something that cannot be applied in many languages, codified into libraries and best practices, etc.?


I don’t get it either. But every time I ask an Erlang person if there is something other languages can learn from Erlang, I get a lecture about how everyone tries and they fail. They go on about the process model, how failures are handled, and everything the VM does for you.

But I don’t want to replicate the Actor Model. I think there is a lot that can be learned and a lot that already was learned. For instance, fail fast is a pretty common programming practice and it’s supported by modern VMs. Divide by zero, which would kill a C process, simply throws an exception.

I agree, it’s misguided to add actors to an OO language. But people always talk about the gold hidden in OTP. When I ask about it, they can’t give me any straight answers. So I’m reading about OTP myself. One thing I have managed to extract from Erlangers is that there are patterns encoded in OTP. One of them is Try Three Times, which means accept two failures from your child process before failing yourself. That’s trivial to implement in Clojure and I use it all the time. I want more things like that.
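
Here is a minimal sketch of what Try Three Times could look like in Clojure. The names try-n-times and try-three-times are made up for illustration, and treating "failure" as "throws an Exception" is an assumption; this is not an OTP API, just one way to express the idea.

```clojure
(defn try-n-times
  "Calls the zero-argument function f, allowing up to n attempts.
  Returns f's result on the first success. Once all n attempts have
  failed, the last exception is rethrown so the caller fails too."
  [n f]
  (loop [attempt 1]
    ;; recur is not allowed inside catch, so the outcome is captured
    ;; as a value first and the retry decision is made afterwards.
    (let [result (try
                   {:ok (f)}
                   (catch Exception e
                     (if (< attempt n)
                       ::retry
                       (throw e))))]
      (if (= result ::retry)
        (recur (inc attempt))
        (:ok result)))))

(defn try-three-times
  "Accepts two failures from f before failing itself."
  [f]
  (try-n-times 3 f))
```

Calling (try-three-times f) with an f that throws twice and then succeeds returns the third call's value; an f that always throws makes try-three-times throw after the third attempt.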


Yeah. That lecture is probably just coming from loyalty and parochialism. Of course Erlang has lessons to teach the world besides “use Erlang”. How could it be otherwise?

One of my biggest regrets regarding 2014 is that I did not attend the 3-day Erlang Camp in Austin. There’s a great talk about Erlang, OTP, and Erlang Camp on one of the podcasts, maybe Software Engineering Radio. I remember noting it was coming up, and even though I am now building a messaging system (!!) I thought I would not have the time to attend and that by the time it rolled around I would be too far along to benefit from it.

Wrong on all counts. Hopefully they’ll do it again in 2015.


You might be thinking of my conversation with Martin J. Logan: http://www.functionalgeekery.com/episode-13-martin-j-logan/

I also recently talked with Eric B. Merritt, one of the other co-authors of Erlang and OTP in Action, who also runs ErlangCamp, for an episode ( http://www.functionalgeekery.com/episode-20-eric-b-merritt/ ). After we were done recording, he mentioned that they are likely not going to do an ErlangCamp this year, as the past few have just barely met their costs and more information on Erlang is getting out into the world.

I would guess they could be convinced if you could show them there was enough interest around it, but it didn’t sound like there were any set plans at this point.


I think the reason they say that is the way supervision is structured, which comes from being based on a process model.

The premise of the idea is that the error handling is moved out into a different process, the supervisor, and that handles how things should be dealt with. The other process is free to crash and burn because it lives in isolation from everything else, and shares no state with the outside world.

The supervisor is configured with a setting that tells it the dependencies of the processes it supervises, so it knows how to start from a fresh state.

One of the metaphors I have seen around this: if you are working in Microsoft Word and it starts acting up, you kill Word and restart it. If that still fails, you restart all of Office, and if you are still having issues, you restart higher-level groups of programs until you get to logging out of your account, and then rebooting the whole machine.

As for Try Three Times, in Erlang it is more of a "try X times in Y seconds", and then crash yourself and let the issue escalate.
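
As a rough single-threaded illustration of "so many restarts within a time window, then escalate", here is a sketch in Clojure. It is not how OTP supervisors are implemented: there is no separate supervisor process here, the function name run-with-restarts and its parameters are made up, and the injectable now-fn clock exists only so the window can be exercised without waiting.

```clojure
(defn run-with-restarts
  "Runs the zero-argument function f, restarting it whenever it throws.
  Allows at most max-restarts restarts within any window of period-ms
  milliseconds. When that limit would be exceeded, the last exception
  is rethrown so the failure escalates to the caller."
  ([max-restarts period-ms f]
   (run-with-restarts max-restarts period-ms f #(System/currentTimeMillis)))
  ([max-restarts period-ms f now-fn]
   (loop [restarts []]                       ; timestamps of recent restarts
     (let [outcome (try
                     {:ok (f)}
                     (catch Exception e
                       {:error e}))]
       (if (contains? outcome :ok)
         (:ok outcome)
         (let [now    (now-fn)
               ;; forget restarts that have fallen out of the window
               recent (filterv #(< (- now %) period-ms) restarts)]
           (if (< (count recent) max-restarts)
             (recur (conj recent now))
             (throw (:error outcome)))))))))
```

For example, (run-with-restarts 2 5000 f) lets f fail and restart twice within five seconds; a third failure inside that window is rethrown to whoever called run-with-restarts, which could itself be wrapped the same way one level up.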

That is not to say that this whole model can’t be done in other languages, but I would guess a lot of the reason people think it is tricky has more to do with the process model than purely the actor model, although the actor model helps because the state is isolated from everything else.

Clojure may have a better chance than some of the other languages, given some of the threading models it uses and the fact that it (mostly) deals with immutable local data. The global mutable items like refs, atoms, and agents(?) would have to be avoided to ensure a clean state and the ability to retry.

I am guessing the catch is that the logic you want to supervise, and allow to fail and be restarted, would need to be pulled out into its own tasks so that they could fail and be restarted cleanly without cratering the rest of the system.

The other option that Clojure could give you is to do some aspect-oriented pre/post logic by building up a macro system that abstracts away the error handling you would normally have to write, and that manages the nesting of dependencies to allow for restarts.

I think the other side of the “use Erlang” argument is not that it cannot be done in other languages, but that it is a large effort. Unless a team of people is dedicated to a project like that, it is better to use one of the languages that runs on the Erlang Runtime System for that part of the system, instead of trying to rewrite OTP on your own for a single project.

Another good reference, in addition to Erlang and OTP in Action, that I have picked up that shows some of the intricacies of OTP, is Designing for Scalability with Erlang/OTP by Francesco Cesarini and Steve Vinoski from O’Reilly.

Happy to try and answer any other questions you may have about this.


Hey Proctor,

Thanks for the awesome answer.

Restarting/retrying is something Clojurists think about already. Atoms, for instance, use optimistic concurrency. If there are simultaneous updates to an atom, one will win and the rest will retry. So we’re used to thinking that way. I believe that Clojure cannot do everything that Erlang does. But I believe there’s still a lot it could do very easily. There’s a lot of wisdom and experience baked into OTP, and we could all benefit if it were shared.
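
To make the atom retry concrete, here is a hand-rolled approximation of what swap! does, written with compare-and-set!. The function name swap-with-retry! is made up for this sketch, and returning the attempt count is an addition for illustration; the real swap! should be preferred in practice.

```clojure
(defn swap-with-retry!
  "Reads the atom, computes a new value with f, and commits it only if
  no other thread changed the atom in the meantime. On a conflict the
  whole read-compute-commit cycle is retried.
  Returns [new-value attempts]."
  [a f]
  (loop [attempts 1]
    (let [old-val @a
          new-val (f old-val)]
      ;; compare-and-set! succeeds only if the atom still holds old-val
      (if (compare-and-set! a old-val new-val)
        [new-val attempts]
        (recur (inc attempts))))))
```

Because f may run more than once, it has to be free of side effects, which is the same discipline a restartable task needs.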

Related to that: what is the most common OTP pattern you use?

I’ve got a few small patterns that have come in handy: try n times, try other (try X, if it fails, try Y), and exponential back-off (sleep 1 second after the first failure, 2 seconds after the second, 4 seconds after the third; great for rate-limited services where you don’t know the limit). I’d love to know of some others.
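
For anyone who wants to see them spelled out, here is one possible shape for try other and exponential back-off. All names (backoff-delays, try-with-backoff, try-other) are made up for this sketch, the max-attempts cap is an assumption so the loop always terminates, and sleep-fn is injectable only so the delays can be checked without actually waiting.

```clojure
(defn backoff-delays
  "Lazy sequence of delays in milliseconds: 1000, 2000, 4000, ..."
  []
  (iterate #(* 2 %) 1000))

(defn try-with-backoff
  "Calls the zero-argument function f up to max-attempts times,
  sleeping 1s, 2s, 4s, ... between failed attempts. Rethrows the last
  exception once max-attempts is reached."
  ([max-attempts f]
   (try-with-backoff max-attempts f #(Thread/sleep (long %))))
  ([max-attempts f sleep-fn]
   (loop [attempt 1
          delays  (backoff-delays)]
     (let [outcome (try
                     {:ok (f)}
                     (catch Exception e
                       {:error e}))]
       (cond
         (contains? outcome :ok)   (:ok outcome)
         (>= attempt max-attempts) (throw (:error outcome))
         :else (do (sleep-fn (first delays))
                   (recur (inc attempt) (rest delays))))))))

(defn try-other
  "Tries primary; if it throws, falls back to fallback."
  [primary fallback]
  (try
    (primary)
    (catch Exception _
      (fallback))))
```

The two compose: (try-other #(try-with-backoff 4 call-service) use-cached-value) backs off against the service a few times and only then falls back, where call-service and use-cached-value stand in for your own functions.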

The biggest problem I face is killing threads. There is no safe way on the JVM to kill a thread from the outside (Thread.stop is deprecated and unsafe). So you need some other abstraction on top of threads if you want to do it. I’m still working on that, with no real solution in sight.
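
The closest thing the JVM offers is cooperative cancellation through interruption, which only works if the task checks for it. Below is a small sketch of that idea; start-worker and stop-worker are made-up names, and the 10 ms pause and 1 second join timeout are arbitrary choices. It does not help with code that ignores interrupts, so it is a partial workaround and not a real kill.

```clojure
(defn start-worker
  "Starts a daemon thread that calls the zero-argument function step
  in a loop until the thread is interrupted. Returns the Thread."
  [step]
  (doto (Thread.
          (fn []
            (try
              (while (not (.isInterrupted (Thread/currentThread)))
                (step)
                (Thread/sleep 10))
              ;; sleep throws this if the interrupt arrives mid-sleep
              (catch InterruptedException _
                nil))))
    (.setDaemon true)
    (.start)))

(defn stop-worker
  "Asks the worker to stop by interrupting it, then waits up to one
  second for it to finish. Returns true if the thread has stopped."
  [^Thread worker]
  (.interrupt worker)
  (.join worker 1000)
  (not (.isAlive worker)))
```

A supervisor built on this can request a restart, but a step function that blocks forever without responding to interrupts will still never go away.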



Right, and let me just add one more thing that I forgot:

I don’t really expect these things to work well in Clojure for parallel execution. But even in a single thread, it should be possible to make use of some of them. For instance, try-3-times works very well in a single thread.



I’m just jumping into this but I find my Clojure code is often influenced by Erlang (and Haskell and Scheme) ideology, even if indirectly.

One library I found interesting was Michael Drogalis’s dire, which provides “Erlang-style supervisor error handling for Clojure.” https://github.com/MichaelDrogalis/dire What was particularly interesting was the availability of two different approaches, “drop-in flavor” and “Erlang-style.”

I’ve since dropped the dependency in my projects, in favor of a more Haskell-y approach with the either monad and https://github.com/funcool/cats
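
To give a flavor of the either-style approach without pulling in a dependency, here is a tiny plain-Clojure sketch. It is not the cats API: the names left, right, bind, attempt, and safe-div are defined here only for illustration, and the map representation is an assumption. The attempt helper uses ex-message, which needs Clojure 1.10 or later.

```clojure
(defn right [v] {:tag :right :value v})   ; a successful result
(defn left  [e] {:tag :left  :value e})   ; a failure, carried as data

(defn right? [m] (= :right (:tag m)))

(defn attempt
  "Runs the zero-argument function f and captures the outcome as a
  value instead of letting the exception propagate."
  [f]
  (try
    (right (f))
    (catch Exception e
      (left (ex-message e)))))

(defn bind
  "Applies f (a function returning a left or a right) only when m is a
  right. A left short-circuits and is passed through unchanged."
  [m f]
  (if (right? m)
    (f (:value m))
    m))

(defn safe-div
  "Division that reports a zero divisor as a left instead of throwing."
  [x y]
  (if (zero? y)
    (left "division by zero")
    (right (/ x y))))
```

Chaining (-> (right 10) (bind #(safe-div % 2)) (bind #(safe-div % 0)) (bind #(safe-div % 5))) stops at the second division and yields the left value, so the error travels through the pipeline as ordinary data.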

I think it’s important to avoid over-zealous devotion to any particular language/library/design pattern/framework/editor and keep an open mind to good ideas, irrespective of their source or baggage.