How to Build a Highly Available System
Without a Toolkit--Slides

Butler Lampson
Microsoft

 

Abstract: In this talk I will describe the general recipe for building a highly available system and show how to do it efficiently in practice. The idea is to build a replicated state machine.  The tricky part is getting all the replicas to agree on the content and order of the inputs.  For this you need a general fault-tolerant algorithm for reaching consensus. Two-phase commit uses the simplest possible such algorithm, but  I will describe Lamport's "Paxos" algorithm, which is the best one that works without real-time guarantees.  Since consensus is expensive, you also need to minimize its use, and you do that with leases, which are locks that time out.

The other theme of the talk is a general method for designing or understanding a complex system, especially a concurrent or fault-tolerant one.  Using the Paxos algorithm as the example, I will show you how to write a precise spec for such a system and how to prove that an implementation satisfies the spec.  There won't be any formulas in the talk.

 

 

Word, Acrobat, HTML. A paper is here.