NoSQL Zone is brought to you in partnership with:

Developer with experience in a variety of different systems and technologies, with a customer focus and balance with business goals. Particularly interested in backend and large scale systems, and also interested in high level architecture, and API design. Always open to feedback in order to keep learning and improving as a professional. Rodrigo is a DZone MVB and is not an employee of DZone and has posted 39 posts at DZone. You can read more from them at their website. View Full User Profile

Consensus Protocol

11.12.2012
| 3414 views |
  • submit to reddit
You've probably heard of 2-phase commit, Paxos, consensus protocol. Paxos itself seems to be a complex thing to understand and, although proven to be correct, it's not uncommon to hear that it's not really required and some "simpler" algorithms can be used.


These posts I came across give a great explanation on the shortcomings of 2-phase commit when it comes to failures and also of 3-phase commit, which handles well one type of failure (fail-stop). It gives great examples of hard distributed system issues with these protocols that make them not robust enough.

And this is a great quote from these articles:

Mike Burrows, inventor of the Chubby service at Google, says that “there is only one consensus protocol, and that’s Paxos” – all other approaches are just broken versions of Paxos.
I'd definitely recommend you take the time to read them:

Consensus Protocols: Two-Phase Commit
Consensus Protocols: Three-phase Commit
Consensus Protocols: Paxos

And, just to finalize, when I see quotes like above and the number of issues with 2PC and 3PC, I wonder how reliable consensus protocols like the one used by MongoDB to pick the primary replica actually is:
We use a consensus protocol to pick a primary. Exact details will be spared here but that basic process is:
1.get maxLocalOpOrdinal from each server.
2.if a majority of servers are not up (from this server's POV), remain in Secondary mode and stop.
3.if the last op time seems very old, stop and await human intervention.
4.else, using a consensus protocol, pick the server with the highest maxLocalOpOrdinal as the Primary.

Any server in the replica set, when it fails to reach master, attempts a new election process.
Published at DZone with permission of Rodrigo De Castro, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)