Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ELI5: How etcd really works and what is consensus algorithm

I am having hard time to grab what etcd (in CoreOS) really does, because all those "distributed key-value storage" thingy seems intangible to me. Further reading into etcd, it delves into into Raft consensus algorithm, and then it becomes really confusing to understand.

Let's say that what happen if a cluster system doesn't have etcd?

Thanks for your time and effort!

like image 802
Aizan Fahri Avatar asked Jul 12 '14 00:07

Aizan Fahri


2 Answers

As someone with no CoreOS experience building a distributed system using etcd, I think I can shed some light on this.

The idea with etcd is to give some very basic primitives that are applicable for building a wide variety of distributed systems. The reason for this is that distributed systems are fundamentally hard. Most programmers don't really grok the difficulties simply because there are orders of magnitude more opportunity to learn about single-system programs; this has really only started to shift in the last 5 years since cloud computing made distributed systems cheap to build and experiment with. Even so, there's a lot to learn.

One of the biggest problems in distributed systems is consensus. In other words, guaranteeing that all nodes in a system agree on a particular value. Now, if hardware and networks were 100% reliable then it would be easy, but of course that is impossible. Designing an algorithm to provide some meaningful guarantees around consensus is a very difficult problem, and one that a lot of smart people have put a lot of time into. Paxos was the previous state of the art algorithm, but was very difficult to understand. Raft is an attempt to provide similar guarantees but be much more approachable to the average programmer. However, even so, as you have discovered, it is non-trivial to understand it's operational details and applications.

In terms of what etcd is specifically used for in CoreOS I can't tell you. But what I can say with certainty is that any data which needs to be shared and agreed upon by all machines in a cluster should be stored in etcd. Conversely, anything that a node (or subset of nodes) can handle on its own should emphatically not be stored in etcd (because it incurs the overhead of communicating and storing it on all nodes).

With etcd it's possible to have a large number of identical machines automatically coordinate, elect a leader, and guarantee an identical history of data in its key-value store such that:

  • No etcd node will ever return data which is not agreed upon by the majority of nodes.
  • For cluster size x any number of machines > x/2 can continue operating and accepting writes even if the others die or lose connectivity.
  • For any machines losing connectivity (eg. due to a netsplit), they are guaranteed to continue to return correct historical data even though they will fail to write.

The key-value store itself is quite simple and nothing particularly interesting, but these properties allow one to construct distributed systems that resist individual component failure and can provide reasonable guarantees of correctness.

like image 76
gtd Avatar answered Sep 24 '22 08:09

gtd


etcd is a reliable system for cluster-wide coordination and state management. It is built on top of Raft.

Raft gives etcd a total ordering of events across a system of distributed etcd nodes. This has many advantages and disadvantages:

Advantages include:

  • any node may be treated like a master
  • minimal downtime (a client can try another node if one isn't responding)
  • avoids split-braining
  • a reliable way to build distributed locks for cluster-wide coordination
  • users of etcd can build distributed systems without ad-hoc, buggy, homegrown solutions

    For example: You would use etcd to coordinate an automated election of a new Postgres master so that there remains only one master in the cluster.

Disadvantages include:

  • for safety reasons, it requires a majority of the cluster to commit writes - usually to disk - before replying to a client
  • requires more network chatter than a single master system
like image 26
bmizerany Avatar answered Sep 25 '22 08:09

bmizerany