Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is it recommended to create clusters with odd number of nodes

There are several resources about distributed systems, like the mongo db documentation that recommend odd number of nodes in a cluster.

What are the benefits of having odd number of nodes?

like image 618
ProgramCpp Avatar asked Nov 12 '19 16:11

ProgramCpp


2 Answers

Short answer: in this case of MongoDB, having an odd number of nodes increases your clustered system's availability (uptime).

Look at the table in the MongoDB documentation you linked:

+-------------------+------------------------------------------+-----------------+
| Number of Members | Majority Required to Elect a New Primary | Fault Tolerance |
+-------------------+------------------------------------------+-----------------+
|         3         |                    2                     |        1        |
+-------------------+------------------------------------------+-----------------+
|         4         |                    3                     |        1        |
+-------------------+------------------------------------------+-----------------+
|         5         |                    3                     |        2        |
+-------------------+------------------------------------------+-----------------+
|         6         |                    4                     |        2        |
+-------------------+------------------------------------------+-----------------+

Notice that how when you have an odd number of members and add one more (to become even), your fault tolerance does not go up! (Meaning, your cluster cannot tolerate more failed members than it originally could)

This is because MongoDB requires a majority of members to be up in order to elect a primary. This property is not specific to MongoDB, but any clustered system that requires a majority of members to be up (for example, see also etcd).

Your system availability actually goes down when increasing to an even number of nodes because, although your fault tolerance remains the same, there are more nodes that can fail so the probability of a fault occurring goes up.

In addition, having an even number of members decreases the probability that if there is a network partition then some subset of your nodes will be able to continue running. For example, if you've got a 6 node cluster then it opens up the possibility that a network partition could partition your nodes into 2 3-node partitions. In such a case then neither partition will be able to communicate with a majority of members and your cluster becomes unavailable.

The counter-intuitive conclusion is that, if you have an even-membered cluster then it is actually beneficial (from a high-availability standpoint) to remove one of the members.

like image 70
Mark A Avatar answered Nov 02 '22 23:11

Mark A


The odd number of nodes help - and not necessary - in electing a leader in a cluster. It is essential to avoid multiple leaders getting elected, a condition known as split-brain problem. consensus algorithms use voting for electing the leader. i.e, elect the node with majority votes.

consider a cluster of 5 nodes. the minimum majority required is 3 (5/2 or 2 + 2 + 1- the deal breaker).

It is important to note that the majority of the cluster votes are required for leader election even under failure conditions.

consider 1 out of 5 nodes failed. we can still elect a leader with majority votes of 3. well, what if out of the 4 nodes, two leaders get elected with equal votes of 2? that's left to the consensus algorithm to resolve the contention (maybe, simply reinitiate election)

let's say, 2 out of 5 nodes failed. we can still elect a leader with majority votes of 3 i.e when all 3 available nodes vote for the same node.

One usually gets confused about achieving majority when one of the odd nodes fails, leaving them even in number. it should be clear by now, that the majority of the initial cluster size (preferably odd) is required to elect the leader.

we have seen how the odd number clusters help in case of node failures. another point to add here is, how this helps in case of network partitions. In the worst case, the network partition can split the cluster into exactly two equal halves which cannot happen in an odd-numbered cluster.

As long as the part of cluster or the number of operating nodes are greater or equal to floor(n/2)+1, to reach consensus based on majority w.r.t the initial cluster size, the cluster can continue to operate

like image 45
ProgramCpp Avatar answered Nov 02 '22 23:11

ProgramCpp