Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Questions about Paxos implementation

I am implementing Paxos in a cluster simulator application, using the documentation available in Wikipedia. Unfortunately, it leaves several doors open to interpretation and does not provide much information on key implementation issues. It is unclear and incomplete.

  • Assuming a cluster divided in 3 regions, each containing 3 nodes (total = 9 nodes). What happens if communication is broken between regions? There is no way any leader can reach quorum (which is 5).

Isn't Paxos going to enter an infinite loop? I guess one should not initiate Paxos if one cannot communicate with at least a quorum of nodes.

  • In Phase 1b: 'If the proposal number N is larger than any previous proposal, then each Acceptor promises not to accept proposals less than N, and sends the value it last accepted for this instance to the Proposer'.

What is 'the last value it accepted'? Is it any previous proposal number from the proposer? What does 'instance' refer to exactly in this case?

  • In Phase 1a: Does one include the value to agree on with the Prepare message or is this deferred to the Accept! message? Or it does matter?

  • In Phase 2a: 'If any of the Acceptors have already accepted a value, the leader must Choose a value with the maximum proposal number N'.

What is value here? Is it the proposal number? I believe not, but this phrase is unclear.

  • In Phase 2a: 'Otherwise, the Proposer is free to choose any value'. What does this mean? A value for what? For the proposal number?

  • Paxos seems to rely on an increasing value of N (proposal number) to work? Is this correct?

  • The wikipedia entry does not discuss the initial values a node should set before starting to participate in Paxos. What are these?

P.S.: I don't have enough reputation to create a 'Paxos' tag (any volunteer?)

like image 377
Jérôme Verstrynge Avatar asked May 01 '11 18:05

Jérôme Verstrynge


People also ask

Why do we need Paxos?

Paxos is usually used where durability is needed to replicate large datasets, such as a file or a database. The protocol attempts to make progress even during periods when some bounded number of replicas are unresponsive.

What is Paxos cluster?

Paxos is a family of protocols for solving consensus in a network of unreliable or fallible processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communications may experience failures.

Who invented Paxos?

The Paxos algorithm was developed by Leslie Lamport, published in his 1998 paper The Part-Time Parliament. Paxos works in three phases to make sure multiple nodes agree on the same value in spite of partial network or node failures.


2 Answers

What is an instance?

The nomenclature in Paxos is a little unintuitive.

  • An instance is the algorithm for choosing one value.
  • A round refers to a proposer's single attempt of a Phase 1 + Phase 2. A node can have multiple rounds in an instance of Paxos. A round id is globaly unique per instance across all nodes. This is sometimes called proposal number.
  • A node may take on several roles; most notably Proposer and Acceptor. In my answers, I'll assume each node takes on both roles.
  • Phase 1 is also known as the Prepare phase.
    • In Phase 1a, a Proposer sends a Prepare!(roundId) message to the Acceptors
    • In Phase 1b, the Acceptors reply with either Promise!(roundId, value) or PrepareNack!()
  • Phase 2 is also known as the Accept phase.
    • In Phase 2a, a Proposer sends an Accept!(roundId, value) message to the Acceptors
    • In Phase 2b, the Acceptors reply with either Accepted!(...) or AcceptNack!()

Assuming a cluster divided in 3 regions, each containing 3 nodes (total = 9 nodes). What happens if communication is broken between regions? There is no way any leader can reach quorum (which is 5).

Paxos requires you can get at least a quorum (5 nodes in your case). Go with your solution of three regions; having two network partitions between the three regions is very bad news. I also use a version of Paxos which can change node membership from one instance to the next. This is useful for partitions and node failure.

Isn't Paxos going to enter an infinite loop?

A naive implementation of Paxos is not guaranteed to terminate because multiple nodes can leap-frog Prepare phases. There are two ways of getting around this. One is to have a random backoff before starting new Prepare phases. The second is to route all requests to a designated leader, that acts as proposer (The leader is chosen by a Paxos instance. See also Multi-paxos)

In Phase 1b: 'If the proposal number N is larger than any previous proposal, then each >>Acceptor promises not to accept proposals less than N, and sends the value it last accepted for >>this instance to the Proposer'.

What is 'the last value it accepted'? Is it any previous proposal number from the proposer?

When a node receives an Accept!(roundId, value) message from a Proposer and it hasn't promised to not accept the value (due to a Prepare!(higherRoundId) message), it stores the value and the roundId (I'll call them acceptedValue and acceptedRoundId). It may write over these due to subsequent Accept!(...) messages.

When a node receives a Prepare!(roundId) message from a Proposer, it stores roundId as promiseRoundId = max(roundId, promiseRoundId). It then sends a Promise!(acceptedRoundId, acceptedValue) back to the Proposer. NB: if a node hasn't received an Accept!(...) message, it replies with Promise!(null, null).

In Phase 1a: Does one include the value to agree on with the Prepare message or is this deferred to the Accept! message? Or it does matter?

There is no need to send it. I don't.

In Phase 2a: 'If any of the Acceptors have already accepted a value, the leader must Choose a value with the maximum proposal number N'.

What is value here? Is it the proposal number? I believe not, but this phrase is unclear.

The value is the actual data the algorithm is reaching consensus on. I'll rephrase this to

To start the Accept Phase, The Proposer must choose a value to be accepted depending on the results of the Prepare phase. If any Acceptor replied with Promise(roundId, value), the Proposer must use the value associated with the highest roundId. Otherwise, the Proposer received only Promise(null, null), and may choose any value to send to the acceptors.

NB: Proposal number here is the same thing as roundId.

In Phase 2a: 'Otherwise, the Proposer is free to choose any value'. What does this mean? A value for what? For the proposal number?

This is the value you want to have consensus on. This is typically a state change across the distributed system, perhaps triggered by a client request.

Paxos seems to rely on an increasing value of N (proposal number) to work? Is this correct?

The wikipedia entry does not discuss the initial values a node should set before starting to participate in Paxos. What are these?

Round ids (aka proposal numbers) should be increasing and must be unique per instance across all nodes. The Paxos paper assumes you can do this because it is trivial to achieve. Here's one scheme that produces the same results on all nodes:

  1. Say there are M nodes participating in an instance of Paxos.
  2. Sort all the nodes lexicographically. index[node] is the index of a node in this sorted list.
  3. roundId = i*M + index[node] where i is the ith round this node is starting (that is i is unique per node per paxos instance, and is monotonically increasing).

Or in pseudo-code (which is clearly lacking a few major optimizations):

define runPaxos( allNodesThisPaxosInstance, myValue ) {
    allNodesThisPaxosInstance.sort()
    offset = allNodesThisPaxosInstance.indexOf( thisNode )
    for (i = 0; true; i++) {
        roundId = offset + i * allNodesThisPaxosInstance.size()
        prepareResult = doPreparePhase( roundId )
        
        if (!prepareResult.shouldContinue?)
            return

        if (prepareResult.hasAnyValue?)
           chosenValue = prepareResult.valueWithHighestRoundId
        else
            chosenValue = myValue
        
        acceptResult = doAcceptPhase( roundId, chosenValue )
        
        if (!acceptResult.shouldContinue?)
            return
    }
}
like image 157
Michael Deardeuff Avatar answered Sep 26 '22 07:09

Michael Deardeuff


I have found the following document explaining Paxos in more details. I have updated the wikipedia entry accordingly.

The answers to my question I could find are:

Isn't Paxos going to enter an infinite loop?

Paxos only works if at least a quorum of nodes can communicate with each other (in our case 5). Hence, if a node cannot communicate with at least a quorum of nodes, it should not try Paxos.

What is 'the last value it accepted'?

It is the last accepted proposition number and corresponding value.

What does 'instance' refer to exactly in this case?

It refers to the acceptor.

Does one include the value to agree on with the Prepare message or is this deferred to the Accept! message? Or it does matter?

The value is not included in the Prepare message, it is left to the Accept Request message.

What is value here? Is it the proposal number? I believe not, but this phrase is unclear.

'Otherwise, the Proposer is free to choose any value'. What does this mean? A value for what? For the proposal number?

If acceptors have already accepted a proposal from the proposer, they can return the corresponding proposal number and value, else nothing.

The second question falls, since the Wikipedia entry was misleading. One can choose an arbitrary value for a given proposal or derive it from values corresponding to proposals accepted earlier.

Paxos seems to rely on an increasing value of N (proposal number) to work? Is this correct?

Yes. A proposer p needs to number its proposals increasingly.

The wikipedia entry does not discuss the initial values a node should set before starting to participate in Paxos. What are these?

Nodes should keep their last accepted proposal number and eventually, the corresponding value too. They should persist it. When connecting for the first time, the initial proposal number for a given proposer should be null (or any equivalent).

like image 45
Jérôme Verstrynge Avatar answered Sep 22 '22 07:09

Jérôme Verstrynge