Leader election for a Paxos-based replicated key-value store

I am going to implement a key-value store with Multi-Paxos. I would have several nodes, one of which is the primary node. This primary node receives update requests and replicates values to the slave nodes.

My question is: how is the primary node (or leader) selected? Can I still use the Paxos algorithm? If so, do you think it is necessary to abstract the Paxos implementation into a single unit that could be used not only by the replication unit but also by the leader election unit?

If I use the node with the lowest ID as the leader, how can I implement the master lease?

Thanks for any answers.

504

asked Mar 25 '14 02:03

yuefengz

1 Answer

Before I get to the actual question, I would suggest that for a Paxos-like system, you don't think of it as a master-slave relationship, but rather as an equal-peer relationship. Basic Paxos doesn't even have a leader concept. Multi-Paxos tacks on a leader as a performance optimization; electing that leader is part of the protocol.

Multi-Paxos boils down to Paxos underneath: there is a prepare phase and an accept phase. The insight of Multi-Paxos is that once a node wins an accept round, it has simultaneously won leader election. After that, the leader can skip the prepare phase until it detects that another node has taken over leadership.
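To make the "skip the prepare phase" idea concrete, here is a minimal single-process sketch. It is an illustration under stated assumptions, not a production implementation: the names `Acceptor`, `Proposer`, `become_leader`, and `propose` are hypothetical, acceptors are called directly instead of over a network, and there is no persistence, no learner role, and no handling of gaps in the log.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

Ballot = Tuple[int, int]  # (round, node id); tuples compare lexicographically


@dataclass
class Acceptor:
    promised: Ballot = (0, 0)
    # slot -> (ballot, value) this acceptor has accepted so far
    accepted: Dict[int, Tuple[Ballot, Any]] = field(default_factory=dict)

    def on_prepare(self, ballot: Ballot) -> Optional[Dict[int, Tuple[Ballot, Any]]]:
        """Prepare phase: promise to ignore lower ballots, report prior accepts."""
        if ballot > self.promised:
            self.promised = ballot
            return dict(self.accepted)
        return None  # rejected: an equal or higher ballot was already promised

    def on_accept(self, ballot: Ballot, slot: int, value: Any) -> bool:
        """Accept phase: accept unless a higher ballot has been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted[slot] = (ballot, value)
            return True
        return False


class Proposer:
    def __init__(self, node_id: int, acceptors: List[Acceptor]) -> None:
        self.node_id = node_id
        self.acceptors = acceptors
        self.ballot: Ballot = (0, node_id)
        self.is_leader = False
        self.next_slot = 0
        self.prepare_rounds = 0  # counted only to show how rarely prepare runs

    def _is_majority(self, count: int) -> bool:
        return count > len(self.acceptors) // 2

    def become_leader(self) -> bool:
        """Run the prepare phase once, covering every slot."""
        self.ballot = (self.ballot[0] + 1, self.node_id)
        self.prepare_rounds += 1
        replies = [a.on_prepare(self.ballot) for a in self.acceptors]
        promises = [r for r in replies if r is not None]
        if not self._is_majority(len(promises)):
            self.is_leader = False
            return False
        # Re-propose, under our ballot, the highest-ballot value seen per slot.
        recovered: Dict[int, Tuple[Ballot, Any]] = {}
        for promise in promises:
            for slot, (ballot, value) in promise.items():
                if slot not in recovered or ballot > recovered[slot][0]:
                    recovered[slot] = (ballot, value)
        for slot, (_, value) in recovered.items():
            for acceptor in self.acceptors:
                acceptor.on_accept(self.ballot, slot, value)
        self.next_slot = max(recovered, default=-1) + 1
        self.is_leader = True
        return True

    def propose(self, value: Any) -> Optional[int]:
        """Return the slot the value was accepted in, or None on failure."""
        if not self.is_leader and not self.become_leader():
            return None
        slot = self.next_slot
        acks = sum(a.on_accept(self.ballot, slot, value) for a in self.acceptors)
        if self._is_majority(acks):
            self.next_slot += 1
            return slot
        self.is_leader = False  # a higher ballot exists: someone else took over
        return None
```

In this sketch a stable leader runs `become_leader` once and then only the accept phase for every later `propose` call. A rejected accept is how it notices that another node has taken over, and the next `propose` falls back to a fresh prepare with a higher ballot.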

And now some practical advice. I have many years of experience working on several Paxos, Multi-Paxos, and other consensus systems.

I first suggest not implementing either Paxos or Multi-Paxos. Optimizing Paxos systems for performance while keeping them correct is very hard, especially if you are having these types of questions. I would instead look into implementing the Raft protocol.

Taking both protocols straight off the paper, Raft can have much better throughput than Multi-Paxos. The Raft authors (and others) suggest that Raft is easier to understand and implement.

You may also look into using one of the open-source Raft systems. I don't have experience with any of them, so I can't tell you how easy they are to maintain. I have heard, though, of pain in maintaining ZooKeeper instances. (I have also heard complaints about ZooKeeper's correctness proof.)

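Next, it has been proven (the FLP impossibility result) that in a fully asynchronous system no deterministic consensus protocol can guarantee termination: there is always some schedule under which it loops forever. Build a time-out mechanism into your system, and use randomized backoffs where appropriate. This is how practical engineers get around theoretical impossibilities.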
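One common way to do the randomized backoff is exponential backoff with full jitter, so that two nodes that collide once are unlikely to keep colliding. The sketch below is only an illustration: `backoff_delays` and `run_with_backoff` are hypothetical helper names, the default timings are arbitrary, and `try_round` stands in for whatever function attempts one election or proposal round (and is assumed to enforce its own time-out).

```python
import random
import time
from typing import Callable, Iterator, Optional


def backoff_delays(base: float = 0.05, cap: float = 2.0, attempts: int = 6,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one random delay per attempt; the upper bound doubles up to `cap`."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng.uniform(0, ceiling)  # "full jitter": anywhere in [0, ceiling]


def run_with_backoff(try_round: Callable[[], bool], base: float = 0.05,
                     cap: float = 2.0, attempts: int = 6,
                     sleep: Callable[[float], None] = time.sleep,
                     rng: Optional[random.Random] = None) -> bool:
    """Call try_round until it succeeds, sleeping a random delay after each failure."""
    for delay in backoff_delays(base, cap, attempts, rng):
        if try_round():
            return True
        sleep(delay)
    return False  # gave up after `attempts` failed rounds
```

Passing `sleep` and `rng` as parameters is a design choice for testability: a test can record the delays instead of actually waiting.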

Finally, examine your throughput needs. If your throughput is high enough, you will need to figure out how to partition your data across several consensus clusters. And that's a whole 'nother ball of wax.
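As a rough picture of what partitioning can look like at its simplest, the sketch below routes each key to one of several independent consensus clusters by hashing it. This is an assumption-laden illustration, not a recommendation: `shard_for_key`, `ShardRouter`, and the cluster names are hypothetical, and plain modulo hashing remaps most keys whenever the number of clusters changes, which is one of the reasons this is a ball of wax.

```python
import hashlib
from typing import List


def shard_for_key(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a hash that is stable across processes."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardRouter:
    """Routes each key to the consensus cluster that owns its shard."""

    def __init__(self, clusters: List[str]) -> None:
        if not clusters:
            raise ValueError("at least one cluster is required")
        self.clusters = list(clusters)

    def cluster_for(self, key: str) -> str:
        return self.clusters[shard_for_key(key, len(self.clusters))]


# Hypothetical cluster names, for illustration only.
router = ShardRouter(["cluster-a", "cluster-b", "cluster-c"])
owner = router.cluster_for("user:42")  # every client computes the same owner
```

`hashlib` is used instead of the built-in `hash()` because Python randomizes string hashes per process, which would send the same key to different clusters from different clients.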

139

answered Nov 11 '22 01:11

Michael Deardeuff

Tags:

key-value-store

distributed-system

paxos
