Avoiding overuse of consensus protocols in a distributed system

Tags:

distributed-computing

paxos

I'm new to distributed systems, and I'm reading about "simple Paxos". It creates a lot of chatter and I'm thinking about performance implications.

Let's say you're building a globally-distributed database, with several small-ish clusters located in different locations. It seems important to minimize the amount of cross-site communication.

What are the decisions you definitely need to use consensus for? The only one I thought of for sure was deciding whether to add or remove a node (or set of nodes?) from the network. It seems like this is necessary for vector clocks to work. Another I was less sure about was deciding on an ordering for writes to the same location, but should this be done by a leader which is elected via Paxos?

It would be nice to avoid having all nodes in the system making decisions together. Could a few nodes at each local cluster participate in cross-cluster decisions, and all local nodes communicate using a local Paxos to determine local answers to cross-site questions? The latency would be the same assuming the network is not saturated, but the cross-site network traffic would be much lighter.

Let's say you can split your database's tables along rows, and assign each subset of rows to a subset of nodes. Is it normal to elect a set of nodes to contain each subset of the data using Paxos across all machines in the system, and then only run Paxos between those nodes for all operations dealing with that subset of data?

And a catch-all: are there any other design-related or algorithmic optimizations people are doing to address this?

asked Apr 30 '13 19:04

Dan

1 Answer

Good questions, and good insights!

It creates a lot of chatter and I'm thinking about performance implications.

Let's say you're building a globally-distributed database, with several small-ish clusters located in different locations. It seems important to minimize the amount of cross-site communication.

What are the decisions you definitely need to use consensus for? The only one I thought of for sure was deciding whether to add or remove a node (or set of nodes?) from the network. It seems like this is necessary for vector clocks to work. Another I was less sure about was deciding on an ordering for writes to the same location, but should this be done by a leader which is elected via Paxos?

Yes, performance is a problem that my team has seen in practice as well. We maintain a consistent database and distributed lock manager, and originally used Paxos for all writes, some reads, and cluster membership updates.

Here are some of the optimizations we did:

As much as possible, nodes sent the transitions to a Distinguished Proposer/Learner (elected via Paxos), which
- decided on write ordering, and
- batched transitions while waiting for the response from the prior instance. (But batching too much also caused problems.)
We considered using Multi-Paxos, but we ended up doing something cooler (see below).
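To make the batching idea concrete, here is a minimal sketch, not our actual implementation. It assumes a single distinguished proposer that allows one consensus instance in flight at a time; the class `BatchingProposer`, the `max_batch` cap, and the `on_decided` callback are hypothetical names, and the consensus round itself is stubbed out.

```python
from collections import deque


class BatchingProposer:
    """Toy distinguished proposer: orders writes and batches them per instance.

    Hypothetical sketch; the real consensus round is replaced by on_decided().
    """

    def __init__(self, max_batch=3):
        self.pending = deque()      # transitions waiting for an instance
        self.in_flight = None       # (instance, batch) being decided, if any
        self.next_instance = 0      # number of the next consensus instance
        self.log = []               # decided (instance, batch) pairs, in order
        self.max_batch = max_batch  # cap, since batching too much also hurts

    def submit(self, transition):
        """Queue a transition; propose only if no instance is in flight."""
        self.pending.append(transition)
        self._maybe_propose()

    def _maybe_propose(self):
        if self.in_flight is not None or not self.pending:
            return
        size = min(self.max_batch, len(self.pending))
        batch = [self.pending.popleft() for _ in range(size)]
        self.in_flight = (self.next_instance, batch)
        self.next_instance += 1

    def on_decided(self):
        """Called when the response for the in-flight instance arrives."""
        if self.in_flight is None:
            return
        self.log.append(self.in_flight)
        self.in_flight = None
        self._maybe_propose()       # everything queued meanwhile goes out together


proposer = BatchingProposer(max_batch=3)
for t in ["w1", "w2", "w3", "w4", "w5"]:
    proposer.submit(t)       # w1 goes out alone; the rest queue behind it
proposer.on_decided()        # instance 0 decided; w2..w4 go out as one batch
proposer.on_decided()        # instance 1 decided; w5 goes out
proposer.on_decided()        # instance 2 decided
print(proposer.log)
```

The proposer's queue order becomes the write order, and the number of consensus rounds grows with the number of batches rather than the number of writes.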

With these optimizations, we were still hurting for performance, so we split our server into three layers. The bottom layer is Paxos; it does what you suggest, viz. it merely decides the node membership of the middle layer. The middle layer is a custom, in-house, high-speed chain consensus protocol, which does consensus and ordering for the DB. (BTW, chain consensus can be viewed as Vertical Paxos.) The top layer now just maintains the database/locks and client connections. This design has led to several orders of magnitude improvement in latency and throughput.
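The split between the bottom two layers can be sketched roughly as follows. This is only an illustration under stated assumptions, not our in-house protocol: it uses textbook-style chain replication as a stand-in for the middle layer, and `MembershipService`, `Chain`, and the epoch check are hypothetical names. The membership layer is a stub where a real system would run Paxos.

```python
class MembershipService:
    """Stand-in for the bottom layer; a real one would decide each config via Paxos."""

    def __init__(self, nodes):
        self.epoch = 0
        self.chain = list(nodes)

    def remove(self, node):
        # Every membership change produces a new epoch, so stale configs are detectable.
        self.chain = [n for n in self.chain if n != node]
        self.epoch += 1
        return self.epoch, list(self.chain)


class Chain:
    """Middle layer stand-in: writes flow head to tail and commit at the tail."""

    def __init__(self, membership):
        self.epoch = membership.epoch
        self.order = list(membership.chain)
        self.logs = {n: [] for n in self.order}

    def reconfigure(self, epoch, order):
        self.epoch, self.order = epoch, list(order)

    def write(self, epoch, value):
        if epoch != self.epoch:
            raise ValueError("stale configuration epoch")
        for node in self.order:            # the head fixes the order, the rest follow
            self.logs[node].append(value)
        return len(self.logs[self.order[-1]]) - 1   # index committed at the tail

    def read_committed(self):
        return list(self.logs[self.order[-1]])      # the tail holds only committed writes


membership = MembershipService(["a", "b", "c"])
chain = Chain(membership)
chain.write(0, "x=1")                       # fast path: no Paxos round involved
epoch, order = membership.remove("c")       # slow path: membership change
chain.reconfigure(epoch, order)
chain.write(epoch, "x=2")
print(chain.read_committed())
```

The point of the sketch is that the expensive consensus step runs only when membership changes, while ordinary writes take the cheaper chain path and are rejected if they carry an outdated epoch.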

It would be nice to avoid having all nodes in the system making decisions together. Could a few nodes at each local cluster participate in cross-cluster decisions, and all local nodes communicate using a local Paxos to determine local answers to cross-site questions? The latency would be the same assuming the network is not saturated, but the cross-site network traffic would be much lighter.

Let's say you can split your database's tables along rows, and assign each subset of rows to a subset of nodes. Is it normal to elect a set of nodes to contain each subset of the data using Paxos across all machines in the system, and then only run Paxos between those nodes for all operations dealing with that subset of data?

These two together remind me of the Google Spanner paper. If you skip over the parts about time, it's essentially doing 2PC globally and Paxos on the shards. (IIRC.)
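A rough sketch of that shape, assuming each shard is its own small replica group and a coordinator runs two-phase commit across the groups. Nothing here is Spanner's actual API: `ShardGroup`, `shard_for`, and `two_phase_commit` are hypothetical names, and a simple majority check stands in for a real Paxos round inside each group.

```python
class ShardGroup:
    """One shard replicated on a few nodes; a majority check stands in for Paxos."""

    def __init__(self, name, replicas=3):
        self.name = name
        self.up = [True] * replicas   # which replicas are currently reachable
        self.log = []                 # replicated decisions for this shard
        self.data = {}                # committed key/value pairs
        self.staged = {}              # txid -> writes prepared but not committed

    def prepare(self, txid, writes):
        # A real group would run a Paxos instance to log the prepare record.
        if sum(self.up) * 2 <= len(self.up):
            return False              # no majority, so this shard votes "no"
        self.log.append(("prepare", txid))
        self.staged[txid] = dict(writes)
        return True

    def finish(self, txid, commit):
        writes = self.staged.pop(txid, None)
        if writes is None:
            return                    # this shard never prepared the transaction
        # Assumption: the outcome is logged directly; a real system would retry.
        self.log.append(("commit" if commit else "abort", txid))
        if commit:
            self.data.update(writes)


def shard_for(key, shards):
    # Hypothetical placement rule: the key's first character picks the shard.
    return shards[ord(key[0]) % len(shards)]


def two_phase_commit(txid, writes, shards):
    by_shard = {}
    for key, value in writes.items():
        by_shard.setdefault(shard_for(key, shards), {})[key] = value
    votes = [group.prepare(txid, w) for group, w in by_shard.items()]
    commit = all(votes)               # phase 2 commits only on unanimous "yes"
    for group in by_shard:
        group.finish(txid, commit)
    return commit


shards = [ShardGroup("s0"), ShardGroup("s1")]
first = two_phase_commit("t1", {"a": 1, "b": 2}, shards)
shards[0].up = [True, False, False]   # s0 loses its majority
second = two_phase_commit("t2", {"a": 10, "b": 20}, shards)
print(first, second, shards[0].data, shards[1].data)
```

Only the shards touched by a transaction take part in it, so cross-site traffic scales with the transaction's footprint rather than with the size of the whole system.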

answered Sep 25 '22 07:09

Michael Deardeuff
