Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

LMAX Replicator Design - How to support high availability?

LMAX Disruptor is generally implemented using the following approach: enter image description here

As in this example, Replicator is responsible for replicating the input events\commands to the slave nodes. Replicating across a set of nodes requires us to apply consensus algorithms, in case we want the system to available in the presence of network failures, master failure and slave failures.

I was thinking of applying RAFT consensus algorithm to this problem. One observation is that: "RAFT requires that the input event\commands are stored to the disk (durable storage) during replication" (Reference this link)

This observation essentially means that we cannot perform a in-memory replication. Hence it appears that we might have to combine the functionality of replicator and journaller to be able to successfully apply RAFT algorithm to LMAX.

There are two options to do this:

Option 1: Using the replicated log as input event queue enter image description here

  • The receiver would read from the network and push the event onto the replicated log instead of the ring buffer
  • A separate "reader" can read from the log and publish the events onto the ring buffer.
  • The log can be replicated across nodes using RAFT. We do not need the replicator and journaller as the functionality is already accomplished by RAFT's replicated log

I think a disadvantage of this option has got to do with fact that we do an additional data copy step (receiver to event queue instead of the ring buffer).

Option 2: Use Replicator to push input events\commands to slave's input log file enter image description here

I was wondering if there is any other solution to design of Replicator? What are the different design options that people have employed for replicators? Particularly any design that can support in-memory replication?

like image 962
coder_bro Avatar asked May 08 '14 07:05

coder_bro


1 Answers

Your intuition is correct about folding the replication and journalling into the Raft component. But, the Raft protocol dictates exactly when things need to be stored on disk.

Here are two different ways to look at it.

I'm assuming there is no is hefty computation, such as a transaction processing, before the replication because you don't have any in your diagrams.

I, personally, would do the first because it separates concerns into different processes. If I was implementing Raft for myself I would take the first half of the second scenario and put it in its own process.

External Raft Replication

In which Raft is implement by an external process.

The replication component outsources to an external Raft process the business of replication. After some time, Raft responds to the replication component that it is, in fact, replicated. The replication component updates the items in the ring buffer, and moves its published cursor forward. The business logic sees the published cursor (via waitFor) and consumes the freshly replicated data.

In this scenario, the replication component probably has a lot of inflight events, so it's read cursor is far ahead of the cursor it publishes to the business logic.

There is no need for a journalling component in this scenario because the external raft system does the journalling for you.

Note, the replication may be the slowest component of the system!

Integrated Raft Replication

In which raft is implemented in the same process as the "Real Business Logic."

In terms of Raft, replication is the business logic. Actually, you have multiple levels of business logic, or equivalently, multiple stages of business logic.

I'm going to use two input disruptors and two output disruptors for this to emphasize the separate business logic. You can combine, split, or rearrange to your heart's content. Or your profiler's content.

The first stage, as I mentioned, is Raft replication. Client events go into the Replication Input Disruptor. The Raft logic picks it up, perhaps in batches, and sends out to the Followers on the Replication Output Disruptor. All Raft messages also go into the Replication Input Disruptor. The Raft logic also picks these up and sends the appropriate responses to the appropriate Followers/Master on the Replication Output Disruptor).

A journaller component hangs off the Input Ring Buffer; it only has to handle certain types of messages as dictated by Raft. This will likely be the slowest part of the system.

When the data is considered replicated, it is moved to the second stage, via the "Real Business Logic" Input Disruptor. There it is processed, sent to the Client Outbound Disruptor, and then sent to one of your millions of happy paying customers.

like image 122
Michael Deardeuff Avatar answered Oct 14 '22 13:10

Michael Deardeuff