Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Kafka: Mirroring vs. Replication

Tags:

Mirroring is replicating data between Kafka cluster, while Replication is for replicating nodes within a Kafka cluster.

Is there any specific use of Replication, if Mirroring has already been setup?

like image 627
dev Avatar asked Apr 15 '16 07:04

dev


1 Answers

They are used for different use cases. Let's try to clarify.

As described in the documentation,

The purpose of adding replication in Kafka is for stronger durability and higher availability. We want to guarantee that any successfully published message will not be lost and can be consumed, even when there are server failures. Such failures can be caused by machine error, program error, or more commonly, software upgrades. We have the following high-level goals:

Inside a cluster there might be network partitions (a single server fails, and so forth), therefore we want to provide replication between the nodes. Given a setup of three nodes and one cluster, if server1 fails, there are two replicas Kafka can choose from. Same cluster implies same response times (ok, it also depends on how these servers are configured, sure, but in a normal scenario they should not differ so much).

Mirroring, on the other hand, seems to be very valuable, for example, when you are migrating a data center, or when you have multiple data centers (e.g., AWS in the US and AWS in Ireland). Of course, these are just a couple of use cases. So what you do here is to give applications belonging to the same data center a faster and better way to access data - data locality in some contexts is everything.

If you have one node in each cluster, in case of failure, you might have way higher response times to go, let's say, from AWS located in Ireland to AWS in the US.

You might claim that in order to achieve data locality (services in cluster one read from kafka in cluster one) one still needs to copy the data from one cluster to the other. That's definitely true, but the advantages you might get with mirroring could be higher than those you would get by reading directly (via an SSH tunnel?) from Kafka located in another data center, for example single connections down, clients connection/session times longer (depending on the location of the data center), legislation (some data can be collected in a country while some other data shouldn't).

Replication is the basis of higher availability. You shouldn't use Mirroring to handle high availability in a context where data locality matters. At the same time, you should not use just Replication where you need to duplicate data across data centers (I don't even know if you can without Mirroring/an ssh tunnel).

like image 64
Markon Avatar answered Sep 28 '22 03:09

Markon