Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why use GTIDs in MySQL replication?

When it comes to database replication, what is the use of global transaction identifiers? Why do we need it to prevent concurrency across the servers? How is that prevention achieved exactly?

I tried to read the documentation at http://dev.mysql.com/doc/refman/5.7/en/replication-gtids.html but still could not understand it clearly. This may sound very basic but I would really appreciate it if someone could explain the concepts to me.

like image 663
Chong Lip Phang Avatar asked May 06 '14 04:05

Chong Lip Phang


People also ask

What is MySQL GTIDs?

A global transaction identifier (GTID) is a unique identifier created and associated with each transaction committed on the server of origin (source). This identifier is unique not only to the server on which it originated, but is unique across all servers in a given replication setup.

How do you use Gtid replication?

Step 1: Synchronize the servers. This step is only required when working with servers which are already replicating without using GTIDs. For new servers proceed to Step 3. Make the servers read-only by setting the read_only system variable to ON on each server by issuing the following: mysql> SET @@GLOBAL.

Why do we need replication in MySQL?

MySQL's transaction-based replication has a number of benefits over its traditional replication method. For example, because both a source and its replicas preserve GTIDs, if either the source or a replica encounter a transaction with a GTID that they have processed before they will skip that transaction.

How do I find my MySQL Gtid?

The GTID for a transaction is shown in the output from mysqlbinlog, and it is used to identify an individual transaction in the Performance Schema replication status tables, for example, replication_applier_status_by_worker . The value stored by the gtid_next system variable ( @@GLOBAL. gtid_next ) is a single GTID.


2 Answers

The reason for the Global Transaction ID is to allow a MySQL slave to know if it has applied a given transaction or not, to keep things in sync between Master and Slave. It can also be used for restarting a slave if a connection goes down, again to know the point in time. Without using GTIDs, replication must be controlled based on the position in a given binary transaction log file (bin log). This is much harder to manage than the GTID method.

A master is the only server that is typically written to, so that slaves merely rebuild a copy of the master by applying each transaction in sequence.

It is also important to understand that MySQL replication can run in one of 3 modes:

  • Statement-based: Each SQL statement is logged to the binlog and replicated as a statement to the slave. This can be in some cases ambiguous at the slave causing the data to not match exactly. (Most of the time it is fine for common uses).
  • Row-based: In this mode MySQL replicates the actual data changes to each table, with a "before" and "after" picture of each row, which is fully accurate. This can result in a much larger binlog, for example if you have a bulk update query, like: UPDATE t1 SET c1 = 'a' WHERE c2 = 'b'.
  • Mixed: In this mode, MySQL will use a mix of statement-based and row-based logging in the binlog.

I only mention the modes of replication, because it is mentioned in the doc you referenced that Row-based is the recommended option if you are using GTIDs.

There is another option called Master-Master replication, where you can write to two masters (each acting as a slave for the other), but this requires a special configuration to ensure that the data written to each master is unique. It is much trickier to manage than a typical Master/Slave setup.

Therefore, the prevention of writes to a Slave is something that you must ensure from your application for a typical replication process to function correctly. It is fine to read from a Slave, but you should not write to it. Note that the Slave can be behind the Master if you are using it for reads, so it is best to perform queries for things that can be behind the Master (like reports that are not critical up to the second or millisecond). You can ensure no writes to the Slave by making your common application user a read-only user for the Slave server, and a read-write user for the Master.

like image 63
dbschwartz Avatar answered Sep 24 '22 16:09

dbschwartz


Why do we need to prevent concurrency across the servers?

If I understood the question correctly, you are talking about consistency. If so, the answer is that you need keep a consistent state in a distributed system. For example, if my bank account information is replicated throughout several different servers it is fundamental that they have exactly the same € balance. Now imagine that I perform multiple money transactions (deposits/spendings) and at each one I was connected to a different server: concurrency problems would cause my account balance to be different at each server, which is unacceptable.

How is that prevention achieved exactly?

Using a master/slave approach. Amongst the servers, you have one server (the master) that is responsible for handling every writing operation, meaning that modifications to the database must be handled only by this server. The database of this master server is replicated to all other servers (the slaves), which are not allowed to modify the database but can be used to read the database (e.g. SELECT operations). Knowing that there is only one server allowed to modify the database, you do not have consistency issues.

what is the use of global transaction identifiers?

Communication between servers is asynchronous and a slave server is not required to be connected with the master at all times. Therefore, once a slave server reconnects with the master server, it may find that the master's database has been modified in the meanwhile, thus it must update its own database. The problem now is knowing amongst all modifications performed by the master server, which are the ones that the slave server already performed in a previous date and which are the ones that were not performed yet.

GTIDs address this issue: they uniquely identify each transaction performed by the master server. Now, the slave server can identify amongst all the transactions performed by the master server, which are the ones that were not seen before.

like image 36
João Matos Avatar answered Sep 22 '22 16:09

João Matos