Kafka Connect JDBC vs Debezium CDC

Tags:

What are the differences between JDBC Connector and Debezium SQL Server CDC Connector (or any other relational database connector) and when should I choose one over another, searching for a solution to sync between two relational databases?

Not sure if this discussion should be about CDC vs JDBC Connector, and not Debezium SQL Server CDC Connector, or even just Debezium, looking forward for later editing, depends on the given answers (Though my case is about SQL Server sink).

Sharing with you my research about this topic which led me to the question (as an answer)

626

asked Mar 29 '20 00:03

Ofek Hod

2 Answers

This explanation focuses on the differences between Debezium SQL Server CDC Connector and JDBC Connector, with more general interpretation about Debezium and CDC.

tl;dr- scroll down :)

enter image description here

Debezium

Debezium is used only as a source connector, records all row-level changes.
Debezium Documentation says:

Debezium is a set of distributed services to capture changes in your databases so that your applications can see those changes and respond to them. Debezium records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.

Debezium Connector for SQL Server first records a snapshot of the database and then sending records of row-level changes to Kafka, each table to different Kafka topic.
Debezium Connector for SQL Server Documentation says:

Debezium’s SQL Server Connector can monitor and record the row-level changes in the schemas of a SQL Server database.

The first time it connects to a SQL Server database/cluster, it reads a consistent snapshot of all of the schemas. When that snapshot is complete, the connector continuously streams the changes that were committed to SQL Server and generates corresponding insert, update and delete events. All of the events for each table are recorded in a separate Kafka topic, where they can be easily consumed by applications and services.

enter image description here

Kafka Connect JDBC

Kafka Connect JDBC can be used either as a source or a sink connector to Kafka, supports any database with JDBC driver.
JDBC Connector Documentation says:

You can use the Kafka Connect JDBC source connector to import data from any relational database with a JDBC driver into Apache Kafka® topics. You can use the JDBC sink connector to export data from Kafka topics to any relational database with a JDBC driver. The JDBC connector supports a wide variety of databases without requiring custom code for each one.

They have some specifications about installing on Microsoft SQL Server which I find non relevant for this discussion.

So if JDBC Connector supports both source and sink and Debezium supports only source (not sink), we understand that in order to write data from Kafka to databases with a JDBC driver (sink), the JDBC Connector is the way to go (including SQL Server).

Now the comparison should be narrowed only to the sources field.
JDBC Source Connector Documentation doesn't say much more at first sight:

Data is loaded by periodically executing a SQL query and creating an output record for each row in the result set. By default, all tables in a database are copied, each to its own output topic. The database is monitored for new or deleted tables and adapts automatically. When copying data from a table, the connector can load only new or modified rows by specifying which columns should be used to detect new or modified data.

Searching a little further in order to understand their differences, in this Debezium blog which uses Debezium MySQL Connector as a source and JDBC Connector as a sink, there is an explanation about the differences between the two, which generally telling us that Debezium provides records with more information about the database changes, while JDBC Connector provides records which are more focused about converting the database changes into simple insert/upsert commands:

The Debezium MySQL Connector was designed to specifically capture database changes and provide as much information as possible about those events beyond just the new state of each row. Meanwhile, the Confluent JDBC Sink Connector was designed to simply convert each message into a database insert/upsert based upon the structure of the message. So, the two connectors have different structures for the messages, but they also use different topic naming conventions and behavior of representing deleted records.

Moreover, they have different topic naming and different delete methods:

Debezium uses fully qualified naming for target topics representing each table it manages. The naming follows the pattern [logical-name].[database-name].[table-name]. Kafka Connect JDBC Connector works with simple names [table-name].

...

When the Debezium connector detects a row is deleted, it creates two event messages: a delete event and a tombstone message. The delete message has an envelope with the state of the deleted row in the before field, and an after field that is null. The tombstone message contains same key as the delete message, but the entire message value is null, and Kafka’s log compaction utilizes this to know that it can remove any earlier messages with the same key. A number of sink connectors, including the Confluent’s JDBC Sink Connector, are not expecting these messages and will instead fail if they see either kind of message.

This Confluent blog explains more how CDC and JDBC Connector works, it (JDBC Connector) executing queries to the source database every fixed interval, which is not very scalable solution, while CDC has higher frequency, streaming from the database transaction log:

The connector works by executing a query, over JDBC, against the source database. It does this to pull in all rows (bulk) or those that changed since previously (incremental). This query is executed at the interval defined in poll.interval.ms. Depending on the volumes of data involved, the physical database design (indexing, etc.), and other workload on the database, this may not prove to be the most scalable option.

...

Done properly, CDC basically enables you to stream every single event from a database into Kafka. Broadly put, relational databases use a transaction log (also called a binlog or redo log depending on DB flavour), to which every event in the database is written. Update a row, insert a row, delete a row – it all goes to the database’s transaction log. CDC tools generally work by utilising this transaction log to extract at very low latency and low impact the events that are occurring on the database (or a schema/table within it).

This blog also states the differences between CDC and JDBC Connector, mainly says that JDBC Connector doesn't support syncing deleted records thus fits for prototyping, and CDC fits for more mature systems:

The JDBC Connector cannot fetch deleted rows. Because, how do you query for data that doesn’t exist?

...

My general steer on CDC vs JDBC is that JDBC is great for prototyping, and fine low-volume workloads. Things to consider if using the JDBC connector:

Doesn’t give true CDC (capture delete records, want before/after record versions) Latency in detecting new events Impact of polling the source database continually (and balancing this with the desired latency) Unless you’re doing a bulk pull from a table, you need to have an ID and/or timestamp that you can use to spot new records. If you don’t own the schema, this becomes a problem.

tl;dr Conclusion

The main differences between Debezium and JDBC Connector are:

Debezium is used only as a Kafka source and JDBC Connector can be used as Kafka source and sink.

For sources:

JDBC Connector doesn't support syncing deleted records, while Debezium does.
JDBC Connector queries the database every fixed interval, which is not very scalable solution, while CDC has higher frequency, streaming from the database transaction log.
Debezium provides records with more information about the database changes, and JDBC Connector provides records which are more focused about converting the database changes into simple insert/upsert commands.
Different topic naming.

119

answered Sep 28 '22 09:09

Ofek Hod

Simply we can say, CDC is kind of log based streaming similarly kafka connect jdbc source connectors are query based streaming.. :)

answered Sep 28 '22 10:09

saravanan

Related questions
                            
                                Limit amount of records retrieved when using Doctrine DQL in Symfony2
                            
                                Using 'case expression column' in where clause
                            
                                How does one escape an apostrophe in db2 sql
                            
                                SELECT inside a COUNT
                            
                                SQL Current month/ year question
                            
                                How do I add a column to large sql server table
                            
                                SQL Server 2005 Unique constraint on two columns
                            
                                MySQL - DATE_ADD month interval
                            
                                The "X" property on "Y" could not be set to a 'null' value. You must set this property to a non-null value of type 'Int32'
                            
                                How can I get the month number (not month name) from a date in SQL Server?
                            
                                Is there an Oracle SQL query that aggregates multiple rows into one row? [duplicate]
                            
                                Selecting entries by date - >= NOW(), MySQL
                            
                                UNIX_TIMESTAMP in SQL Server
                            
                                Selecting most recent and specific version in each group of records, for multiple groups
                            
                                MySQL select formatted date from millisecond field
                            
                                How can I remove a unique constraint from a database column in Rails?
                            
                                ORA-00942: table or view does not exist (works when a separate sql, but does not work inside a oracle function)
                            
                                In Oracle SQL Devleoper, when I copy the results, how can I copy the column headings too?
                            
                                Removing duplicate rows (based on values from multiple columns) from SQL table
                            
                                Get data type of field in select statement in ORACLE

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Kafka Connect JDBC vs Debezium CDC

Tags:

sql

jdbc

cdc

apache-kafka-connect

debezium

Ofek Hod

People also ask

2 Answers

tl;dr- scroll down :)

Debezium

Kafka Connect JDBC

tl;dr Conclusion

Ofek Hod

saravanan

Recent Activity

Donate For Us