
Kafka JDBC connector load all data, then incremental

I am trying to figure out how to fetch all data from a query initially, and then only incremental changes, using the Kafka Connect JDBC connector. The reason for this is that I want to load all the data into Elasticsearch and then keep ES in sync with my Kafka streams. Currently I do this by first running the connector with mode = bulk, then changing it to timestamp. This works fine.
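For reference, this is roughly what the two configurations look like in a standalone connector properties file; the connection URL, table name, and last_updated column are placeholders for whatever your schema actually uses:

    name=jdbc-source-orders
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:postgresql://db:5432/mydb
    table.whitelist=orders
    topic.prefix=db-

    # initial full load: every poll re-reads the whole table
    mode=bulk

    # after the full load, switch to incremental polling on a timestamp column:
    #mode=timestamp
    #timestamp.column.name=last_updated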

However, if we ever want to reload all the data into the streams and ES, it means writing scripts that somehow clean or delete the Kafka streams and ES index data, modifying the Connect ini files to set mode back to bulk, restarting everything, giving it time to load all that data, then modifying the configs again to timestamp mode and restarting everything once more. (The reason such a script is needed is that occasionally bulk updates happen to correct historic data through an ETL process we do not yet have control over, and that process does not update timestamps.)
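For context, that manual cycle looks roughly like this (the Elasticsearch index, topic name, and hosts are placeholders):

    # stop the Connect worker(s), then clear the downstream data
    curl -X DELETE "http://localhost:9200/orders"                               # drop the ES index
    kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic db-orders

    # switch the connector properties to mode=bulk and restart Connect;
    # once the full load finishes, switch back to mode=timestamp and restart again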

Is anyone doing something similar, and have you found a more elegant solution?

asked May 04 '17 by mike01010
People also ask

How does Kafka JDBC connector work?

The JDBC connector gives you the option to stream into Kafka just the rows from a table that have changed in the period since it was last polled. It can do this based on an incrementing column (e.g., an incrementing primary key) and/or a timestamp column (e.g., a last-updated timestamp).
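In a source connector config, that combined mode looks roughly like this (the column names are placeholders):

    mode=timestamp+incrementing
    timestamp.column.name=last_updated
    incrementing.column.name=id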

What is Kafka connect offset?

Kafka Connect in distributed mode uses Kafka itself to persist the offsets of any source connectors. This is a great way to do things as it means that you can easily add more workers, rebuild existing ones, etc without having to worry about where the state is persisted.
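In the distributed worker properties those internal topics are named explicitly; a minimal sketch (the topic names are whatever you choose):

    bootstrap.servers=localhost:9092
    group.id=connect-cluster
    offset.storage.topic=connect-offsets
    config.storage.topic=connect-configs
    status.storage.topic=connect-status
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter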

What is the difference between Kafka and Kafka connect?

Kafka Streams is an API for writing client applications that transform data in Apache Kafka. You usually do this by publishing the transformed data onto a new topic. The data processing itself happens within your client application, not on a Kafka broker. Kafka Connect is an API for moving data into and out of Kafka.

What is sink connector in Kafka?

The Kafka Connect JDBC Sink connector allows you to export data from Apache Kafka® topics to any relational database with a JDBC driver. This connector can support a wide variety of databases. The connector polls data from Kafka to write to the database based on the topics subscription.
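A minimal sink configuration looks roughly like this (the topic, connection URL, and primary-key handling are placeholders to adapt):

    name=jdbc-sink-orders
    connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
    topics=db-orders
    connection.url=jdbc:postgresql://db:5432/reporting
    insert.mode=upsert
    pk.mode=record_key
    auto.create=true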


1 Answer

Coming back to this after a long time. The way I was able to solve this, and never have to use bulk mode (a rough command sketch follows the steps):

  1. Stop the connectors.
  2. Wipe the offset files for each connector JVM.
  3. (Optional) If you want to do a complete wipe and load, you probably also want to delete your topics using the Kafka/Connect utils or the REST API (and don't forget the state topics).
  4. Restart Connect.
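A rough sketch of those steps for standalone workers; the offset file path and topic name are placeholders (in distributed mode the offsets live in the connect-offsets topic instead, so you would clear the relevant keys there rather than deleting a file):

    # 1. stop the standalone Connect worker(s)

    # 2. wipe the offset file each worker was configured with
    #    (the path comes from offset.storage.file.filename in the worker config)
    rm /tmp/connect.offsets

    # 3. (optional) for a complete wipe and reload, delete the data topics too
    kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic db-orders

    # 4. restart Connect; the source connector starts with no stored offsets and re-reads everything
    connect-standalone.sh worker.properties jdbc-source-orders.properties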
answered Oct 08 '22 by mike01010