 

What should I use: Kafka Streams, the Kafka consumer API, or Kafka Connect?

I would like to know which would be best for me: Kafka Streams, the Kafka consumer API, or Kafka Connect?

I want to read data from a topic, do some processing, and write the results to a database. I have written consumers for this, but I feel I could instead write a Kafka Streams application and use its stateful processors to perform the transformations, which would eliminate my consumer code and leave only the database-writing code.

The databases I want to insert my records into are: HDFS (raw JSON) and MSSQL (processed JSON).

Another option is Kafka Connect, but I have found that the HDFS sink and JDBC sink connectors have no JSON support as of now (I don't want to write Avro), and creating a schema is also painful for complex nested messages.

Or should I write a custom Kafka Connect connector to do this?

So I need your opinion on whether I should write a Kafka consumer, a Kafka Streams application, or a Kafka Connect connector, and which will be better in terms of performance and have less overhead.

asked Sep 04 '17 by Nandish Kotadia

People also ask

What is difference between Kafka connect and Kafka streams?

Kafka Streams is an API for writing client applications that transform data in Apache Kafka. You usually do this by publishing the transformed data onto a new topic. The data processing itself happens within your client application, not on a Kafka broker. Kafka Connect is an API for moving data into and out of Kafka.

When should you not use Kafka streams?

Kafka is not a deterministic system, so safety-critical applications cannot use it: for example, a car engine control system, a medical system such as a heart pacemaker, or an industrial process controller.

Is a Kafka consumer an API?

The Kafka Producer API allows applications to send streams of data to the Kafka cluster. The Kafka Consumer API allows applications to read streams of data from the cluster.


1 Answer

You can use a combination of them all.

I have tried the HDFS sink for JSON, but was not able to use org.apache.kafka.connect.json.JsonConverter

Not clear why not, but I would assume you forgot to set value.converter.schemas.enable=false.
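For reference, a sketch of the converter settings that let plain (schemaless) JSON pass through Connect; these go in the worker properties or, prefixed as below, in the connector config:

```properties
# Converter settings for schemaless JSON.
# With schemas.enable=false the converter reads/writes plain JSON
# instead of the {"schema": ..., "payload": ...} envelope.
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
```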

When I set org.apache.kafka.connect.storage.StringConverter it works, but it writes the JSON object in string-escaped format. For example, {"name":"hello"} is written into HDFS as "{\"name\":\"hello\"}"

Yes, it will string-escape the JSON.

The processing I want to do is basic validation and a few field value transformations

Kafka Streams or the Consumer API is capable of validation. Connect is capable of Simple Message Transforms (SMT).
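As an illustration, an SMT can do light per-record changes directly in a connector config; the field names below are hypothetical:

```properties
# Hypothetical sink connector snippet: rename the field "fullname"
# to "name" in record values before they reach the sink.
transforms=rename
transforms.rename.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.rename.renames=fullname:name
```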


In some use cases, you need to "duplicate" data within Kafka: process your "raw" topic by reading it with a consumer, produce the result back into a "cleaned" topic, and from there use Kafka Connect to write to a database or filesystem.
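A minimal sketch of the validate-and-transform step in that pipeline. The Kafka wiring itself (KafkaConsumer/KafkaProducer or a Streams topology) needs a running broker and the kafka-clients dependency, so it appears only as comments; the class and method names here are hypothetical, and a real application would use a JSON library rather than string manipulation:

```java
// Sketch of the consume -> validate/transform -> produce step.
// In the real pipeline, cleanRecord would be called on each value read
// from the "raw" topic, and its result produced to the "cleaned" topic.
public class RecordCleaner {

    // Basic validation: a record must contain a "name" field.
    static boolean isValid(String json) {
        return json != null && json.contains("\"name\"");
    }

    // Simple field transformation: append a processed flag to the JSON
    // object. (Naive string handling for illustration only.)
    static String cleanRecord(String json) {
        if (!isValid(json)) {
            throw new IllegalArgumentException("missing required field: name");
        }
        int end = json.lastIndexOf('}');
        return json.substring(0, end) + ",\"processed\":true}";
    }

    public static void main(String[] args) {
        // Stand-in for a value polled from the "raw" topic.
        System.out.println(cleanRecord("{\"name\":\"hello\"}"));
        // prints {"name":"hello","processed":true}
    }
}
```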

answered Nov 07 '22 by OneCricketeer