
Kafka vs StreamSets

I have been reading articles about Kafka and StreamSets, and my understanding is:

  1. Kafka acts as a broker between producer systems and subscribers. Producers push data into the Kafka cluster, and subscribers pull data from Kafka.

  2. StreamSets is a technology for moving data from a source to a destination through a pipeline.

My questions are below; please help clarify:

  1. What is the fundamental difference between Kafka and StreamSets? Is it that Kafka doesn't move data but StreamSets does?

  2. If Kafka doesn't move the data, what is Kafka used for? If it moves data like ETL solutions, how is it different from SSIS, Informatica, etc.?

  3. How is StreamSets different from SSIS, Informatica, etc.?

NikRED asked Jun 02 '19 14:06



2 Answers

  1. In StreamSets, most of the time we create "data pipelines". Think of a pipeline as an application made up of multiple steps/tasks: the first step can read data from a database, Kafka, or any number of other data sources; the second can modify the data; the third can run a script; and so on. Finally, the pipeline saves the transformed data to a destination, which could be a database or cloud storage. So Kafka and StreamSets can work together: StreamSets can read data from and write data to Kafka.

  2. I think of Kafka as a place where data from multiple sources is collected and made available to consumers for a certain time. For example, Kafka (through Kafka Connect source connectors) can poll a database table periodically and store the changes in a "topic", and poll a web service periodically and store that data in another topic. These topics are then available to consumers: a developer can write an application that reads from the first topic and does something with the data. Kafka keeps track of what each consumer has read by using offsets, and it offers replication and other options. This removes the need to write custom code to integrate multiple sources and destinations; instead, you configure that part.

StreamSets can read from and write to Kafka. StreamSets does not store the data in its own system, while Kafka stores the data for a configurable period of time.

  3. SSIS is similar to StreamSets in that it is used to create pipelines/packages consisting of multiple tasks; each task takes the data/result from the previous task and does something with it. Both StreamSets and SSIS can connect to many kinds of data sources and destinations.

My personal view on how StreamSets and SSIS differ:

  • StreamSets is web based, while SSIS needs Visual Studio. The StreamSets GUI is easier to use and does not require special software to be installed for each developer.
  • Deploying StreamSets pipelines to production with source control was easier than deploying SSIS packages.
  • SSIS is a Microsoft product, so it integrates very well with other Microsoft products. StreamSets can be installed on any platform, which makes it ideal for the AWS cloud.
  • If you want to write SSIS scripting tasks, you have to use C#/.NET. StreamSets script tasks can be written in Jython and JavaScript.
  • SSIS is older and has tons of documentation online.
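
To make points 1 and 2 a bit more concrete, here is a rough toy model in plain Python. It is not the Kafka or StreamSets API: `ToyTopic` and `run_pipeline` are hypothetical names invented for this sketch, timestamps are passed in by hand, and the consumer offsets live in an ordinary dict (real Kafka tracks committed offsets per consumer group on the broker side). It only tries to show that the topic retains records for a configurable time regardless of who has read them, while the pipeline passes records through without keeping them.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTopic:
    """In-memory stand-in for a Kafka topic: an append-only log with retention."""
    retention_seconds: float
    _log: list = field(default_factory=list)  # entries are (offset, timestamp, value)
    _next_offset: int = 0

    def append(self, value, now):
        """Store a record and return the offset it was assigned."""
        offset = self._next_offset
        self._log.append((offset, now, value))
        self._next_offset += 1
        return offset

    def expire(self, now):
        """Drop records older than the retention period, whether read or not."""
        self._log = [r for r in self._log if now - r[1] <= self.retention_seconds]

    def read_from(self, offset):
        """Return (offset, value) pairs at or after the given offset."""
        return [(o, v) for (o, _, v) in self._log if o >= offset]


def run_pipeline(origin, processors, destination):
    """Origin -> processors -> destination. The pipeline itself stores nothing."""
    count = 0
    for record in origin:
        for step in processors:
            record = step(record)
        destination.append(record)
        count += 1
    return count


topic = ToyTopic(retention_seconds=60)
topic.append("order:1", now=0)
topic.append("order:2", now=10)
topic.append("order:3", now=50)

# Two independent consumers, each with its own position in the log.
offsets = {"reporting": 0, "billing": 2}

# A "pipeline" that reads from the topic, transforms, and writes to a destination.
warehouse = []
records = topic.read_from(offsets["reporting"])
moved = run_pipeline((value for _, value in records), [str.upper], warehouse)
offsets["reporting"] = records[-1][0] + 1

# Reading did not remove anything; only retention does.
billing_view = topic.read_from(offsets["billing"])
topic.expire(now=65)
remaining = topic.read_from(0)
```

After this runs, `warehouse` holds the three upper-cased records, the `billing` consumer still sees `order:3` from its own offset, and only `order:1` has been dropped by retention.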
Gth lala answered Sep 20 '22 00:09


StreamSets is a graphical tool that contains components that allow for data movement, which happen to include Kafka producers and consumers, but you're not required to use them.

They're complementary: by using Kafka, you can allow for back-pressure in streaming systems, or have non-StreamSets producers/consumers interacting with other Kafka topics. And no, Kafka doesn't move the data (except for internal replication); the clients that interact with the brokers do.
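
As a loose illustration of both claims, the toy below uses a passive in-memory "broker" that only stores what producer clients send and only hands out what consumer clients ask for. Everything here (`ToyBroker`, `send`, `poll`, `max_records`) is a made-up stand-in for this sketch, not a real Kafka client API. A fast producer and a slow consumer run side by side; the consumer is never overwhelmed because it pulls at its own pace, and the difference simply shows up as growing lag in the log.

```python
class ToyBroker:
    """Passive buffer: it never pushes records to anyone."""

    def __init__(self):
        self._log = []

    def send(self, value):
        # Called by a producer client; the broker only stores the record.
        self._log.append(value)

    def poll(self, offset, max_records):
        # Called by a consumer client; it receives at most max_records.
        return self._log[offset:offset + max_records]

    def end_offset(self):
        # Offset that the next record sent will receive.
        return len(self._log)


broker = ToyBroker()
offset = 0          # the consumer's own position in the log
processed = []
lag_history = []

for tick in range(4):
    # Fast producer: 5 records per tick.
    for i in range(5):
        broker.send(f"event-{tick}-{i}")

    # Slow consumer: pulls at most 2 records per tick.
    batch = broker.poll(offset, max_records=2)
    processed.extend(batch)
    offset += len(batch)

    # Lag = records written but not yet read by this consumer.
    lag_history.append(broker.end_offset() - offset)
```

Here the lag grows by three records every tick. In a real deployment that backlog is bounded by retention and disk rather than by the consumer's memory, which is the sense in which the broker absorbs back-pressure.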

I've not used Informatica or SSIS, but I'm sure that if you contacted someone at StreamSets, they could explain how they compare.

OneCricketeer answered Sep 18 '22 00:09