I have been reading articles about Kafka and StreamSets, and my understanding is:
Kafka acts as a broker between producer systems and subscribers. Producers push data into the Kafka cluster, and subscribers pull data from Kafka.
StreamSets is a technology for moving data from one source to another through a pipeline.
My questions are below; please help clarify.
What is the fundamental difference between Kafka and StreamSets? Is it that Kafka doesn't move data but StreamSets does?
If Kafka doesn't move the data, what is Kafka used for? If it moves data like ETL solutions do, how is it different from SSIS, Informatica, etc.?
How is StreamSets different from SSIS, Informatica, etc.?
StreamSets enables next-generation ETL through the StreamSets Transformer tool. The product provides enterprises with the flexibility to create ETL pipelines for both batch and streaming data as well as clear visibility into their data processing operation and performance across both cloud and on-prem systems.
Kafka Streams is a client library for processing and analyzing data stored in Kafka. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management and real-time querying of application state.
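To make the event-time windowing idea concrete, here is a minimal sketch in plain Python. It is not the Kafka Streams API (which is a Java library); the function name tumbling_window_counts and the sample events are made up for illustration. Each event is grouped by the timestamp it carries (event time), not by the order in which it happens to arrive (processing time).

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Count events per (window_start, key) using the event's own timestamp.

    `events` is an iterable of (event_time, key) pairs. Arrival order does
    not matter, because each window is derived from event_time alone.
    """
    counts = defaultdict(int)
    for event_time, key in events:
        # Fixed-size, non-overlapping windows: [0, 10), [10, 20), ...
        window_start = (event_time // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical events that arrive out of order.
events = [(12, "click"), (3, "click"), (14, "view"), (7, "click")]
counts = tumbling_window_counts(events, window_size=10)
```

Even though the event stamped 12 arrives first, the two clicks stamped 3 and 7 still land together in the window starting at 0.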
StreamSets Data Collector Engine is a data pipeline engine for streaming, CDC, and batch ingestion from a wide range of sources to a wide range of destinations.
Kafka is more configurable than Kinesis. With Kafka, it's possible to write data to a single server, whereas Kinesis is designed to write simultaneously to three servers, a constraint that can give Kafka a performance advantage in some setups.
In StreamSets, most of the time we create "data pipelines". Think of a pipeline as an application that consists of multiple steps or tasks: the first step might read data from a database, Kafka, or any number of other data sources; the second might modify the data; the third might run a script; and the final step saves the transformed data to a destination, which could be a database or cloud storage. So Kafka and StreamSets can work together: StreamSets can read data from and write data to Kafka.
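As a rough illustration of the "pipeline as a chain of steps" idea, here is a small sketch in plain Python. It does not use StreamSets at all; the stage names (origin, trim_name, parse_amount, run_pipeline) and the sample records are hypothetical, and a real pipeline would be configured in the StreamSets UI rather than written like this.

```python
def origin():
    """Hypothetical origin stage; stands in for a database, Kafka topic, or file."""
    yield {"id": 1, "name": " alice ", "amount": "10.5"}
    yield {"id": 2, "name": "bob", "amount": "n/a"}
    yield {"id": 3, "name": "carol", "amount": "7"}

def trim_name(records):
    """Processor stage: strip surrounding whitespace from the name field."""
    for record in records:
        yield {**record, "name": record["name"].strip()}

def parse_amount(records):
    """Processor stage: convert amount to a float, dropping records that fail."""
    for record in records:
        try:
            yield {**record, "amount": float(record["amount"])}
        except ValueError:
            continue  # skip records whose amount is not numeric

def run_pipeline(source, processors, destination):
    """Chain the stages and write every surviving record to the destination."""
    stream = source
    for processor in processors:
        stream = processor(stream)
    for record in stream:
        destination.append(record)  # a list stands in for a database or cloud store
    return destination

sink = run_pipeline(origin(), [trim_name, parse_amount], [])
```

Records flow through one at a time and nothing is kept inside the pipeline itself; the only place data ends up is the destination.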
I think of Kafka as a place where data from multiple sources is collected and made available to consumers for a certain time. For example, a producer (such as a Kafka Connect source connector) can read from a database table periodically and store the changes in a "topic", or read from a web service periodically and store that data in another topic. These topics are then available to consumers: a developer can create an application that reads data from the first topic and does something with it. Kafka keeps track of what each consumer has read by using offsets, and it offers replication and other options. This removes the need to write custom code that integrates multiple sources and destinations; instead, you can configure this part.
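The offset idea can be sketched with a toy in-memory model. This is not Kafka's client API; ToyTopic, its methods, and the group names are invented for illustration, and the model assumes a single partition with no replication. The point is only that the log is append-only and each consumer group remembers its own position in it.

```python
from collections import defaultdict

class ToyTopic:
    """In-memory stand-in for one single-partition topic (not Kafka's API)."""

    def __init__(self):
        self.log = []                      # append-only; list index acts as the offset
        self.committed = defaultdict(int)  # consumer group -> next offset to read

    def append(self, value):
        """Producer side: add a record and return the offset it was stored at."""
        self.log.append(value)
        return len(self.log) - 1

    def poll(self, group, max_records=10):
        """Consumer side: pull the next batch for this group and advance its offset."""
        start = self.committed[group]
        batch = self.log[start:start + max_records]
        self.committed[group] = start + len(batch)
        return batch

topic = ToyTopic()
for change in ["row 1 inserted", "row 2 inserted", "row 1 updated"]:
    topic.append(change)

first = topic.poll("reporting", max_records=2)   # first two records
second = topic.poll("reporting", max_records=2)  # resumes where "reporting" left off
audit = topic.poll("audit")                      # a different group starts from offset 0
```

Reading does not remove anything from the log, which is why the "audit" group still sees all three records after "reporting" has consumed them.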
StreamSets can read from and write to Kafka. StreamSets does not store the data in its own system, while Kafka stores the data for a configurable period of time.
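A simplified way to picture "stores the data for a configurable period" is the sketch below. It is a toy model, not how Kafka is implemented: RetainedLog and its methods are hypothetical, timestamps are plain integers in seconds, and real Kafka expires whole log segments rather than individual records. In Kafka itself the retention period is set through configuration (for example the topic-level retention.ms setting).

```python
class RetainedLog:
    """Toy log that drops records older than a configurable retention period."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = []  # list of (timestamp, value) pairs

    def append(self, timestamp, value):
        self.records.append((timestamp, value))

    def expire(self, now):
        """Keep only records whose timestamp is within the retention period."""
        cutoff = now - self.retention_seconds
        self.records = [(ts, v) for ts, v in self.records if ts >= cutoff]

    def values(self):
        return [v for _, v in self.records]

log = RetainedLog(retention_seconds=60)
log.append(0, "a")
log.append(50, "b")
log.append(100, "c")
log.expire(now=100)  # cutoff is 40, so "a" is dropped whether or not anyone read it
remaining = log.values()
```

Expiry depends only on age, not on whether a consumer has read the record, which is the main contrast with a pipeline tool that holds nothing once the data has been delivered.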
My personal view on how StreamSets and SSIS are different is:
StreamSets is a graphical tool that contains components that allow for data movement, which happen to include Kafka producers and consumers, but you're not required to use them.
They're complementary: by using Kafka, you can allow for back-pressure in streaming systems or have non-StreamSets producers and consumers interacting with other Kafka topics. No, Kafka doesn't move the data (except for internal replication); the clients that interact with the brokers do.
I've not used Informatica or SSIS, but I'm sure that if you contacted someone at StreamSets, they could explain how they compare.