I was reading articles about Kafka and StreamSets, and my understanding is: <ol> <li>Kafka acts as a broker between producer systems and subscribers. Producers push data into the Kafka cluster, and subscribers pull data from it.</li> <li>StreamSets is a technology for moving data from a source to a destination through a pipeline.</li> </ol> My questions are below; please help clarify: <ol> <li>What is the fundamental difference between Kafka and StreamSets? Is it that Kafka doesn't move data but StreamSets does?</li> <li>If Kafka doesn't move data, what is it used for? If it does move data like ETL solutions, how is it different from SSIS, Informatica, etc.?</li> <li>How is StreamSets different from SSIS, Informatica, etc.?</li> </ol>


Kafka vs StreamSets

2 Answers

1. In StreamSets, most of the time we create "data pipelines". Think of a pipeline as an application made up of multiple steps or tasks: the first step might read data from a database, Kafka, or any number of other data sources; the second might modify the data; the third might run a script; and so on. Finally, the pipeline saves the transformed data to a destination, which could be a database or cloud storage. So Kafka and StreamSets can work together: StreamSets can read data from Kafka and write data to it.
2. I think of Kafka as a place where data from multiple sources is collected and made available to consumers for a certain time. For example, a connector (such as a Kafka Connect source) can poll a database table periodically and store the changes in a "topic", and another can poll a web service periodically and store its data in a second topic. These topics are then available to consumers: a developer can write an application that reads from the first topic and does something with the data. Kafka keeps track of what each consumer has read by using offsets, and it offers replication and other options. This removes the need to write custom code to integrate multiple sources and destinations; instead, you configure that part.

StreamSets can read from and write to Kafka. The key difference is storage: StreamSets does not store the data in its own system, while Kafka retains the data for a configurable period of time.
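To make the offsets and retention idea concrete, here is a toy in-memory sketch. It is not Kafka's API and needs no broker; `ToyTopic`, its methods, and the group names are hypothetical names invented for illustration, and timestamps are passed in by hand so the behaviour is easy to follow.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ToyTopic:
    """In-memory stand-in for a single-partition topic (NOT Kafka's API)."""
    retention_seconds: float
    _records: list = field(default_factory=list)    # (offset, timestamp, value)
    _next_offset: int = 0
    _committed: dict = field(default_factory=dict)  # consumer group -> next offset

    def append(self, value, now=None):
        """Producer side: append a record and return its offset."""
        now = time.time() if now is None else now
        offset = self._next_offset
        self._records.append((offset, now, value))
        self._next_offset += 1
        return offset

    def expire(self, now=None):
        """Drop records older than the retention window."""
        now = time.time() if now is None else now
        self._records = [
            r for r in self._records if now - r[1] <= self.retention_seconds
        ]

    def poll(self, group, max_records=10):
        """Consumer side: return unread values for this group, advance its offset."""
        start = self._committed.get(group, 0)
        batch = [r for r in self._records if r[0] >= start][:max_records]
        if batch:
            self._committed[group] = batch[-1][0] + 1
        return [value for _, _, value in batch]


topic = ToyTopic(retention_seconds=60)
topic.append("row 1 changed", now=0)
topic.append("row 2 changed", now=30)

first_read = topic.poll("reporting-app")   # both records
second_read = topic.poll("reporting-app")  # nothing new, offset already advanced
other_group = topic.poll("audit-app")      # a separate group starts from offset 0

topic.expire(now=70)                       # the record written at t=0 is past 60 s
late_group = topic.poll("late-app")        # sees only what is still retained
```

Each consumer group tracks its own position, so the same stored data can be read independently by several applications until retention removes it. A pipeline tool that only passes records through has no equivalent of this stored log.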

3. SSIS is similar to StreamSets in that it is used to create pipelines/packages made up of multiple tasks, where each task can take the data or result from the previous task and do something with it. Both StreamSets and SSIS can connect to many kinds of data sources and destinations.
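The "each task takes the result of the previous task" idea can be sketched in a few lines of plain Python. This is only an illustration of the pipeline shape, not how StreamSets or SSIS are implemented; the stage names and sample records are made up.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict
Stage = Callable[[Iterator[Record]], Iterator[Record]]


def origin() -> Iterator[Record]:
    """Origin: stands in for reading from a database, Kafka, or a file."""
    yield {"id": 1, "name": " alice "}
    yield {"id": 2, "name": "BOB"}


def trim_names(records: Iterator[Record]) -> Iterator[Record]:
    """Processor 1: strip surrounding whitespace from the name field."""
    for record in records:
        yield {**record, "name": record["name"].strip()}


def lowercase_names(records: Iterator[Record]) -> Iterator[Record]:
    """Processor 2: normalise the name field to lower case."""
    for record in records:
        yield {**record, "name": record["name"].lower()}


def run_pipeline(source: Iterable[Record], stages: List[Stage],
                 destination: List[Record]) -> None:
    """Chain the stages so each one consumes the previous stage's output."""
    stream = iter(source)
    for stage in stages:
        stream = stage(stream)
    for record in stream:
        destination.append(record)  # stands in for a database or cloud write


destination: List[Record] = []
run_pipeline(origin(), [trim_names, lowercase_names], destination)
```

Records flow through one at a time and nothing is kept once it reaches the destination, which matches the point that the pipeline tool itself does not store the data.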

My personal view on how StreamSets and SSIS are different is:

- StreamSets is web-based, while SSIS needs Visual Studio. The StreamSets GUI is easier to use and does not require special software to be installed for each developer.
- Deploying StreamSets pipelines to production with source control was easier than deploying SSIS packages.
- SSIS is a Microsoft product, so it integrates very well with other Microsoft products. StreamSets can be installed on any platform, which makes it a good fit for the AWS cloud.
- If you want to write SSIS script tasks, you have to use C#/.NET. StreamSets script tasks can be written in Jython, Groovy, or JavaScript.
- SSIS is older and has tons of documentation online.

170

answered Sep 20 '22 00:09

Gth lala

StreamSets is a graphical tool that contains components that allow for data movement, which happen to include Kafka producers and consumers, but you're not required to use them.

They're complementary. By using Kafka, you can absorb back-pressure in streaming systems, or have non-StreamSets producers and consumers interacting with other Kafka topics. And no, Kafka itself doesn't move the data (except for internal replication); the clients that interact with the brokers do.
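As a rough illustration of the back-pressure point, here is a toy model in which the log only stores records and the clients do all the moving. `ToyLog` and the rates used are assumptions made up for this sketch, not Kafka's API.

```python
class ToyLog:
    """Append-only list standing in for a topic partition (NOT Kafka's API)."""

    def __init__(self):
        self.records = []

    def append(self, value):
        # Called by a producer client: the client pushes, the log just stores.
        self.records.append(value)

    def read(self, offset, max_records):
        # Called by a consumer client: the client pulls from its own offset.
        return self.records[offset:offset + max_records]


log = ToyLog()
position = 0       # the consumer's own offset into the log
lag_history = []   # how far behind the consumer is after each tick

for tick in range(4):
    # The producer pushes 5 records per tick...
    for i in range(5):
        log.append(f"event-{tick}-{i}")
    # ...while the slower consumer pulls at most 2 per tick.
    batch = log.read(position, max_records=2)
    position += len(batch)
    lag_history.append(len(log.records) - position)

# Once the producer stops, the consumer catches up at its own pace.
while position < len(log.records):
    position += len(log.read(position, max_records=2))
```

The producer is never blocked by the slow consumer. The backlog simply grows in the log (3, 6, 9, then 12 records behind) and is drained later, which is the buffering role Kafka can play between a fast source and a slower pipeline.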

I've not used Informatica or SSIS, but I'm sure that if you contacted someone at StreamSets, they could explain how they compare.

answered Sep 18 '22 00:09

OneCricketeer


Tags: apache-kafka, ssis, streamsets, informatica

NikRED
