I am working on a problem where we intend to perform multiple transformations on data using EMR (SparkSQL).
After going through the documentation of AWS Data Pipeline and AWS Step Functions, I am slightly confused as to what use case each tries to solve. I looked around but did not find an authoritative comparison between the two. There are multiple resources that show how I can use either of them to schedule and trigger Spark jobs on an EMR cluster.
Which one should I use for scheduling and orchestrating my processing EMR jobs?
More generally, in what situation would one be a better choice over the other as far as ETL/data processing is concerned?
AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals.
A key difference between AWS Glue and Data Pipeline is that developers must rely on EC2 instances to execute tasks in a Data Pipeline job, which is not a requirement with Glue. AWS Data Pipeline manages the lifecycle of these EC2 instances, launching and terminating them when a job operation is complete.
Step Functions is a serverless orchestration service that lets you easily coordinate multiple Lambda functions into flexible workflows that are easy to debug and easy to change. Step Functions will keep your Lambda functions free of additional logic by triggering and tracking each step of your application for you.
AWS Step Functions is a low-code, visual workflow service that developers use to build distributed applications, automate IT and business processes, and build data and machine learning pipelines using AWS services.
Yes, there are many ways to achieve the same thing, and the difference is in the details and in your use case. I am even going to offer one more alternative :)
If you are doing a sequence of transformations and all of them run on an EMR cluster, maybe all you need is to create the cluster with the steps attached, or to add the steps to an existing cluster via the API. Steps execute in order on your cluster.
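For illustration, here is a minimal boto3 sketch of the "cluster with steps" option. The cluster name, instance types, and S3 script paths are hypothetical; each step runs a SparkSQL script via `command-runner.jar`, and the cluster terminates itself when the last step finishes.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical SparkSQL transformation steps; they run in the order listed.
steps = [
    {
        "Name": "transform-step-1",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-sql", "-f", "s3://my-bucket/sql/transform_1.sql"],
        },
    },
    {
        "Name": "transform-step-2",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-sql", "-f", "s3://my-bucket/sql/transform_2.sql"],
        },
    },
]

# Create the cluster with the steps attached; with KeepJobFlowAliveWhenNoSteps
# set to False, the cluster shuts down after the last step completes.
response = emr.run_job_flow(
    Name="sparksql-transforms",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=steps,
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```

If the cluster already exists, the same `Steps` list can instead be passed to `emr.add_job_flow_steps(JobFlowId=..., Steps=steps)`.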
If you have different sources of data, or you want to handle more complex scenarios, then both AWS Data Pipeline and AWS Step Functions would work. AWS Step Functions is a generic way of implementing workflows, while AWS Data Pipeline is a workflow service specialized for working with data.
That means Data Pipeline is better integrated when it comes to dealing with data sources and outputs, and to working directly with tools like S3, EMR, DynamoDB, Redshift, or RDS. So for a pure data pipeline problem, chances are AWS Data Pipeline is the better candidate.
That said, AWS Data Pipeline is not very flexible. If the data source you need is not supported, or if you want to execute an activity that is not integrated, you need to hack your way around with shell scripts.
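To make that concrete, here is a rough sketch of what the Data Pipeline route might look like through boto3, defining an EmrCluster resource and an EmrActivity that runs a SparkSQL step. The pipeline name, object names, roles, and S3 path are hypothetical, and the default Data Pipeline IAM roles are assumed to exist.

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

pipeline = dp.create_pipeline(name="sparksql-transforms", uniqueId="sparksql-transforms-001")
pipeline_id = pipeline["pipelineId"]

# Pipeline objects are expressed as key/value fields; refValue points at
# another object in the same definition.
objects = [
    {
        "id": "Default",
        "name": "Default",
        "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ],
    },
    {
        "id": "TransformCluster",
        "name": "TransformCluster",
        "fields": [
            {"key": "type", "stringValue": "EmrCluster"},
            {"key": "releaseLabel", "stringValue": "emr-6.9.0"},
            {"key": "masterInstanceType", "stringValue": "m5.xlarge"},
            {"key": "coreInstanceType", "stringValue": "m5.xlarge"},
            {"key": "coreInstanceCount", "stringValue": "2"},
        ],
    },
    {
        "id": "SparkSqlActivity",
        "name": "SparkSqlActivity",
        "fields": [
            {"key": "type", "stringValue": "EmrActivity"},
            {"key": "runsOn", "refValue": "TransformCluster"},
            # EmrActivity steps are comma-separated jar-and-arguments strings.
            {"key": "step", "stringValue": "command-runner.jar,spark-sql,-f,s3://my-bucket/sql/transform_1.sql"},
        ],
    },
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
dp.activate_pipeline(pipelineId=pipeline_id)
```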
On the other hand, AWS Step Functions is not specialized: it has good integration with many AWS services and with AWS Lambda, meaning you can easily integrate with almost anything via serverless APIs.
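As a rough sketch of the Step Functions route, the state machine below uses the managed EMR integration to add one SparkSQL step to a cluster and wait for it to finish. The cluster ID is assumed to arrive in the execution input, and the state machine name, IAM role ARN, and S3 path are hypothetical.

```python
import json
import boto3

sfn = boto3.client("stepfunctions", region_name="us-east-1")

# Amazon States Language definition: submit one EMR step and wait for it to
# complete (the .sync suffix makes the task block until the step finishes).
definition = {
    "StartAt": "RunTransform",
    "States": {
        "RunTransform": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId.$": "$.ClusterId",
                "Step": {
                    "Name": "transform-step-1",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-sql", "-f", "s3://my-bucket/sql/transform_1.sql"],
                    },
                },
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="sparksql-transform-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEmrRole",  # hypothetical role
)
```

From here you would add more Task states (or a Map state) for the remaining transformations, and Lambda tasks for anything EMR does not cover.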
So it really depends on what you need to achieve and the type of workload you have.