 

Can Spark Replace an ETL Tool?

Existing process: raw structured data is copied into a staging layer of Redshift. An ETL tool such as Informatica or Talend then does incremental loads into the fact and dimension tables of the data mart/data warehouse. All joins happen in the database layer (the ETL tool pushes the queries down into the DB). - Can Spark replace the ETL tool, do the same processing, and load the data into Redshift? - What are the advantages and disadvantages of this architecture?

Dipankar asked Nov 25 '16


People also ask

Can Spark be used for ETL?

Apache Spark provides a framework for building ETL data pipelines, and such pipelines enable organizations to make faster data-driven decisions through automation.

Can we use PySpark for ETL?

There are many ETL tools on the market that can carry out this process. PySpark supports the basic data transformation operations (sorting, mapping, joins, aggregations, etc.), and its ability to rapidly process massive amounts of data is a key advantage. A minimal sketch of these operations is shown below.
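As a minimal PySpark sketch (the file paths and column names here are hypothetical, not from the original question), sorting, mapping, and joining two staged datasets might look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("basic-etl-transforms").getOrCreate()

# Hypothetical inputs: orders and customers extracted to CSV in a staging area.
orders = spark.read.csv("s3://my-bucket/staging/orders.csv", header=True, inferSchema=True)
customers = spark.read.csv("s3://my-bucket/staging/customers.csv", header=True, inferSchema=True)

# Mapping: derive a new column; join: enrich orders with customer attributes; sorting: order by date.
enriched = (
    orders
    .withColumn("order_total", F.col("quantity") * F.col("unit_price"))
    .join(customers, on="customer_id", how="left")
    .orderBy("order_date")
)

# Write the transformed result back out, e.g. as Parquet for downstream loading.
enriched.write.mode("overwrite").parquet("s3://my-bucket/transformed/orders/")
```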

Is Spark a data integration tool?

Using Apache Spark as a data integration tool is largely hassle free, especially when you are dealing with data objects such as structured and unstructured flat files; you just set up a cluster of multiple nodes. Because it is an in-memory data processing engine, it is claimed to be lightning fast.
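For example, a short sketch (with hypothetical S3 paths) of reading structured and unstructured flat files into DataFrames on a cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flat-file-integration").getOrCreate()

# Structured flat files: CSV with a header, and JSON lines.
csv_df = spark.read.csv("s3://my-bucket/raw/products.csv", header=True, inferSchema=True)
json_df = spark.read.json("s3://my-bucket/raw/events.json")

# Unstructured flat files: each line of the logs becomes one row in a single 'value' column.
log_df = spark.read.text("s3://my-bucket/raw/app-logs/")

csv_df.printSchema()
log_df.show(5, truncate=False)
```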


2 Answers

I have worked extensively on projects migrating existing ETL jobs to Spark for the last 4 years.

The problems with the ETL jobs were as follows:

  1. They didn't give us a strict SLA. The jobs shared the same resource pool, so prioritizing was hard; everyone marked their jobs as business critical.

  2. Another important problem was cost: the ETL-based jobs were expensive because we were paying the vendor.

  3. Scale was another important issue. We required ETL at a gigantic scale, which we found too expensive.

Thus, we migrated all the ETLs to Spark jobs. Since Spark and Hadoop are both open source, we didn't have any additional cost beyond the compute.

Spark's support for SQL has improved dramatically over time. You can run ML/graph queries and normal ETL on the same DataFrame. Spark joins are fast and can be optimized for different datasets, and you get more fine-grained control over your transformations and joins.
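As a small illustration of that control (table, column, and path names here are hypothetical), the same data can be queried with plain SQL and then joined with an explicit broadcast hint so a small dimension table avoids a shuffle:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("sql-and-joins").getOrCreate()

# Hypothetical fact and dimension data already written out as Parquet.
sales = spark.read.parquet("s3://my-bucket/transformed/sales/")
stores = spark.read.parquet("s3://my-bucket/transformed/stores/")

# Run plain SQL over the same DataFrame by registering a temporary view.
sales.createOrReplaceTempView("sales")
daily_totals = spark.sql(
    "SELECT store_id, order_date, SUM(amount) AS total FROM sales GROUP BY store_id, order_date"
)

# Fine-grained join control: broadcast the small dimension table to avoid shuffling the fact data.
report = daily_totals.join(broadcast(stores), on="store_id", how="left")
report.explain()  # inspect the physical plan to confirm a broadcast hash join
```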

We started with a long-running cluster that supported Spark and other big data tools. We unified the platform so that all the customers could use it, and we slowly migrated all the ETL jobs to Spark jobs.

We do use Redshift for reporting, but all the heavy lifting of finding insights in the data, running joins, managing incoming data, and merging it with the existing snapshot is done in Spark, along the lines of the sketch below.
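One common shape for that incremental merge is sketched below. Everything here is an assumption for illustration: the S3 paths, the `customer_id` business key, the table name, and the connection details are hypothetical, and the Redshift write assumes a Redshift JDBC driver (or an equivalent Spark-Redshift connector) is available on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-merge").getOrCreate()

# Hypothetical locations: yesterday's full snapshot and today's incremental extract.
snapshot = spark.read.parquet("s3://my-bucket/snapshots/customers/")
incoming = spark.read.parquet("s3://my-bucket/incoming/customers/2016-11-25/")

# Merge: keep snapshot rows whose key did not arrive today, then append today's rows.
# (unionByName requires Spark 2.3+.)
merged = (
    snapshot.join(incoming.select("customer_id"), on="customer_id", how="left_anti")
    .unionByName(incoming)
)

# Persist the new snapshot back to S3 ...
merged.write.mode("overwrite").parquet("s3://my-bucket/snapshots/customers_new/")

# ... and push a reporting copy to Redshift over JDBC (driver JAR assumed available).
(merged.write
    .format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/dev")
    .option("dbtable", "dim_customer")
    .option("user", "etl_user")
    .option("password", "...")  # placeholder; use a secrets manager in practice
    .mode("overwrite")
    .save())
```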

We were able to save millions of dollars by moving away from the existing ETL jobs and migrating them to Spark.

My two pennies on this: eventually Spark, Hive, and the rest of the Hadoop big data stack will outrun the ETL tools. I am not saying ETL tools will be eviscerated, but the open source solutions will definitely become the dominant force in this domain.

Avishek Bhattacharya answered Oct 20 '22


May I know the reason for replacing Informatica with Spark? The Informatica BDM 10.1 edition comes with a Spark execution engine: it converts the Informatica mappings into their Spark equivalents (Scala code) and executes them on the cluster. Also, in my opinion, Spark is more suitable for data that does not change at intermediate stages, whereas in ETL the data changes from transformation to transformation.

Volamr answered Oct 20 '22