
Move and transform data between databases using Airflow

Tags:

airflow

Using Airflow, I extract data from a MySQL database, transform it with Python, and load it into a Redshift cluster.

Currently I use 3 Airflow tasks: they pass the data along by writing CSV files to local disk.

How could I do this without writing to disk? Should I write one big task in Python? (That would lower visibility.)
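
For illustration, a minimal sketch of this kind of three-task, CSV-on-disk DAG; the connection ids, table names, paths and the transformation below are placeholders, not the actual pipeline:

```python
# Rough sketch of the current three-task, CSV-on-disk setup.
# Connection ids, queries, paths and the transformation are placeholders.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.hooks.mysql_hook import MySqlHook          # Airflow 1.x import paths
from airflow.hooks.postgres_hook import PostgresHook    # Redshift speaks the Postgres protocol
from airflow.operators.python_operator import PythonOperator

EXTRACT_CSV = "/tmp/extract.csv"
TRANSFORM_CSV = "/tmp/transform.csv"

def extract():
    df = MySqlHook(mysql_conn_id="mysql_source").get_pandas_df("SELECT * FROM source_table")
    df.to_csv(EXTRACT_CSV, index=False)          # task boundary = file on local disk

def transform():
    df = pd.read_csv(EXTRACT_CSV)
    df["amount"] = df["amount"].fillna(0)        # placeholder transformation
    df.to_csv(TRANSFORM_CSV, index=False)

def load():
    df = pd.read_csv(TRANSFORM_CSV)
    # plain INSERTs; a production pipeline would more likely stage to S3 and COPY
    PostgresHook(postgres_conn_id="redshift_target").insert_rows(
        "target_table", df.itertuples(index=False, name=None))

with DAG("mysql_to_redshift", start_date=datetime(2017, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_transform >> t_load
```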

Edit: this is a question about Airflow and best practice for choosing the granularity of tasks and passing data between them. It is not a general question about data migration or ETL; ETL is only used here as an example workload for Airflow tasks.

asked Jan 10 '17 by Hugo


People also ask

Can Airflow be used for ETL?

Airflow is a popular framework for workflow management. Its excellent scheduling capabilities and graph-based execution flow make it a great option for running ETL jobs.

Is Airflow a data warehouse?

As data warehouses become a common solution for data-driven companies' needs, choosing the right architecture and tools is essential. Tools such as BigQuery and Apache Airflow can ease the ETL work involved in building a data warehouse.


1 Answer

There are different ways you can achieve this:

If you are using the AWS RDS service for MySQL, you can use AWS Data Pipeline to transfer data from MySQL to Redshift. AWS Data Pipeline has a built-in template for exactly this, and you can even schedule incremental data transfers from MySQL to Redshift:

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-redshift.html

How large is your table? If it is not too large, you can read the whole table into Python as a pandas DataFrame (or as tuples) and then transfer it to Redshift. Even if the table is large, you can still read it in chunks and push each chunk to Redshift, as sketched below. Keep in mind that pandas is somewhat memory-inefficient when you read an entire table into it.
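
For illustration, a hedged sketch of the chunked approach with pandas and SQLAlchemy; the connection strings, table names, chunk size and transformation are placeholders:

```python
# Chunked MySQL -> Redshift transfer sketch. DSNs, table names, the chunk size
# and the transformation are placeholders.
import pandas as pd
from sqlalchemy import create_engine

mysql_engine = create_engine("mysql+pymysql://user:password@mysql-host/source_db")
redshift_engine = create_engine("postgresql+psycopg2://user:password@redshift-host:5439/target_db")

# With chunksize set, read_sql returns an iterator of DataFrames instead of one
# big frame, so only one chunk is held in memory at a time.
for chunk in pd.read_sql("SELECT * FROM source_table", mysql_engine, chunksize=50000):
    chunk["amount"] = chunk["amount"].fillna(0)   # placeholder transformation
    # to_sql issues INSERTs; for large volumes, staging to S3 and using Redshift COPY is faster
    chunk.to_sql("target_table", redshift_engine, if_exists="append", index=False)
```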

Creating separate tasks in Airflow will not help much here. You can either write a single function and call it from the DAG with a PythonOperator, or put the logic in a Python script and run it from the DAG with a BashOperator; a sketch of the single-task approach follows.
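
A minimal sketch of the single-function approach with a PythonOperator; connection ids, the query and the transformation are again placeholders:

```python
# One task does extract, transform and load entirely in memory: nothing is written to disk.
from datetime import datetime

from airflow import DAG
from airflow.hooks.mysql_hook import MySqlHook          # Airflow 1.x import paths
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

def etl():
    df = MySqlHook(mysql_conn_id="mysql_source").get_pandas_df(
        "SELECT * FROM source_table")                    # extract
    df["amount"] = df["amount"].fillna(0)                # transform (placeholder)
    PostgresHook(postgres_conn_id="redshift_target").insert_rows(
        "target_table", df.itertuples(index=False, name=None))  # load

with DAG("mysql_to_redshift_single_task", start_date=datetime(2017, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    PythonOperator(task_id="etl", python_callable=etl)
```

The trade-off is the one the question already points out: a single task avoids the intermediate files, but the Airflow UI no longer shows the extract, transform and load steps separately.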

answered Oct 11 '22 by liferacer