Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Move data from Postgres/MySQL to S3 using Airflow

We are trying to move from Pentaho Kettle, to Apache AIrflow to do ETL and centralize all data processes under 1 tool.

We use Kettle to daily read data from Postgres/Mysql databases, and move the data to S3 -> Redshift.

What is the easiest way to do this? I do not see Operator that could directly do this; so Should i use MySQL/Postgres operator to put data in a local file, and the use S3 operator to move data to S3?

Thank you

like image 473
Srdjan Nikitovic Avatar asked Oct 13 '17 14:10

Srdjan Nikitovic


People also ask

How do I connect to PostgreSQL database in Airflow?

Configuring the ConnectionSpecify the schema name to be used in the database. Specify the user name to connect. Specify the password to connect. Specify the extra parameters (as json dictionary) that can be used in postgres connection.

Does Airflow use Postgres?

Airflow was built to interact with its metadata using SqlAlchemy with MySQL, Postgres and SQLite as supported backends (SQLite is used primarily for development purpose).

What is Postgres Airflow?

Airflow is essentially a graph (Directed Acyclic Graph) made up of tasks (nodes) and dependencies (edges). A task defined or implemented by a operator is a unit of work in your data pipeline. The purpose of Postgres Operator is to define tasks involving interactions with a PostgreSQL database.


2 Answers

You can build your own operator 'mysql_to_s3' and add it as a plugin to Airflow.

There is an operator to archive data from Mysql to gcs:

mysql_to_gcs.py

You can let all the code with a little change on def _upload_to_gcs using s3_hook instead: s3_hook.py.

Documentation about custom plugins:

Airflow plugins: Blog article

Airflow plugins: Official documentation

like image 160
Omar14 Avatar answered Oct 23 '22 03:10

Omar14


airflow-plugins (by Astronomer) has a MySqlToS3Operator that will take the resultset of a mysql query and place it on s3 as either csv or json.

The plugin can be found here: https://github.com/airflow-plugins/mysql_plugin/blob/master/operators/mysql_to_s3_operator.py

From there you might be able to use the s3_to_redshift operator to load data from S3 into redshift: https://airflow.readthedocs.io/en/latest/_modules/airflow/operators/s3_to_redshift_operator.html

like image 26
jbielick Avatar answered Oct 23 '22 03:10

jbielick