Airflow export all tables of a postgres DB to BigQuery

I'm currently using Airflow PostgresToGoogleCloudStorageOperator and GoogleCloudStorageToBigQueryOperator to export every table of my Postgres DB (hosted on AWS RDS) to BigQuery. It works but I have 75 tables, so Airflow creates 75 * 2 jobs. Since I'm new to Airflow, I don't know if this is a good practice.
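For illustration, a single table's export/load pair looks roughly like this (pre-2.0 contrib import paths; the connection ID, bucket and dataset names are placeholders, and exact operator parameters may vary by Airflow version):

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.postgres_to_gcs_operator import (
    PostgresToGoogleCloudStorageOperator,
)
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

with DAG(
    dag_id="pg_to_bq_single_table",
    start_date=datetime(2022, 11, 1),
    schedule_interval=None,  # triggered manually
    catchup=False,
) as dag:
    # Dump the table to newline-delimited JSON files in GCS.
    export_users = PostgresToGoogleCloudStorageOperator(
        task_id="export_users",
        postgres_conn_id="my_rds_postgres",      # placeholder connection ID
        sql="SELECT * FROM users",
        bucket="my-gcs-bucket",                  # placeholder bucket
        filename="exports/users/part-{}.json",   # {} = chunk number if the export is split
    )
    # Load those files into BigQuery.
    load_users = GoogleCloudStorageToBigQueryOperator(
        task_id="load_users",
        bucket="my-gcs-bucket",
        source_objects=["exports/users/part-*.json"],
        destination_project_dataset_table="my_dataset.users",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
    )
    export_users >> load_users
```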

In any case, I'd like to find a way to export all tables at once (pg_dump?) to GCS and then import them into BigQuery.

Asked Nov 07 '22 by Nicoowr

1 Answer

This is really a matter of opinion, but I think it is fine to have a job with 75 x 2 tasks. It sounds like this job will be run as a once-off or on an infrequent schedule? Bonus points if you use a YAML file to specify the tables and have your DAG read that file to get the list of tables to export.
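Something along these lines would do it, assuming a tables.yaml kept next to the DAG file and the same contrib operators you're already using (the connection ID, bucket, dataset and file paths are placeholders):

```python
from datetime import datetime

import yaml
from airflow import DAG
from airflow.contrib.operators.postgres_to_gcs_operator import (
    PostgresToGoogleCloudStorageOperator,
)
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

# tables.yaml (assumed format), one entry per table to export:
# tables:
#   - users
#   - orders
with open("/path/to/dags/tables.yaml") as f:
    TABLES = yaml.safe_load(f)["tables"]

with DAG(
    dag_id="pg_to_bq_all_tables",
    start_date=datetime(2022, 11, 1),
    schedule_interval=None,  # once-off / manual runs
    catchup=False,
) as dag:
    # One export/load pair per table listed in the YAML file.
    for table in TABLES:
        export = PostgresToGoogleCloudStorageOperator(
            task_id=f"export_{table}",
            postgres_conn_id="my_rds_postgres",
            sql=f"SELECT * FROM {table}",
            bucket="my-gcs-bucket",
            filename=f"exports/{table}/part-{{}}.json",
        )
        load = GoogleCloudStorageToBigQueryOperator(
            task_id=f"load_{table}",
            bucket="my-gcs-bucket",
            source_objects=[f"exports/{table}/part-*.json"],
            destination_project_dataset_table=f"my_dataset.{table}",
            source_format="NEWLINE_DELIMITED_JSON",
            write_disposition="WRITE_TRUNCATE",
            autodetect=True,
        )
        export >> load
```

Adding or removing a table then only means editing the YAML file, not the DAG code.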

If it were something you ran frequently then I think it would be better to use a BashOperator and pg_dump as you suggested, since Airflow can take quite a bit of time to schedule tasks after others have completed. You'd have to do some testing with your Airflow parallelism and scheduler interval (and probably some other Airflow config options) to get it running efficiently.
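A rough sketch of that route, assuming pg_dump and gsutil are available on the Airflow worker and credentials come from the environment (the host, user and bucket names are placeholders). Keep in mind that BigQuery can't load a SQL dump directly, so you'd still need a conversion step (e.g. per-table CSV exports) before the data lands in BigQuery:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="pg_dump_to_gcs",
    start_date=datetime(2022, 11, 1),
    schedule_interval=None,  # once-off / manual runs
    catchup=False,
) as dag:
    # Dump the whole database in one go, then push the file to GCS.
    dump_all_tables = BashOperator(
        task_id="pg_dump_to_gcs",
        bash_command=(
            "pg_dump --host=my-rds-host --username=my_user --dbname=my_db "
            "--no-password --format=plain --file=/tmp/full_dump.sql "
            "&& gsutil cp /tmp/full_dump.sql gs://my-gcs-bucket/dumps/full_dump.sql"
        ),
    )
```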

Answered Nov 11 '22 by Simon D