I'm currently using Airflow's PostgresToGoogleCloudStorageOperator and GoogleCloudStorageToBigQueryOperator to export every table of my Postgres DB (hosted on AWS RDS) to BigQuery. It works, but I have 75 tables, so Airflow creates 75 * 2 tasks. Since I'm new to Airflow, I don't know if this is good practice.
In any case, I'd like to find a way to export all tables at once (pg_dump?) to GCS and then import them into BigQuery.
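For reference, here's roughly what my DAG looks like (connection IDs, bucket and dataset names are placeholders, and the exact import paths and parameters depend on your Airflow version):

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.postgres_to_gcs_operator import (
    PostgresToGoogleCloudStorageOperator,
)
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

TABLES = ["users", "orders"]  # ...75 table names in total

with DAG(
    dag_id="pg_rds_to_bigquery",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for table in TABLES:
        # Task 1: dump the table from Postgres to newline-delimited JSON in GCS.
        export = PostgresToGoogleCloudStorageOperator(
            task_id=f"export_{table}",
            postgres_conn_id="rds_postgres",             # placeholder connection ID
            sql=f"SELECT * FROM {table}",
            bucket="my-gcs-bucket",                      # placeholder bucket
            filename=f"exports/{table}/part-{{}}.json",  # {} becomes a file number
        )
        # Task 2: load the exported files from GCS into BigQuery.
        load = GoogleCloudStorageToBigQueryOperator(
            task_id=f"load_{table}",
            bucket="my-gcs-bucket",
            source_objects=[f"exports/{table}/part-*.json"],
            destination_project_dataset_table=f"my_dataset.{table}",  # placeholder dataset
            source_format="NEWLINE_DELIMITED_JSON",
            write_disposition="WRITE_TRUNCATE",
        )
        export >> load
```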
This is a matter of opinion really, but I think it is fine to have a job with 75 x 2 tasks. It sounds like this job will be a one-off or run on an infrequent schedule? Bonus points if you use a YAML file to specify the tables and have your DAG read that file to get the list of tables to export, as in the sketch below.
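Something like this is what I mean by the YAML approach (the file name and layout are just an example):

```python
# tables.yaml (hypothetical file sitting next to the DAG):
#
# tables:
#   - users
#   - orders
#   - invoices

import os

import yaml

DAG_DIR = os.path.dirname(__file__)

# Read the table list once at DAG-parse time.
with open(os.path.join(DAG_DIR, "tables.yaml")) as f:
    TABLES = yaml.safe_load(f)["tables"]

# TABLES then drives the same for-loop that builds the export/load task pairs,
# so adding or removing a table is a one-line change to the YAML file.
```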
If it were something you ran frequently, then I think it would be better to use a BashOperator and pg_dump as you suggested, since Airflow can take quite a bit of time to schedule tasks after others have completed. You'd have to do some testing with your Airflow parallelism, scheduler interval, and probably some other Airflow config options to get it running efficiently.