We have an Airflow deployment with Celery executors.
Many of our DAGs require a local processing step on some file in a BashOperator or PythonOperator. However, as we understand it, the tasks of a given DAG may not always be scheduled on the same machine.
The options for state sharing between tasks I've gathered so far:

- Use Local Executors - this may suffice for one team, depending on the load, but may not scale to the wider company.
- Use XCom - does this have a size limit? Probably unsuitable for large files anyway.
- Write custom Operators for every combination of tasks that need local processing in between - this approach reduces the modularity of tasks and requires replicating existing operators' code.
- Use Celery queues to route DAGs to the same worker (docs) - this option seems attractive at first, but what would be an appropriate way to set it up in order to avoid routing everything to one executor, or crafting a million queues? (See the routing sketch after this list.)
- Use shared network storage on all machines that run executors - seems like an additional infrastructure burden, but it is a possibility.
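For the queue option, here is a minimal sketch of what routing could look like, assuming Airflow 2.4+ with the Celery executor and a worker started with `airflow celery worker --queues team_a`. The DAG id, queue name, and commands are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="queue_routing_example",  # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Both tasks are pinned to the same Celery queue, so they only run on
    # workers subscribed to it. Note this guarantees the same *pool* of
    # workers, not the same machine, unless exactly one worker listens
    # on the queue.
    download = BashOperator(
        task_id="download",
        bash_command="echo 'fetch file' > /tmp/data.txt",  # stand-in command
        queue="team_a",
    )
    process = BashOperator(
        task_id="process",
        bash_command="wc -c /tmp/data.txt",  # stand-in command
        queue="team_a",
    )
    download >> process
```

This pins a whole DAG to one queue; scaling that to many teams or DAGs is exactly the "million queues" concern above.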
What is the recommended way to share large intermediate state, such as files, between tasks in Airflow?
To clarify something: no matter how you set up Airflow, there will only ever be one executor running. However, the question you are asking does apply to Celery workers: if you use the Celery Executor, you will probably have multiple Celery workers.
Using shared network storage solves multiple problems here. I would use network storage and write the output file name to XCom. Then, when you need the output from a previous task, you would read the file name from that task's XCom and process that file.
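As a minimal sketch of that pattern, assuming a network mount such as /mnt/airflow-shared that is visible to every Celery worker (the mount path, DAG id, and callables are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

SHARED_DIR = "/mnt/airflow-shared"  # assumption: mounted on every worker

def produce_file(ti):
    # Write the large intermediate file to shared storage; in practice,
    # include the run id in the name to avoid collisions between runs.
    path = f"{SHARED_DIR}/intermediate_{ti.dag_id}.csv"
    with open(path, "w") as f:
        f.write("col_a,col_b\n1,2\n")  # stand-in for the real processing step
    # Push only the small path string through XCom, not the file itself.
    ti.xcom_push(key="file_path", value=path)

def consume_file(ti):
    # Pull the path from the upstream task's XCom and open the shared file.
    path = ti.xcom_pull(task_ids="produce", key="file_path")
    with open(path) as f:
        print(f.read())  # stand-in for downstream processing

with DAG(
    dag_id="shared_storage_example",  # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    produce = PythonOperator(task_id="produce", python_callable=produce_file)
    consume = PythonOperator(task_id="consume", python_callable=consume_file)
    produce >> consume
```

This keeps XCom tiny (just a path string) while the bulk data travels through the shared filesystem, so it works regardless of which worker each task lands on.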