
Sharing large intermediate state between Airflow tasks

Tags:

airflow

We have an Airflow deployment with Celery executors.

Many of our DAGs require a local processing step on some file, performed in a BashOperator or PythonOperator.

However, as we understand it, the tasks of a given DAG may not always be scheduled on the same machine.

The options I've gathered so far for sharing state between tasks:

  1. Use the LocalExecutor - this may suffice for one team, depending on the load, but may not scale to the wider company

  2. Use XCom - does this have a size limit? Probably unsuitable for large files

  3. Write custom Operators for every combination of tasks that need local processing in between. This approach reduces modularity of tasks and requires replicating existing operators' code.

  4. Use Celery queues to route DAGs to the same worker (docs) - This option seems attractive at first, but what would be an appropriate way to set it up in order to avoid routing everything to one executor, or crafting a million queues? (The sketch after this list shows the basic routing mechanics.)

  5. Use a shared network storage in all machines that run executors - Seems like an additional infrastructure burden, but is a possibility.
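For option 4, this is roughly what queue routing looks like with the CeleryExecutor; the DAG id, queue name, URL and commands below are made up for illustration. A worker started with `airflow worker -q heavy_io` would pick up both tasks, but a queue only guarantees co-location if exactly one worker listens on it, which is precisely the scaling concern above.

```python
# Sketch only: pin both tasks of a DAG to a dedicated Celery queue so that
# the worker(s) listening on that queue handle all of them.
# Queue name, DAG id, URL and file paths are illustrative, not prescribed.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("local_processing_example",
          start_date=datetime(2018, 1, 1),
          schedule_interval=None)

download = BashOperator(
    task_id="download",
    bash_command="curl -o /tmp/data.csv https://example.com/data.csv",
    queue="heavy_io",   # only workers started with `airflow worker -q heavy_io` run this
    dag=dag)

process = BashOperator(
    task_id="process",
    bash_command="wc -l /tmp/data.csv > /tmp/line_count.txt",
    queue="heavy_io",   # same queue, so the same pool of workers handles it
    dag=dag)

download >> process
```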

What is the recommended way to do sharing of large intermediate state, such as files, between tasks in Airflow?

asked Feb 12 '18 by roldugin



1 Answer

To clarify something: no matter how you set up Airflow, there will only be one executor running.

  • The executor runs on the same machine as the scheduler.
  • Currently (Airflow 1.9.0 at the time of writing) there is no safe way to run multiple schedulers, so there will only ever be one executor running.
  • The LocalExecutor executes tasks on the same machine as the scheduler.
  • The CeleryExecutor just puts tasks on a queue to be picked up by the Celery workers.

However, the question you are asking does apply to Celery workers. If you use the CeleryExecutor, you will probably have multiple Celery workers.

Using network shared storage solves multiple problems:

  • Each worker machine sees the same DAGs because they share the same dags folder
  • Results of operators can be stored on the shared file system
  • The scheduler and webserver can also share the dags folder and run on different machines

I would use network storage and write the output file name to XCom. Then, when you need the output from a previous task, you would read the file name from that task's XCom and process that file.
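A minimal sketch of that pattern, assuming every Celery worker mounts the same network share at /mnt/airflow-data (the mount point, DAG id and task ids are assumptions for illustration):

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

SHARED_DIR = "/mnt/airflow-data"  # assumed shared mount (NFS/EFS/...) on every worker


def produce(**context):
    """Write an intermediate file to shared storage and hand its path to XCom."""
    path = os.path.join(SHARED_DIR, "report_{}.csv".format(context["ds_nodash"]))
    with open(path, "w") as f:
        f.write("id,value\n1,42\n")
    # The return value is pushed to XCom under the key 'return_value'.
    return path


def consume(**context):
    """Pull the file path from the upstream task's XCom and process the file."""
    path = context["ti"].xcom_pull(task_ids="produce_file")
    with open(path) as f:
        print(f.read())


dag = DAG("shared_storage_example",
          start_date=datetime(2018, 1, 1),
          schedule_interval=None)

produce_task = PythonOperator(task_id="produce_file",
                              python_callable=produce,
                              provide_context=True,
                              dag=dag)

consume_task = PythonOperator(task_id="consume_file",
                              python_callable=consume,
                              provide_context=True,
                              dag=dag)

produce_task >> consume_task
```

Only the small file path string travels through XCom; the large payload stays on the shared file system, so the XCom size concern from option 2 does not apply.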

answered by jhnclvr