
Sharing large intermediate state between Airflow tasks

Tags:

airflow

We have an Airflow deployment with Celery executors.

Many of our DAGs require a local processing step on some file, performed in a BashOperator or PythonOperator.

However, as we understand it, the tasks of a given DAG may not always be scheduled on the same machine.

The options I've gathered so far for sharing state between tasks:

  1. Use the LocalExecutor - this may suffice for one team, depending on the load, but may not scale to the wider company

  2. Use XCom - does this have a size limit? Probably unsuitable for large files

  3. Write custom Operators for every combination of tasks that need local processing in between. This approach reduces modularity of tasks and requires replicating existing operators' code.

  4. Use Celery queues to route DAGs to the same worker (docs) - This option seems attractive at first, but what would be an appropriate way to set it up in order to avoid routing everything to one executor, or crafting a million queues? (The sketch after this list shows the basic routing mechanics.)

  5. Use a shared network storage in all machines that run executors - Seems like an additional infrastructure burden, but is a possibility.
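For option 4, this is roughly what queue routing looks like with the CeleryExecutor; the DAG id, queue name, URL and commands below are made up for illustration. A worker started with `airflow worker -q heavy_io` would pick up both tasks, but a queue only guarantees co-location if exactly one worker listens on it, which is precisely the scaling concern above.

```python
# Sketch only: pin both tasks of a DAG to a dedicated Celery queue so that
# the worker(s) listening on that queue handle all of them.
# Queue name, DAG id, URL and file paths are illustrative, not prescribed.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("local_processing_example",
          start_date=datetime(2018, 1, 1),
          schedule_interval=None)

download = BashOperator(
    task_id="download",
    bash_command="curl -o /tmp/data.csv https://example.com/data.csv",
    queue="heavy_io",   # only workers started with `airflow worker -q heavy_io` run this
    dag=dag)

process = BashOperator(
    task_id="process",
    bash_command="wc -l /tmp/data.csv > /tmp/line_count.txt",
    queue="heavy_io",   # same queue, so the same pool of workers handles it
    dag=dag)

download >> process
```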

What is the recommended way to do sharing of large intermediate state, such as files, between tasks in Airflow?

asked Feb 12 '18 by roldugin



1 Answer

To clarify something: no matter how you set up Airflow, there will only be one executor running.

  • The executor runs on the same machine as the scheduler.
  • Currently (Airflow 1.9.0 at the time of writing) there is no safe way to run multiple schedulers, so there will only ever be one executor running.
  • The LocalExecutor executes tasks on the same machine as the scheduler.
  • The CeleryExecutor just puts tasks on a queue to be picked up by the Celery workers.

However, the question you are asking does apply to Celery workers. If you use the CeleryExecutor, you will probably have multiple Celery workers.

Using network shared storage solves multiple problems:

  • Each worker machine sees the same DAGs because they share the same dags folder
  • Results of operators can be stored on the shared file system
  • The scheduler and webserver can also share the dags folder and run on different machines

I would use network storage and write the output file name to XCom. Then, when you need the output from a previous task, you would read the file name from that task's XCom and process that file.
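A minimal sketch of that pattern, assuming every Celery worker mounts the same network share at /mnt/airflow-data (the mount point, DAG id and task ids are assumptions for illustration):

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

SHARED_DIR = "/mnt/airflow-data"  # assumed shared mount (NFS/EFS/...) on every worker


def produce(**context):
    """Write an intermediate file to shared storage and hand its path to XCom."""
    path = os.path.join(SHARED_DIR, "report_{}.csv".format(context["ds_nodash"]))
    with open(path, "w") as f:
        f.write("id,value\n1,42\n")
    # The return value is pushed to XCom under the key 'return_value'.
    return path


def consume(**context):
    """Pull the file path from the upstream task's XCom and process the file."""
    path = context["ti"].xcom_pull(task_ids="produce_file")
    with open(path) as f:
        print(f.read())


dag = DAG("shared_storage_example",
          start_date=datetime(2018, 1, 1),
          schedule_interval=None)

produce_task = PythonOperator(task_id="produce_file",
                              python_callable=produce,
                              provide_context=True,
                              dag=dag)

consume_task = PythonOperator(task_id="consume_file",
                              python_callable=consume,
                              provide_context=True,
                              dag=dag)

produce_task >> consume_task
```

Only the small file path string travels through XCom; the large payload stays on the shared file system, so the XCom size concern from option 2 does not apply.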

answered by jhnclvr