 

Architecting an Airflow DAG that needs contextual throttling

  • I have a group of job units (workers) that I want to run as a DAG
  • Group1 has 10 workers, and each worker does multiple table extracts from a DB. Note that each worker maps to a single DB instance, and each worker needs to successfully process 100 tables in total before it can mark itself as complete
  • Group1 has a limitation that no more than 5 tables across all 10 workers may be extracted at a time. For example:
    • Worker1 is extracting 2 tables
    • Worker2 is extracting 2 tables
    • Worker3 is extracting 1 table
    • Worker4...Worker10 need to wait until Worker1...Worker3 relinquish their threads
    • Worker4...Worker10 can pick up tables as soon as threads in step1 free up
    • As each worker completes all 100 of its tables, it proceeds to step2 without waiting. Step2 has no concurrency limits

I should be able to create a single node, Group1, that handles the throttling, and also have

  • 10 independent worker nodes so I can restart any one of them if it fails

I have tried explaining this in the following diagram: [diagram]

  • If any of the workers fails, I can restart it without affecting the other workers. It still uses the same thread pool from Group1, so the concurrency limits are enforced
  • Group1 would complete once all elements of step1 and step2 are complete
  • Step2 doesn't have any concurrency measures

How do I implement such a hierarchy in Airflow for a Spring Boot Java application? Is it possible to design this kind of DAG using Airflow constructs and dynamically tell the Java application how many tables it can extract at a time? For instance, if all workers except Worker1 are finished, Worker1 can then use all 5 available threads while everything else proceeds to step2.

Asked May 25 '20 by Rafay


1 Answer

These constraints cannot be modeled as a directed acyclic graph, and thus cannot be implemented in Airflow exactly as described. However, they can be modeled as queues, and thus could be implemented with a job queue framework. Here are your two options:

Implement suboptimally as an Airflow DAG:

from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator
# Executors that inherit from BaseExecutor take a parallelism parameter
from wherever import SomeExecutor, SomeOperator

main_dag = DAG("main")

# Table load jobs are done with parallelism 5
# (a SubDAG's dag_id must be "<parent_dag_id>.<task_id>")
load_tables_subdag = DAG("main.load_tables")
load_tables = SubDagOperator(
    task_id="load_tables",
    subdag=load_tables_subdag,
    executor=SomeExecutor(parallelism=5),
    dag=main_dag,
)

# Each table load must be its own task, or tables must be split into sets of a
# predetermined size, so that at most 5 tables are being loaded at any time
for table in tables:
    load_table = SomeOperator(task_id=f"load_table_{table}", dag=load_tables_subdag)

# Jobs done afterwards are done with higher parallelism
afterwards_subdag = DAG("main.afterwards")
afterwards = SubDagOperator(
    task_id="afterwards",
    subdag=afterwards_subdag,
    executor=SomeExecutor(parallelism=high_parallelism),
    dag=main_dag,
)

for job in jobs:
    afterward_job = SomeOperator(task_id=f"job_{job}", dag=afterwards_subdag)

# After _all_ table load jobs are complete, start the jobs that should be done afterwards
load_tables >> afterwards

The suboptimal aspect here is that, for the first half of the DAG, the cluster will be underutilized by high_parallelism - 5 slots (for example, with high_parallelism = 20, 15 slots sit idle until the last table load finishes).

Implement optimally with a job queue:

# This is pseudocode, but could be easily adapted to a framework like Celery

# You need two queues
# The table load queue should be initialized with the job items
table_load_queue = Queue(initialize_with_tables)
# The queue for jobs to do afterwards starts empty
afterwards_queue = Queue()

def worker():

    # Work while there's at least one item in either queue
    while not table_load_queue.empty() or not afterwards_queue.empty():
        working_on_table_load = [worker.is_working_table_load for worker in scheduler.active()]

        # Work table loads if we haven't reached capacity, otherwise work the jobs afterwards
        if sum(working_on_table_load) < 5:
            is_working_table_load = True
            task = table_load_queue.dequeue()
        else:
            is_working_table_load = False
            task = afterwards_queue.dequeue()

        if task:
            after = work(task)
            if is_working_table_load:

                # After working a table load, create the job to work afterwards
                afterwards_queue.enqueue(after)

# Use all the parallelism available
scheduler.start(worker, num_workers=high_parallelism)

Using this approach, the cluster won't be underutilized.
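
For reference, here is a minimal, self-contained sketch of the same two-queue idea using only the Python standard library; it is an illustration under assumed names and numbers (work_table, work_afterwards, the table list and the thread counts are placeholders), not the Celery deployment the pseudocode alludes to. A semaphore plays the role of the sum(working_on_table_load) < 5 check above.

import queue
import threading

TABLE_LOAD_LIMIT = 5       # at most 5 table extracts in flight at any time
HIGH_PARALLELISM = 20      # total worker threads (placeholder value)

def work_table(table):
    # Placeholder: extract one table, then return the follow-up job it spawns
    return f"afterwards_job_for_{table}"

def work_afterwards(job):
    # Placeholder: post-load processing, which has no concurrency limit
    pass

tables = [f"table_{i}" for i in range(100)]   # placeholder table list

table_load_queue = queue.Queue()
for name in tables:
    table_load_queue.put(name)
afterwards_queue = queue.Queue()

table_slots = threading.BoundedSemaphore(TABLE_LOAD_LIMIT)
remaining = len(tables) * 2                   # each table load yields one follow-up job
remaining_lock = threading.Lock()

def worker():
    global remaining
    while True:
        with remaining_lock:
            if remaining == 0:
                return
        # Prefer a table load if one of the 5 slots is free
        if table_slots.acquire(blocking=False):
            try:
                table = table_load_queue.get_nowait()
            except queue.Empty:
                table_slots.release()
            else:
                afterwards_queue.put(work_table(table))  # slot is held while loading
                table_slots.release()
                with remaining_lock:
                    remaining -= 1
                continue
        # Otherwise work whatever follow-up jobs exist so far
        try:
            job = afterwards_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        work_afterwards(job)
        with remaining_lock:
            remaining -= 1

threads = [threading.Thread(target=worker) for _ in range(HIGH_PARALLELISM)]
for t in threads:
    t.start()
for t in threads:
    t.join()

In a real job queue framework the queues and the worker pool would be managed by the broker, but the throttling logic stays the same.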

Answered Sep 29 '22 by Dave