<ul> <li>I have a group of job units (workers) that I want to run as a DAG</li> <li>Group1 has 10 workers and each worker does multiple table extracts from a DB. Note that each worker maps to a single DB instance and each worker needs to successfully deal with 100 tables in total before it can successfully mark itself as complete</li> <li>Group1 has a limitation that says no more than 5 tables across all those 10 workers should be consumed at a time. For example: <ul> <li>Worker1 is extracting 2 tables</li> <li>Worker2 is extracting 2 tables</li> <li>Worker3 is extracting 1 table</li> <li>Worker4...Worker10 need to wait until Worker1...Worker3 relinquishes the threads</li> <li>Worker4...Worker10 can pick up tables as soon as threads in step1 frees up</li> <li>As each worker completes all the 100 tables, it proceeds to step2 without waiting. Step2 has no concurrency limits</li> </ul> </li> </ul> <p>I should be able to create a single node Group1 that caters to the throttling and also have</p> <ul> <li>10 independent nodes of workers so I can restart them in case if anyone of it fails</li> </ul> <p>I have tried explaining this in the following diagram: <img src="https://i.stack.imgur.com/Ug4nG.png" alt="enter image description here"></p> <ul> <li>If any of the worker fails, I can restart it without affecting other workers. It still uses the same thread pool from Group1 so the concurrency limits are enforced</li> <li>Group1 would complete once all elements of step1 and step2 are complete</li> <li>Step2 doesn't have any concurrency measures</li> </ul> <p>How do I implement such a hierarchy in Airflow for a Spring Boot Java application? Is it possible to design this kind of DAG using Airflow constructs and dynamically tell Java application how many tables it can extract at a time. For instance, if all workers except Worker1 are finished, Worker1 can now use all 5 threads available while everything else will proceed to step2.</p>

<p>These constraints cannot be modeled as a directed acyclic graph, and thus cannot implemented in airflow exactly as described. However, they can be modeled as queues, and thus could be implemented with a job queue framework. Here are your two options:</p> <h3>Implement suboptimally as airflow DAG:</h3> <pre class="prettyprint lang-py prettyprint-override"><code>from airflow.models import DAG from airflow.operators.subdag_operator import SubDagOperator # Executors that inherit from BaseExecutor take a parallelism parameter from wherever import SomeExecutor, SomeOperator # Table load jobs are done with parallelism 5 load_tables = SubDagOperator(subdag=DAG("load_tables"), executor=SomeExecutor(parallelism=5)) # Each table load must be it's own job, or must be split into sets of tables of predetermined size, such that num_tables_per_job * parallelism = 5 for table in tables: load_table = SomeOperator(task_id=f"load_table_{table}", dag=load_tables) # Jobs done afterwards are done with higher parallelism afterwards = SubDagOperator( subdag=DAG("afterwards"), executor=SomeExecutor(parallelism=high_parallelism) ) for job in jobs: afterward_job = SomeOperator(task_id=f"job_{job}", dag=afterwards) # After _all_ table load jobs are complete, start the jobs that should be done afterwards load_tables > afterwards </code></pre> <p>The suboptimal aspect here, is that, for the first half of the DAG, the cluster will be underutilized by <code>higher_parallelism - 5</code>.</p> <h3>Implement optimally with job queue:</h3> <pre class="prettyprint lang-py prettyprint-override"><code># This is pseudocode, but could be easily adapted to a framework like Celery # You need two queues # The table load queue should be initialized with the job items table_load_queue = Queue(initialize_with_tables) # The queue for jobs to do afterwards starts empty afterwards_queue = Queue() def worker(): # Work while there's at least one item in either queue while not table_load_queue.empty() or not afterwards_queue.empty(): working_on_table_load = [worker.is_working_table_load for worker in scheduler.active()] # Work table loads if we haven't reached capacity, otherwise work the jobs afterwards if sum(working_on_table_load) < 5: is_working_table_load = True task = table_load_queue.dequeue() else is_working_table_load = False task = afterwards_queue.dequeue() if task: after = work(task) if is_working_table_load: # After working a table load, create the job to work afterwards afterwards_queue.enqueue(after) # Use all the parallelism available scheduler.start(worker, num_workers=high_parallelism) </code></pre> <p>Using this approach, the cluster won't be underutilized.</p>

Architecturing Airflow DAG that needs contextual throttling

Tags:

java

multithreading

etl

airflow

airflow-scheduler

I have a group of job units (workers) that I want to run as a DAG
Group1 has 10 workers and each worker does multiple table extracts from a DB. Note that each worker maps to a single DB instance and each worker needs to successfully deal with 100 tables in total before it can successfully mark itself as complete
Group1 has a limitation that says no more than 5 tables across all those 10 workers should be consumed at a time. For example:
- Worker1 is extracting 2 tables
- Worker2 is extracting 2 tables
- Worker3 is extracting 1 table
- Worker4...Worker10 need to wait until Worker1...Worker3 relinquishes the threads
- Worker4...Worker10 can pick up tables as soon as threads in step1 frees up
- As each worker completes all the 100 tables, it proceeds to step2 without waiting. Step2 has no concurrency limits

I should be able to create a single node Group1 that caters to the throttling and also have

10 independent nodes of workers so I can restart them in case if anyone of it fails

I have tried explaining this in the following diagram: enter image description here

If any of the worker fails, I can restart it without affecting other workers. It still uses the same thread pool from Group1 so the concurrency limits are enforced
Group1 would complete once all elements of step1 and step2 are complete
Step2 doesn't have any concurrency measures

How do I implement such a hierarchy in Airflow for a Spring Boot Java application? Is it possible to design this kind of DAG using Airflow constructs and dynamically tell Java application how many tables it can extract at a time. For instance, if all workers except Worker1 are finished, Worker1 can now use all 5 threads available while everything else will proceed to step2.

947

asked May 25 '20 21:05

Rafay

1 Answers

These constraints cannot be modeled as a directed acyclic graph, and thus cannot implemented in airflow exactly as described. However, they can be modeled as queues, and thus could be implemented with a job queue framework. Here are your two options:

Implement suboptimally as airflow DAG:

from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator
# Executors that inherit from BaseExecutor take a parallelism parameter
from wherever import SomeExecutor, SomeOperator

# Table load jobs are done with parallelism 5
load_tables = SubDagOperator(subdag=DAG("load_tables"), executor=SomeExecutor(parallelism=5))

# Each table load must be it's own job, or must be split into sets of tables of predetermined size, such that num_tables_per_job * parallelism = 5
for table in tables:
    load_table = SomeOperator(task_id=f"load_table_{table}", dag=load_tables)

# Jobs done afterwards are done with higher parallelism
afterwards = SubDagOperator(
    subdag=DAG("afterwards"), executor=SomeExecutor(parallelism=high_parallelism)
)

for job in jobs:
    afterward_job = SomeOperator(task_id=f"job_{job}", dag=afterwards)

# After _all_ table load jobs are complete, start the jobs that should be done afterwards

load_tables > afterwards

The suboptimal aspect here, is that, for the first half of the DAG, the cluster will be underutilized by higher_parallelism - 5.

Implement optimally with job queue:

# This is pseudocode, but could be easily adapted to a framework like Celery

# You need two queues
# The table load queue should be initialized with the job items
table_load_queue = Queue(initialize_with_tables)
# The queue for jobs to do afterwards starts empty
afterwards_queue = Queue()

def worker():

    # Work while there's at least one item in either queue
    while not table_load_queue.empty() or not afterwards_queue.empty():
        working_on_table_load = [worker.is_working_table_load for worker in scheduler.active()]

        # Work table loads if we haven't reached capacity, otherwise work the jobs afterwards
        if sum(working_on_table_load) < 5:
            is_working_table_load = True
            task = table_load_queue.dequeue()
        else
            is_working_table_load = False
            task = afterwards_queue.dequeue()

        if task:
            after = work(task)
            if is_working_table_load:

                # After working a table load, create the job to work afterwards
                afterwards_queue.enqueue(after)

# Use all the parallelism available
scheduler.start(worker, num_workers=high_parallelism)

Using this approach, the cluster won't be underutilized.

135

answered Sep 29 '22 23:09

Dave

Related questions
                            
                                jdk.nashorn.internal.ir.annotations does not exist
                            
                                Keep bluetooth service running for all fragments
                            
                                Spring weirdly depends on class location
                            
                                Message is received from Google Pub/Sub subscription again and again after acknowledge[Heisenbug]
                            
                                Java 11 module-info and annotation processors
                            
                                What is the proper replacement for ViewModelProviders deprecation?
                            
                                Hamcrest Matcher signer information does not match signer information of other classes in the same package
                            
                                AutocompleteTextView click listener called twice
                            
                                JDK 11 vs JDK 13 performance
                            
                                Could not handle mustUnderstand headers: {http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd}Security. Returning fault
                            
                                vsCode java.test.config vmArgs not working
                            
                                How to perform queries on hierarchical entities using querydsl or spring data jpa specification?
                            
                                Android - Open PDF Doc with Intent is not saving after close
                            
                                Kotlin result type in java
                            
                                How could I copy the RAM of a running application, save it, and reload it into RAM later on?
                            
                                Gradle task `bootrun` setup in IntelliJ
                            
                                How to customize pie chart using MPAndroidChart in Android?
                            
                                Make MySQL Connector reuse a SSLSessionContextImpl
                            
                                Not using IsolationActivity via AndroidBenchmarkRunner
                            
                                DynamoDb query with filter expression in Java

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With