Airflow Dagrun for each datum instead of scheduled

The current problem that I am facing is that I have documents in a MongoDB collection which each need to be processed and updated by tasks which need to run in an acyclic dependency graph. If a task upstream fails to process a document, then none of the dependent tasks may process that document, as that document has not been updated with the prerequisite information.

If I were to use Airflow, this leaves me with two solutions:

  1. Trigger a DAG for each document, and pass in the document ID with --conf. The problem is that this is not how Airflow is intended to be used; I would never be running a scheduled process, and based on how documents appear in the collection, I would be making 1,440 DagRuns per day (a sketch follows the list).

  2. Run a DAG every period to process all documents created in the collection during that period (also sketched below). This follows how Airflow is expected to work, but the problem is that if a task fails to process a single document, none of the dependent tasks may process any of the other documents. Also, if one document takes longer than the others to be processed by a task, all of the other documents are left waiting on that single document before they can continue down the DAG.
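
A minimal sketch of option 1, assuming an Airflow 2.x deployment and a hypothetical process_document DAG: the document ID is passed in with --conf and read back from dag_run.conf inside the task.

# Hypothetical DAG that processes a single document, triggered once per document, e.g.:
#   airflow dags trigger --conf '{"document_id": "abc123"}' process_document
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process(**context):
    # The document ID passed with --conf; present only when the run was triggered with one
    doc_id = context["dag_run"].conf.get("document_id")
    print("Processing document {}".format(doc_id))


with DAG(
    dag_id="process_document",
    start_date=datetime(2019, 10, 1),
    schedule_interval=None,  # never scheduled, only triggered externally
    catchup=False,
) as dag:
    PythonOperator(task_id="process", python_callable=process)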
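
And a sketch of option 2, assuming Airflow 2.2+ (for data_interval_start / data_interval_end in the task context) and a hypothetical MongoDB collection named documents with a created_at field: each scheduled run only picks up the documents created during its own data interval.

# Hypothetical hourly DAG; every run processes the documents created in its data interval.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from pymongo import MongoClient


def process_batch(data_interval_start, data_interval_end, **context):
    coll = MongoClient("mongodb://localhost:27017")["mydb"]["documents"]
    query = {"created_at": {"$gte": data_interval_start, "$lt": data_interval_end}}
    for doc in coll.find(query):
        # Process and update each document here; a single bad document fails the
        # whole task, which is exactly the drawback described above.
        pass


with DAG(
    dag_id="process_documents_hourly",
    start_date=datetime(2019, 10, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="process_batch", python_callable=process_batch)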

Is there a better method than Airflow? Or is there a better way to handle this in Airflow than the two methods I currently see?

asked Oct 16 '19 by Sebastian Mendez


People also ask

How do I change the DAG schedule in Airflow?

To schedule a DAG, Airflow looks at the last execution date and adds the schedule interval; if that time has passed, it runs the DAG. You cannot simply update the start date. A simple way to change the schedule is to edit your start date and schedule interval, rename your DAG (e.g. xxxx_v2.py), and redeploy it.
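
A hypothetical illustration of that workaround, as a renamed copy of the DAG file with the edited settings:

# xxxx_v2.py -- renamed copy of the original DAG file with the new schedule
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="xxxx_v2",                  # new dag_id, so the scheduler treats it as a fresh DAG
    start_date=datetime(2019, 10, 1),  # edited start date
    schedule_interval="@daily",        # edited schedule interval
    catchup=False,
) as dag:
    ...  # same tasks as before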

How do I stop a running DAG in Airflow?

Note that if the DAG is currently running, the Airflow scheduler will restart any tasks you delete. So either stop the DAG first by changing its state, or stop the scheduler (if you are running in a test environment).
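
For example, in Airflow 2.x the DAG can be paused through the stable REST API before clearing or deleting its task instances (the URL, credentials, and dag id below are hypothetical):

import requests

# Pause the DAG so the scheduler stops scheduling new task instances for it
requests.patch(
    "http://localhost:8080/api/v1/dags/my_dag",
    json={"is_paused": True},
    auth=("admin", "admin"),
)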

Can you choose to execute more than one task with the BranchPythonOperator?

You have multiple tasks, but only one of them should be executed depending on a criterion? You've come to the right place! The BranchPythonOperator does exactly what you are looking for. It's common to have DAGs with different execution flows where you want to follow only one according to a value or a condition.
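
A minimal sketch of that pattern, assuming a recent Airflow 2.x install (DAG and task names are made up): the callable returns the task_id of the branch to follow, and every other downstream task is skipped.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def choose_branch(**context):
    # Return the task_id (or a list of task_ids) that should run; the others are skipped
    if context["logical_date"].weekday() < 5:
        return "weekday_task"
    return "weekend_task"


with DAG(
    dag_id="branch_example",
    start_date=datetime(2019, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    branch = BranchPythonOperator(task_id="branch", python_callable=choose_branch)
    branch >> [EmptyOperator(task_id="weekday_task"), EmptyOperator(task_id="weekend_task")]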

What is Start_date in Airflow DAG?

Similarly, since the start_date argument for the DAG and its tasks points to the same logical date, it marks the start of the DAG's first data interval, not when tasks in the DAG will start running. In other words, a DAG run will only be scheduled one interval after start_date .
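
For example, with a @daily schedule the first data interval starts at start_date, and the first DAG run is only triggered once that interval has ended, a day later (dag id hypothetical):

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="start_date_example",
    start_date=datetime(2019, 10, 16),  # start of the first data interval
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ...  # the first run covers 2019-10-16 -> 2019-10-17 and only starts at 2019-10-17 00:00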


1 Answer

From the knowledge I gained in my attempt to answer this question, I've come to the conclusion that Airflow is just not the tool for the job.

Airflow is designed for scheduled, idempotent DAGs. A DagRun must also have a unique execution_date; this means that running the same DAG at the exact same start time (in the case that we receive two documents at the same time) is quite literally impossible. Of course, we can schedule the next DagRun immediately in succession, but this limitation should demonstrate that any attempt to use Airflow in this fashion will always be, to an extent, a hack.

The most viable solution I've found is to instead use Prefect, which was developed with the intention of overcoming some of the limitations of Airflow:

"Prefect assumes that flows can be run at any time, for any reason."

Prefect's equivalent of a DAG is a Flow; one key advantage of a Flow that we can take advantage of here is the ease of parametrization. Then, with some threads, we're able to run a Flow for each element in a stream. Here is an example streaming ETL pipeline:

import time
from threading import Thread

from prefect import task, Flow, Parameter


def stream():
    # Simulate documents arriving over time
    for x in range(10):
        yield x
        time.sleep(1)


@task
def extract(x):
    # If 'x' referenced a document, in this step we could load that document
    return x


@task
def transform(x):
    return x * 2


@task
def load(y):
    print("Received y: {}".format(y))


with Flow("ETL") as flow:
    x_param = Parameter('x')
    e = extract(x_param)
    t = transform(e)
    l = load(t)

for x in stream():
    # Run the flow in its own thread for each incoming element,
    # passing the element as the 'x' parameter
    thread = Thread(target=flow.run, kwargs={"x": x})
    thread.start()

answered Sep 20 '22 by Sebastian Mendez