I have the following task to solve: <blockquote> Files are being sent at irregular times through an endpoint and stored locally. I need to trigger a DAG run for each of these files. For each file the same tasks will be performed </blockquote> Overall the flows looks as follows: For each file, run tasks A->B->C->D Files are being processed in batch. While this task seemed trivial to me, I have found several ways to do this and I am confused about which one is the "proper" one (if any). <h3>First pattern: Use experimental REST API to trigger dag.</h3> That is, expose a web service which ingests the request and the file, stores it to a folder, and uses the experimental REST api to trigger the DAG, by passing the file_id as conf Cons: REST apis are still experimental, not sure how Airflow can handle a load test with many requests coming at one point (which shouldn't happen, but, what if it does?) <h3>Second pattern: 2 dags. One senses and triggers with TriggerDagOperator, one processes.</h3> Always using the same ws as described before, but this time it justs stores the file. Then we have: <ul> <li>First dag: Uses a FileSensor along with the TriggerDagOperator to trigger N dags given N files</li> <li>Second dag: Task A->B->C </li> </ul> Cons: Need to avoid that the same files are being sent to two different DAG runs. Example: Files in folder x.json Sensor finds x, triggers DAG (1) Sensor goes back and scheduled again. If DAG (1) did not process/move the file, the sensor DAG might reschedule a new DAG run with the same file. Which is unwanted. <h3>Third pattern: for file in files, task A->B->C</h3> As seen in this question. Cons: This could work, however what I dislike is that the UI will probably get messed up because every DAG run will not look the same but it will change with the number of files being processed. Also if there are 1000 files to be processed the run would probably be very difficult to read <h3>Fourth pattern: Use subdags</h3> I am not yet sure how they completely work as I have seen they are not encouraged (at the end), however it should be possible to spawn a subdag for each file and have it running. Similar to this question. Cons: Seems like subdags can only be used with the sequential executor. <hr> Am I missing something and over-thinking something that should be (in my mind) quite straight-forward? Thanks

I know I am late, but I would choose the second pattern: "2 dags. One senses and triggers with TriggerDagOperator, one processes", because: <ul> <li>Every file can be executed in parallel</li> <li>The first DAG could pick a file to process, rename it (adding a suffix '_processing' or moving it to a processing folder)</li> <li>If I am a new developer in your company, and I open the workflow, I want to understand what is the logic of workflow doing, rather than which files were processed in the last time was dynamically built</li> <li>If the dag 2, finds an issue with the file, then it renames it (with the '_error' suffix or move it to an error folder)</li> <li>It's a standard way to process files without creating any additional operator</li> <li>it makes de DAG idempotent and easier to test. More info in this article </li> </ul> Renaming and/or moving files is a pretty standard way to process files in every ETL. By the way, I always recommend this article https://medium.com/bluecore-engineering/were-all-using-airflow-wrong-and-how-to-fix-it-a56f14cb0753. It doesn't

Airflow: Proper way to run DAG for each file

First pattern: Use experimental REST API to trigger dag.

That is, expose a web service which ingests the request and the file, stores it to a folder, and uses the experimental REST api to trigger the DAG, by passing the file_id as conf

Cons: REST apis are still experimental, not sure how Airflow can handle a load test with many requests coming at one point (which shouldn't happen, but, what if it does?)

Second pattern: 2 dags. One senses and triggers with TriggerDagOperator, one processes.

Always using the same ws as described before, but this time it justs stores the file. Then we have:

First dag: Uses a FileSensor along with the TriggerDagOperator to trigger N dags given N files
Second dag: Task A->B->C

Cons: Need to avoid that the same files are being sent to two different DAG runs. Example:

Files in folder x.json Sensor finds x, triggers DAG (1)

Sensor goes back and scheduled again. If DAG (1) did not process/move the file, the sensor DAG might reschedule a new DAG run with the same file. Which is unwanted.

Third pattern: for file in files, task A->B->C

As seen in this question.

Cons: This could work, however what I dislike is that the UI will probably get messed up because every DAG run will not look the same but it will change with the number of files being processed. Also if there are 1000 files to be processed the run would probably be very difficult to read

Fourth pattern: Use subdags

I am not yet sure how they completely work as I have seen they are not encouraged (at the end), however it should be possible to spawn a subdag for each file and have it running. Similar to this question.

Cons: Seems like subdags can only be used with the sequential executor.

Am I missing something and over-thinking something that should be (in my mind) quite straight-forward? Thanks

596

asked Feb 05 '20 19:02

arocketman

1 Answers

I know I am late, but I would choose the second pattern: "2 dags. One senses and triggers with TriggerDagOperator, one processes", because:

Every file can be executed in parallel
The first DAG could pick a file to process, rename it (adding a suffix '_processing' or moving it to a processing folder)
If I am a new developer in your company, and I open the workflow, I want to understand what is the logic of workflow doing, rather than which files were processed in the last time was dynamically built
If the dag 2, finds an issue with the file, then it renames it (with the '_error' suffix or move it to an error folder)
It's a standard way to process files without creating any additional operator
it makes de DAG idempotent and easier to test. More info in this article

Renaming and/or moving files is a pretty standard way to process files in every ETL.

By the way, I always recommend this article https://medium.com/bluecore-engineering/were-all-using-airflow-wrong-and-how-to-fix-it-a56f14cb0753. It doesn't

186

answered Sep 24 '22 02:09

Carlos Moreno

Related questions
                            
                                What does pipenv do after installing that takes up so much time and downloads huge amounts of data?
                            
                                Dramatiq doesn't add tasks to the queue
                            
                                Is a Python module import at the bottom ok?
                            
                                Simulation of suicide burn in openai-gym's LunarLander
                            
                                What's difference between using metrics 'acc' and tf.keras.metrics.Accuracy()
                            
                                on_epoch_end() not called in keras fit_generator()
                            
                                How to force all strings to floats? [duplicate]
                            
                                How to remap or revert a point into its former coordinate system after warpAffine has transformed it?
                            
                                how to replace just first instance of max value in dataframe pandas?
                            
                                Post-install script with Python Poetry
                            
                                Clean text images with OpenCV for OCR reading
                            
                                Creating subplots with equal axis scale, Python, matplotlib
                            
                                How do I make more efficient code for a search for multiple strings in column in pandas
                            
                                How to get probability of prediction per entity from Spacy NER model?
                            
                                How to find code that is missing type annotations?
                            
                                Multi-Page Dash App Callbacks Not Registering
                            
                                installing spyder_autopep8 on spyder 4 and getting it to work
                            
                                PyOpenGL how do I import an obj file?
                            
                                RuntimeValueProviderError when creating a google cloud dataflow template with Apache Beam python
                            
                                How to run python on GPU with CuPy?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Airflow: Proper way to run DAG for each file

Tags:

python

batch-processing

etl

airflow

directed-acyclic-graphs