
Use Airflow for frequent tasks

We have been using Airflow for a while, and it is just great.

Now we are considering moving some of our very frequent tasks onto our Airflow server too.

Let's say I have a script running every second.

What's the best practice for scheduling it with Airflow?

  1. Run this script in a DAG that is scheduled every second. I highly doubt this is the solution, since there is significant overhead for each DAG run.

  2. Run this script in a while loop that stops after 6 hours, then schedule it on Airflow to run every 6 hours?

  3. Create a DAG with no schedule and put the task in a while True loop with a proper sleep time, so the task never terminates unless there is an error.

  4. Any other suggestions?

  5. Or is this kind of task just not suitable for Airflow? Should I do it with a Lambda function and an AWS scheduler instead?

Cheers!

qichao_he asked Oct 17 '22 20:10



1 Answer

What's the best practice to schedule it

  5. ... this kind of task is just not suitable for Airflow?

It is not suitable.

In particular, your Airflow is probably configured to re-examine the set of DAGs every 5 seconds, which doesn't sound like a good fit for a 1-second task. Plus, the ratio of scheduling overhead to work performed would not be attractive. I suppose you could schedule five simultaneous tasks, twelve times per minute, and have them sleep zero to four seconds, but that's just crazy. And you would likely need to "lock against yourself" to keep simultaneous sibling tasks from stepping on each other's toes.
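To make "lock against yourself" concrete, here is a minimal sketch using an advisory file lock from Python's standard `fcntl` module (Unix only). The `try_lock` helper and the lock file path are hypothetical names chosen for illustration; nothing here is Airflow-specific.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking advisory lock on `path`.

    Returns the open file object on success (keep it open to hold the lock),
    or None if some other holder already has the lock.
    """
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


# Hypothetical lock file; any path shared by all sibling tasks would do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "every_second_job.lock")

first = try_lock(LOCK_PATH)    # the first task acquires the lock
second = try_lock(LOCK_PATH)   # a sibling is refused while the lock is held
first.close()                  # closing the file releases the lock
third = try_lock(LOCK_PATH)    # the lock can be acquired again
third.close()
```

A sibling task that gets `None` back would simply skip its run. That keeps the tasks from overlapping, but it does nothing about the scheduling overhead described above.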

The six-hour suggestion (2.) is not crazy. I will treat it as a sixty-minute @hourly task instead, since the overheads are similar. Exiting after an hour and letting Airflow respawn the task has several benefits. Log rolling happens at regular intervals. If your program crashes, it will be restarted before too long. If your host reboots, again your program is restarted before too long. The downside is that your business need may view "more than a minute" as "much too long". And coordinating overlapping tasks, or gaps between tasks, at the hour boundary may pose some issues.
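The "run for a while, then exit and get respawned" idea could look roughly like the sketch below. It is plain Python with no Airflow imports; `run_until_deadline` and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the loop can be exercised without waiting.

```python
import time


def run_until_deadline(job, run_seconds, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call `job` roughly every `interval` seconds for `run_seconds`, then return.

    Returns the number of times `job` was called.
    """
    start = clock()
    deadline = start + run_seconds
    calls = 0
    next_tick = start
    while next_tick < deadline:
        job()
        calls += 1
        # Schedule ticks from the original start time so that a slow job
        # does not accumulate drift across iterations.
        next_tick = start + calls * interval
        # Never sleep past the deadline.
        delay = min(next_tick, deadline) - clock()
        if delay > 0:
            sleep(delay)
    return calls
```

An hourly task would call this with `run_seconds` somewhat below 3600, so that one run has finished before the next one starts. How much margin is needed is exactly the hour-boundary coordination issue mentioned above.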

Your stated needs exactly match the problem that Supervisor addresses. Just use that. You will always have exactly one copy of your event loop running, even if the app crashes, even if the host crashes. Log rolling and other administrative details have already been addressed. The code base is mature and lots of folks have beat on it and incorporated their feature requests. It fits what you want.
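For illustration, the kind of event loop you would hand to Supervisor might be sketched as follows. Supervisor stops programs with SIGTERM by default, so the loop listens for that signal and exits cleanly. The names `event_loop`, `request_stop` and `main` are hypothetical, and the job is a placeholder.

```python
import signal
import threading

# Set when the process has been asked to shut down.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: ask the loop to finish after the current iteration."""
    stop_requested.set()


def event_loop(job, interval=1.0):
    """Call `job` every `interval` seconds until a stop is requested.

    Returns the number of times `job` was called.
    """
    calls = 0
    while not stop_requested.is_set():
        job()
        calls += 1
        # wait() returns early if the stop event is set during the pause.
        stop_requested.wait(interval)
    return calls


def main():
    """Entry point for the supervised process (assumed layout)."""
    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGINT, request_stop)
    event_loop(lambda: print("tick"))  # placeholder job
```

The script would call `main()` when started. If the job raises, the process dies, and restarting it is left to the process manager rather than handled in the loop.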

J_H answered Oct 21 '22 00:10
