Say you want to schedule recurring tasks, such as:
And you want to do this for a reasonable number of users in a web app - ie. 100k users each user can decide what they want scheduled when.
And you want to ensure that the scheduled items run, even if they were missed originally - eg. for some reason the email didn't get sent on wednesday at 10am, it should get sent out at the next checking interval, say wednesday at 11am.
How would you design that?
If you use cron to trigger your scheduling app every x minutes, what's a good way to implement the part that decides what should run at each point in time?
The cron-like implementations I've seen compare the current time to the trigger time for all specified items, but I'd like to deal with missed items as well.
I have a feeling there's a more clever design than the one I'm cooking up, so please enlighten me.
There's 2 designs, basically.
One runs regularly and compares the current time to the scheduling spec (i.e. "Does this run now?"), and executes those that qualify.
The other technique takes the current scheduling spec and finds the NEXT time that the item should fire. Then, it compares the current time to all of those items who's "next time" is less than "current time", and fires those. Then, when an item is complete, it is rescheduled for the new "next time".
The first technique can not handle "missed" items, the second technique can only handle those items that were previously scheduled.
Specifically consider you you have a schedule that runs once every hour, at the top of the hour.
So, say, 1pm, 2pm, 3pm, 4pm.
At 1:30pm, the run task is down and not executing any processes. It does not start again until 3:20pm.
Using the first technique, the scheduler will have fired the 1pm task, but not fired the 2pm, and 3pm tasks, as it was not running when those times passed. The next job to run will be the 4pm job, at, well, 4pm.
Using the second technique, the scheduler will have fired the 1pm task, and scheduled the next task at 2pm. Since the system was down, the 2pm task did not run, nor did the 3pm task. But when the system restarted at 3:20, it saw that it "missed" the 2pm task, and fired it off at 3:20, and then scheduled it again for 4pm.
Each technique has it's ups and downs. With the first technique, you miss jobs. With the second technique you can still miss jobs, but it can "catch up" (to a point), but it may also run a job "at the wrong time" (maybe it's supposed to run at the top of the hour for a reason).
A benefit of the second technique is that if you reschedule at the END of the executing job, you don't have to worry about a cascading job problem.
Consider that you have a job that runs every minute. With the first technique, the job gets fired each minute. However, typically, if the job is not FINISHED within it's minute, then you can potentially have 2 jobs running (one late in the process, the other starting up). This can be a problem if the job is not designed to run more than once simultaneously. And it can exacerbate (if there's a real problem, after 10 minutes you have 10 jobs all fighting each other).
With the second technique, if you schedule at the end of the job, then if a job happens to run just over a minute, then you'll "skip" a minute" and start up the following minute rather than run on top of itself. So, you can have a job scheduled for every minute actually run at 1:01pm, 1:03pm, 1:05pm, etc.
Depending on your job design, either of these can be "good" or "bad". There's no right answer here.
Finally, implementing the first technique is really, quite trivial compared to implementing the second. The code to determine if a cron string (say) matches a given time is simple compared to deriving what time a cron string will be valid NEXT. I know, and I have a couple hundred lines of code to prove it. It's not pretty.
In case you want to skip designing and start using have a look at Celery. The scheduler is called celerybeat.
Edit: Also relevant: How to send 100,000 emails weekly?
Using a backing Java process with Quartz scheduler is a likely potential solution. I believe Quartz should scale to this level reasonably well. See this related SO question: "How to scale the Quartz Scheduler"...
If you take a careful look at the Quartz documentation, I think you'll find that your concerns regarding triggering and missed executions are dealt with cleanly, and offer a number of suitable policies to choose from. In terms of scalability, I believe you can store jobs in a JDBC backing store.
Struck out, since the questioner was specifically looking for a design discussion...
If you framed your initial StackOverflow search prior to asking the question in terms of "task schedulers for Python", you would have turned this up: "An enterprise scheduler for python...". I strongly suggest looking for an existing implementation rather than attempting a NIH development for something like this, despite the great observations about how you might do this in the other answer. Given your stated scalability goals, you're biting off a fairly challenging task, and you should eliminate all other options before going down the from-scratch road on a topic as heavily developed as this one. One possible avenue to consider would be adaptation to the well-regarded Quartz
via Jython, and determine whether your use cases could be handled in that context with minimal dipping into the Java world (presumably not your first choice).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With