Cron on AWS (or distributed systems in general)

I am surprised I was not able to find more on this, but alas, I still cannot find the answer. We recently migrated to AWS, moving our simple website to a more robust and reliable system. What is currently baffling me is how to manage cron jobs on a distributed system when the same cron job gets pushed to every instance in the environment.

Here's the use case:

Background

Setup

We are running a traditional LAMP stack. That is probably the first problem, but it's what we've got.

DB Tables

table1

 - id int(11)
 - start date
 - interval int(11) (number of seconds)

table2

 - id int(11)
 - table1_id int(11)
 - sent datetime

Goal

The goal is that a script will run once every day and check the following:

  1. The current date is past table1.start (table1.start < current date)
  2. table1.interval > 0
  3. today is a whole number of intervals after table1.start (so the check fails if the interval is 7 days [in seconds] and it is the 6th day)
  4. there is no entry in table2 such that table2.sent is today and table2.table1_id matches a row that passed the previous checks.

If all these checks pass, we insert an entry into table2 for each table1 row that qualifies. We then send an email based on the data in table2.
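To make the checks above concrete, here is a minimal sketch of them as a single predicate. It is an illustration only: the function name `is_due` and its parameters are hypothetical, it assumes `table1.start` is a plain date, and it assumes that "a whole interval away" means the number of seconds in the whole days elapsed since `start` is an exact multiple of `interval`.

```python
from datetime import date

SECONDS_PER_DAY = 86_400


def is_due(start: date, interval_seconds: int, today: date, sent_today: bool) -> bool:
    """Return True when a table1 row passes every check for `today`.

    `sent_today` stands in for the table2 lookup: it is True when a table2
    row already exists for this table1 row with `sent` falling on `today`.
    """
    if interval_seconds <= 0:      # the interval must be positive
        return False
    if start >= today:             # the start date must be in the past
        return False
    elapsed_seconds = (today - start).days * SECONDS_PER_DAY
    if elapsed_seconds % interval_seconds != 0:
        return False               # not on a whole interval boundary
    return not sent_today          # skip rows already handled today


if __name__ == "__main__":
    week = 7 * SECONDS_PER_DAY
    print(is_due(date(2012, 7, 9), week, date(2012, 7, 15), False))  # 6th day
    print(is_due(date(2012, 7, 9), week, date(2012, 7, 16), False))  # 7th day
```

Note that the predicate on its own does nothing about the concurrency issue: two instances evaluating it at the same moment would both see `sent_today` as False.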

The Problem

Essentially, we have two queries: the select that performs the checks above, and the insert into table2. The issue is that on a distributed system, every instance runs cron at the same time (or within milliseconds of the others). There is no notion of a "transaction," so every instance will send an email unless one of them manages to insert into table2 before the others run the first query.

Solutions???

I have done a fair amount of research on this, but the only potential solutions I have come up with are detailed below:

The Cron Instance

Set up a single, independent instance responsible for running cron jobs. While this will most certainly (as far as I can see) work, it is very costly for a job that is not terribly expensive and only needs to run once a day, at most.

PHP Scheduler

Set cron to regularly run a PHP script that acts as a scheduler. This was the route we were going down after the research suggested it would be the simplest for our limited time and money. The problem that I ran into was that this just seemed to shift the concurrency problem from consuming jobs to scheduling jobs. When do you schedule the jobs such that multiple jobs aren't scheduled at the same time from each instance running the cron?

This method also seems very "kludgy" (to borrow a favorite word of my friend), and I would have to agree.

Transactions

In everything I have read on this, concurrency is solved with atomic transactions on the database, but as far as I can tell, this isn't easy to achieve with LAMP. Perhaps I am wrong, and I would be very happy to be proven so.
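For illustration, one shape an atomic approach could take is to let a UNIQUE index arbitrate: every instance attempts the insert first, and only the instance whose insert succeeds sends the email. The sketch below is not from the original setup. It uses Python's built-in sqlite3 module so that it runs standalone, and the `sent_date` column, the UNIQUE index, and the names `open_db` and `claim_send` are all assumptions. On MySQL, the analogous insert against a UNIQUE index would fail with a duplicate-key error for every instance but the first.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical variant of table2."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE table2 (
            id        INTEGER PRIMARY KEY,
            table1_id INTEGER NOT NULL,
            sent_date TEXT    NOT NULL,      -- assumed: date only, no time part
            UNIQUE (table1_id, sent_date)    -- assumed: one row per table1 per day
        )
        """
    )
    return conn


def claim_send(conn: sqlite3.Connection, table1_id: int, sent_date: str) -> bool:
    """Try to claim the right to send today's email for one table1 row.

    Returns True if this caller inserted the row, False if another caller
    already had. The database enforces the uniqueness, not the application.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO table2 (table1_id, sent_date) VALUES (?, ?)",
                (table1_id, sent_date),
            )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    db = open_db()
    # Two calls stand in for two instances whose cron fired at the same time.
    for instance in ("instance-a", "instance-b"):
        if claim_send(db, 1, "2012-07-16"):
            print(instance, "sends the email")
        else:
            print(instance, "skips; already claimed")
```

The design choice here is to insert before sending rather than after, so that the check and the insert collapse into one statement and there is no window between them for another instance to slip through.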

Finally

So if anyone can help me figure this one out, I would greatly appreciate it. Perhaps my Googling skills are getting rusty, but I cannot imagine I am the only one struggling with this (probably simple) task.

asked Jul 16 '12 22:07 by Ryan


2 Answers

I had a similar problem: I also had cron jobs that had to run every minute, but on a single host only.

I solved it with this hack, which runs the Amazon Auto Scaling command-line tools to find out whether the box it runs on is the last one instantiated in its auto scaling group. This obviously assumes that you use auto scaling and that the hostname contains the instance ID.

#!/usr/bin/env ruby
# Exits 0 only on the last healthy instance listed for MY_GROUP, so that
# `lastonly && some_command` runs some_command on a single host.

AWS_AUTO_SCALING_HOME = '/opt/AutoScaling'
AWS_AUTO_SCALING_URL = 'https://autoscaling.eu-west-1.amazonaws.com'
MY_GROUP = 'Production'

# List the auto scaling instances with the command-line tools.
cmd_out = `bash -c 'AWS_AUTO_SCALING_HOME=#{ AWS_AUTO_SCALING_HOME }\
  AWS_AUTO_SCALING_URL=#{ AWS_AUTO_SCALING_URL }\
  #{ AWS_AUTO_SCALING_HOME }/bin/as-describe-auto-scaling-instances'`

raise "Output empty, should not happen!" if cmd_out.empty?
lines = cmd_out.split(/\r?\n/)
# Keep this group's lines, then take the last one that is InService and HEALTHY.
last = lines.select { |l| l.include?(MY_GROUP) }.reverse.
  detect { |l| l =~ /^INSTANCE\s+\S+\s+\S+\s+\S+\s+InService\s+HEALTHY/ }
raise "No suitable host in autoscaling group!" unless last
last_host = last.match(/^INSTANCE\s+(\S+)/)[1] # the instance ID
# This relies on the hostname containing the instance ID.
hostname = `hostname`.chomp
if hostname.include?(last_host)
  puts "It's me!"
  exit(0)
else
  puts "Someone else will do it!"
  exit(1)
end

I saved it as /usr/bin/lastonly, and then in cron jobs I do:

lastonly && do_my_stuff

Clearly it's not perfect, but it works for me, and it's simple!
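The selection rule itself (the last InService and HEALTHY line for the group wins, and a host runs the job only if its hostname contains that instance ID) can be separated from the CLI call, which makes it possible to try against canned output. The following is a port of that rule to Python purely for illustration. The function names are hypothetical, and the sample lines are made up to fit the regular expression in the script above; they are not captured from the real tool.

```python
import re

# Same pattern as the Ruby script, with the instance ID captured.
HEALTHY_LINE = re.compile(r"^INSTANCE\s+(\S+)\s+\S+\s+\S+\s+InService\s+HEALTHY")

# Made-up sample output, shaped to fit the pattern above.
SAMPLE = "\n".join([
    "INSTANCE  i-aaa111  Production  eu-west-1a  InService  HEALTHY  lc-1",
    "INSTANCE  i-bbb222  Production  eu-west-1b  InService  HEALTHY  lc-1",
    "INSTANCE  i-ccc333  Production  eu-west-1a  Pending    HEALTHY  lc-1",
    "INSTANCE  i-ddd444  Staging     eu-west-1a  InService  HEALTHY  lc-1",
])


def last_healthy_instance(cli_output, group):
    """Return the instance ID on the last healthy line mentioning `group`, or None."""
    chosen = None
    for line in cli_output.splitlines():
        match = HEALTHY_LINE.match(line)
        if group in line and match:
            chosen = match.group(1)  # later lines overwrite earlier ones
    return chosen


def should_run_here(cli_output, group, hostname):
    """True only on the host whose hostname contains the chosen instance ID."""
    instance_id = last_healthy_instance(cli_output, group)
    return instance_id is not None and instance_id in hostname


if __name__ == "__main__":
    print(last_healthy_instance(SAMPLE, "Production"))
    print(should_run_here(SAMPLE, "Production", "web-i-bbb222.internal"))
```

Because every host evaluates the same list independently, this only works as long as all hosts see the same output at the moment cron fires.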

answered Sep 28 '22 23:09 by loop


Take a look at the Gearman project: http://www.gearman.org. The basic architecture is that one machine acts as the job server and all the other machines connect to it as workers.

You can set up the crontab on the job server to submit the commands to execute to the workers connected through Gearman. You can then use PHP to slice and dice your cron jobs and get as deep into Map/Reduce as you want.

Here's a good tutorial on the concepts and how it works: http://www.lornajane.net/posts/2011/Using-Gearman-from-PHP

Don't get disheartened if working with something like Gearman doesn't click right away. Distributed cron systems can be complex, but once you get your head around them you'll be OK.

FWIW, we process thousands of cron scripts every minute amongst a Gearman worker farm on Amazon's EC2. We absolutely love it.

answered Sep 29 '22 00:09 by Michael Taggart