I am surprised I was not able to find more on this, but alas, I still cannot find the answer. We recently converted to AWS, moving our simple website to a more robust and reliable system. What is currently baffling me is how to manage cron jobs in a distributed system, when the same cron job gets pushed to every instance in the environment.
Here's the use case:
We are running a traditional LAMP stack. Probably the first problem, but it's what we got.
table1
- id int(11)
- start date
- interval int(11) (number of seconds)
table2
- id int(11)
- table1_id int(11)
- sent datetime
The goal is that a script will run once every day and check the following:
- table1.start < current date
- table1.interval > 0
- There is no entry in table2 such that table2.sent is today and table2.table1_id matches the previous checks.
If all these checks pass, we insert an entry into table2 for each table1 row that qualifies. This also means we send an email based on the data in table2.
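Concretely, the two pieces would look roughly like the sketch below (this assumes MySQL with PDO; the connection details, the reserved-word quoting and the mail step are placeholders, not our actual code):

<?php
// Sketch only: find table1 rows that are due today and have no table2 entry yet.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$sql = "
    SELECT t1.id
    FROM table1 t1
    LEFT JOIN table2 t2
           ON t2.table1_id = t1.id
          AND DATE(t2.sent) = CURDATE()
    WHERE t1.start < CURDATE()
      AND t1.`interval` > 0   -- interval is a reserved word in MySQL
      AND t2.id IS NULL       -- nothing sent today for this row
";

// Second query: record the send, then email based on the new table2 row.
$insert = $pdo->prepare('INSERT INTO table2 (table1_id, sent) VALUES (?, NOW())');
foreach ($pdo->query($sql) as $row) {
    $insert->execute([$row['id']]);
    // mail(...) using the data just inserted
}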
Essentially, we have two queries: the check and the insert. The issue is that on a distributed system, each instance will run cron at the same time (or within milliseconds of each other). There is no notion of a "transaction" across instances, so every instance will send the email if none of them gets a chance to insert into table2 before the others run the first query.
I have done a fair amount of research on this, but the only potential solutions I have come up with are detailed below:
- Set up a single, independent instance responsible for running cron jobs. While this will most certainly (as far as I can see) work, it is very costly for a job that is not terribly expensive and only needs to run once a day, at most.
- Set cron to regularly run a PHP script that acts as a scheduler. This was the route we were going down, since the research suggested it would be the simplest given our limited time and money. The problem I ran into was that this just seems to shift the concurrency problem from consuming jobs to scheduling jobs: when do you schedule the jobs such that multiple jobs aren't scheduled at the same time by each instance running cron? This method also seems very "kludgy" (to borrow a favorite word of my friend), and I would have to agree.
In everything I have researched, concurrency is solved with atomic transactions on the database, but as far as I can tell this isn't easy to achieve on a LAMP stack. Perhaps I am wrong, though, and I would be very happy to be proven so.
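For illustration, the sort of database-level gate I have in mind would look roughly like the sketch below, using MySQL's named locks (the lock name, the zero timeout and the job function are placeholders; I have not convinced myself this is robust across instances):

<?php
// Sketch only: whichever instance acquires the named lock first runs the job.
$pdo = new PDO('mysql:host=db.example.com;dbname=app', 'user', 'pass');

$got = $pdo->query("SELECT GET_LOCK('daily_mailer', 0)")->fetchColumn();
if ($got == 1) {
    // run_daily_job() is a hypothetical function that runs the two queries
    // and sends the emails. Combined with the table2 check in the first
    // query, a late winner would find nothing left to do.
    run_daily_job($pdo);
    $pdo->query("SELECT RELEASE_LOCK('daily_mailer')");
} else {
    // Another instance already holds the lock; this one does nothing.
}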
So if anyone can help me figure this one out, I would greatly appreciate it. Perhaps my Googling skills are getting rusty, but I cannot imagine I am the only one struggling with this (probably simple) task.
cron is a Chef resource that represents a cron job. When AWS OpsWorks Stacks runs the recipe on an instance, the associated provider handles the details of setting up the job. job_name is a user-defined name for the cron job, such as weekly report; hour, minute, and weekday specify when the commands should run.
Dkron is a system service for workload automation that runs scheduled jobs, just like the unix cron service but distributed across several machines in a cluster. It is the only job scheduler on the market with truly no SPOF (single point of failure). It is open source and available for free.
I had a similar problem: I also had cron jobs that had to run every minute, but on a single host only.
I solved it with this hack, which runs the amazon autoscaling tools to find out if the box on which it runs is the last one instantiated in this auto scaling group. This obviously assumes you use autoscaling, and that the hostname contains the instance ID.
#!/usr/bin/env ruby
# Exits 0 if this box is the "last" healthy, in-service instance of the
# autoscaling group (and should therefore run the cron job), 1 otherwise.
AWS_AUTO_SCALING_HOME = '/opt/AutoScaling'
AWS_AUTO_SCALING_URL  = 'https://autoscaling.eu-west-1.amazonaws.com'
MY_GROUP = 'Production'

# List all autoscaling instances (note the spaces before the line
# continuations, so the env vars and the command stay separate words).
@cmd_out = `bash -c 'AWS_AUTO_SCALING_HOME=#{AWS_AUTO_SCALING_HOME} \
AWS_AUTO_SCALING_URL=#{AWS_AUTO_SCALING_URL} \
#{AWS_AUTO_SCALING_HOME}/bin/as-describe-auto-scaling-instances'`
raise "Output empty, should not happen!" if @cmd_out.empty?

# Pick the last in-service, healthy instance of our group.
@lines = @cmd_out.split(/\r?\n/)
@last  = @lines.select { |l| l.match MY_GROUP }.reverse.
         detect { |l| l =~ /^INSTANCE\s+\S+\s+\S+\s+\S+\s+InService\s+HEALTHY/ }
raise "No suitable host in autoscaling group!" unless @last

# The second column of the INSTANCE line is the instance ID.
@last_host = @last.match(/^INSTANCE\s+(\S+)/)[1]
@hostname  = `hostname`

if @hostname.index(@last_host)
  puts "It's me!"
  exit(0)
else
  puts "Someone else will do it!"
  exit(1)
end
I saved it as /usr/bin/lastonly, and then in my cron jobs I do:
lastonly && do_my_stuff
Clearly it's not perfect, but it works for me, and it's simple!
Take a look at the Gearman project: http://www.gearman.org. The basic architecture is that you'll have one machine acting as the job server, and all the other machines become clients of that server.
You can set up the crontab on the job server to send commands for execution to all of the clients connected through Gearman. You can then use PHP to slice and dice your cron jobs and get as deep into Map/Reduce as you want.
Here's a good tutorial on the concepts and how it works: http://www.lornajane.net/posts/2011/Using-Gearman-from-PHP
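To make that concrete, a bare-bones worker/client pair with the PHP Gearman extension looks roughly like this (the function name, job-server host and payload are just examples, not anything prescribed by Gearman):

<?php
// worker.php - runs on each app instance, waits for jobs from the job server
$worker = new GearmanWorker();
$worker->addServer('gearman.internal');   // example job-server host
$worker->addFunction('daily_mailer', function ($job) {
    // Run the table1/table2 queries and send the emails here.
    return 'done';
});
while ($worker->work());

<?php
// enqueue.php - called by cron on the single job server
$client = new GearmanClient();
$client->addServer('gearman.internal');
$client->doBackground('daily_mailer', date('Y-m-d'));   // payload is just an example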
Don't get disheartened about working with something like Gearman right away. Distributed cron systems can be complex, but once you get your head around it you'll be ok.
FWIW, we process thousands of cron scripts every minute amongst a Gearman worker farm on Amazon's EC2. We absolutely love it.