
Running a cron job in Elastic Beanstalk

So I have a functionality in a Django Elastic Beanstalk app that works like so:

  • Download a file
  • Parse the file and make some calls to APIs with the data from the file
  • Update the database of the EB instance with the new data

In testing, I just set up a local cron job that calls wget on a specific URL of my Django application, which runs the command.

My problem is how to handle this in a multi-instance Elastic Beanstalk application. Only one instance of my EB application should run this command. I want to avoid race conditions on the database and redundant calls to external APIs from multiple instances, i.e. only one instance should be writing to the database.

However, Googling around shows that setting up cron jobs is awkward, particularly if you're new to EB like I am. The most promising method seems to be the cron.yaml method, but as far as I can see there is no example of setting up a cron worker environment anywhere on the web.

My understanding is:

  • You include a cron.yaml file in the root of your EB project.
  • Deploy the project
  • The cron jobs are automatically set up in a worker environment (?).
  • The command you defined is run at the specified time(s).

My question is: how do you make sure that only one instance will run this command? Do I have the right idea of how cron.yaml works, or is there something I'm missing?

GreenGodot asked Jul 22 '16 09:07



1 Answer

Only one instance will run the command, because the cron job does not actually run in a cron daemon per se.

There are a few concepts that might help you quickly grok Amazon's Elastic Beanstalk mindset.

  • An Elastic Beanstalk environment must elect a leader instance, of which there must only ever be one (and it must be a healthy instance).
  • A worker environment allocates work via an SQS (Simple Queue Service) queue.
  • Once a message has been read from the queue it is considered "in flight" until the worker returns 200 or the request times out or fails. In the first case the message is deleted; in the latter it re-enters the queue. (A redrive policy determines how many times a message can fail before it is sent to the dead-letter queue.)
  • In-flight messages cannot be read again (unless they are returned to the queue).

At any given time, a message in the queue is picked up by only one of the instances in the worker environment.
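
To make the in-flight behaviour concrete, here is a minimal in-memory sketch in Python. It is not how SQS is implemented and it does not talk to AWS; the ToyQueue class, its method names and the max_receives parameter are all made up for illustration. It only mimics the rules listed above: a received message is hidden from other readers, a success deletes it, and a failure requeues it until it has been received too many times.

```python
from collections import deque


class ToyQueue:
    """In-memory stand-in for a queue with a redrive policy (illustration only)."""

    def __init__(self, max_receives=3):
        self.visible = deque()   # messages waiting to be picked up
        self.in_flight = {}      # message id -> (body, receive count)
        self.dead_letter = []    # bodies of messages that failed too many times
        self.max_receives = max_receives
        self._next_id = 0

    def send(self, body):
        self.visible.append((self._next_id, body, 0))
        self._next_id += 1

    def receive(self):
        """Hand one message to a single worker; it is hidden from the others."""
        if not self.visible:
            return None
        msg_id, body, count = self.visible.popleft()
        self.in_flight[msg_id] = (body, count + 1)
        return msg_id, body

    def ack(self, msg_id):
        """The worker returned 200, so the message is deleted."""
        del self.in_flight[msg_id]

    def fail(self, msg_id):
        """The worker failed or timed out: requeue, or dead-letter after too many tries."""
        body, count = self.in_flight.pop(msg_id)
        if count >= self.max_receives:
            self.dead_letter.append(body)
        else:
            self.visible.append((msg_id, body, count))


queue = ToyQueue(max_receives=2)
queue.send("pollfacebook")

first = queue.receive()    # instance A picks the message up
second = queue.receive()   # instance B finds nothing: the message is in flight
queue.fail(first[0])       # A fails, so the message becomes visible again
retry = queue.receive()    # some instance picks it up a second time
queue.fail(retry[0])       # the second failure reaches max_receives
```

After this run the message sits in queue.dead_letter, and at no point did two workers hold it at the same time, which is the property that keeps a scheduled job from running on several instances at once.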

Now the cron.yaml file actually just tells the leader to create a message in the queue, with special attributes, at the times specified in the schedule. When the worker daemon on an instance then picks up this message, it is dispatched to that one instance only, as a POST request to the specified URL.

When I use Django in a worker environment, I create a cron app with views that map to the actions I want. For example, if I wanted to periodically poll a Facebook endpoint, I might have a path /cron/facebook/poll/ which calls a poll_facebook() function in views.py.
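
As a rough sketch of what such a view can delegate to, the framework-free helper below decides how to answer a request from the worker daemon. It relies on the X-Aws-Sqsd-Taskname header, which the AWS worker environment documentation says the daemon sets to the periodic task's name. Everything else is an assumption for illustration: the TASKS registry, the handle_cron_request() name and the placeholder body of poll_facebook().

```python
# Header the worker daemon (sqsd) attaches to periodic-task requests,
# according to the AWS worker environment documentation.
TASK_HEADER = "X-Aws-Sqsd-Taskname"


def poll_facebook():
    """Placeholder for the real polling work (hypothetical)."""
    return "polled"


# Hypothetical registry mapping cron.yaml task names to callables.
TASKS = {"pollfacebook": poll_facebook}


def handle_cron_request(method, headers):
    """Return (status, body) for a request that claims to come from the daemon."""
    if method != "POST":
        return 405, "method not allowed"
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    task_name = normalised.get(TASK_HEADER.lower())
    task = TASKS.get(task_name)
    if task is None:
        return 403, "unknown or missing task name"
    # Only a 200 response marks the message as done; anything else is retried.
    return 200, task()


status, body = handle_cron_request("POST", {"X-Aws-Sqsd-Taskname": "pollfacebook"})
print(status, body)
```

In a Django view you could pass request.method and request.headers to a helper like this and wrap the result in an HttpResponse. The view would most likely also need to be CSRF-exempt, since the daemon's POST carries no CSRF token.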

That way if I have a cron.yaml as follows, it'll poll Facebook once every hour:

version: 1
cron:
 - name: "pollfacebook"
   url: "/cron/facebook/poll/"
   schedule: "0 * * * *"
Anthony Manning-Franklin answered Sep 21 '22 12:09
