Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tasks, Cron jobs or Backends for an app

I'm trying to construct a non-trivial GAE app and I'm not sure if a cron job, tasks, backends or a mix of all is what I need to use based on the request time-out limit that GAE has for HTTP requests.

The distinct steps I need to do are:

1) I have upwards of 15,000 sites I need to pull data from at a regular schedule and without any user interaction. The total number of sites isn't going to static but they're all saved in the datastore [Table0] along side the interval at which they're read at. The interval may vary as regular as every day to every 30 days.

2) For each site from step #1 that fits the "pull" schedule criteria, I need to fetch data from it via HTTP GET (again, it might be all of them or as few as 2 or 3 sites). Once I get the response back from the site, parse the result and save this data into the datastore as [Table1].

3) For all of the data that was recently put into the datastore in [Table1] (they'll have a special flag), I need to issue additional HTTP request to a 3rd party site to do some additional processing. As soon as I receive data from this site, I store all of the relevant info into another table [Table2] in the datastore.

4) As soon as data is available and ready from step #3, I need to take all of it and perform some additional transformation and update the original table [Table1] in the datastore.

I'm not certain which of the different components I need to use to ensure that I can complete each piece of the work without exceeding the response deadline that's placed on the web requests of GAE. For requests initiated by cron jobs and tasks, I believe you're allowed 10 mins to complete it, whereas typical user-driven requests are allowed 30 seconds.

like image 250
Dan Holman Avatar asked Aug 30 '11 02:08

Dan Holman


People also ask

What are cron jobs good for?

cron is a Linux utility that schedules a command or script on your server to run automatically at a specified time and date. A cron job is the scheduled task itself. Cron jobs can be very useful to automate repetitive tasks.

Is cron job and background job same?

For that purpose, as a system administrator, you can define background jobs (for example, database clean-ups) which are executed without any need for user interaction. These jobs are typically referred to as cron jobs.

Are cron jobs reliable?

Several aspects of the cron service are notable from a reliability perspective: Cron's failure domain is essentially just one machine. If the machine is not running, neither the cron scheduler nor the jobs it launches can run.

What is the function of scheduler in App Engine?

The App Engine Cron Service allows you to configure regularly scheduled tasks that operate at defined times or regular intervals. These tasks are commonly known as cron jobs. These cron jobs are automatically triggered by the App Engine Cron Service.


2 Answers

Task queues are the best way to do this in general, but you might want to check out the App Engine Pipeline API, which is designed for exactly the sort of workflow you're talking about.

like image 178
Nick Johnson Avatar answered Oct 16 '22 15:10

Nick Johnson


GAE is a tough platform for your use-case. But, out of extreme masochism, I am attempting something similar. So here are my two cents, based on my experience so far:

  1. Backends -- Use them for any long-running, I/O intensive tasks you may have (Web-Crawling is a good example, assuming you can defer compute-intensive processing for later).
  2. Mapreduce API -- excellent for compute-intensive/parallel jobs such as stats collection, indexing etc. Until recently, this library only had a mapper implementation, but recently Google also released an in-memory Shuffler that is good for jobs that fit in about 100MB.
  3. Task Queues -- For when everything else fails :-).
  4. Cron -- mostly to kick off periodic tasks -- which context you execute them in, is up to you.

It might be a good idea to design your backend tasks so that they can be scheduled (manually, or perhaps by querying your current quota usage) in the "Frontend" context using task queues, if you have spare Frontend CPU cycles.

like image 38
tooringtest Avatar answered Oct 16 '22 17:10

tooringtest