
How long can a php cron job run for / am I doing it right?

Tags: php, cron

I have created a PHP/MySQL scraper, which is running fine, but I have no idea how to run it most efficiently as a cron job.

There are 300 sites, each with between 20 and 200 pages to scrape. It takes between 4 and 7 hours to scrape all the sites (depending on network latency and other factors). The scraper needs to do a complete run once daily.

Should I run this as one cron job that runs for the entire 4 to 7 hours, run it once an hour 7 times, or run it every 10 minutes until complete?

The script is set up to run from the cron like this:

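$starttime = time();                  // record when this run began
while ($starttime + 600 > time()) {   // keep starting batches for 600 seconds
    do_scrape();                      // scrapes 10 URLs per call
}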

This runs the do_scrape() function, which scrapes 10 URLs at a time, until (in this case) 600 seconds have passed. A single do_scrape() call can take between 5 and 60 seconds to run.
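One detail of this loop is worth spelling out: because a single do_scrape() call can take up to 60 seconds, a batch started just before the 600-second mark can overshoot the window and overlap with the next cron invocation. Below is a minimal sketch of a budgeted loop that only starts a batch if a worst-case batch would still finish in time. The helper name run_with_budget(), the convention that the batch callback returns false when nothing is left, and the injectable clock are all assumptions made for illustration, not part of the original script.

```php
<?php
/**
 * Hypothetical helper: call $doBatch until the time budget is nearly spent
 * or $doBatch returns false (taken here to mean "no URLs left to scrape").
 * Returns the number of batches that did work.
 */
function run_with_budget(callable $doBatch, $budgetSeconds, $worstCaseSeconds, $clock = 'time')
{
    $deadline = call_user_func($clock) + $budgetSeconds;
    $batches = 0;

    // Start a batch only if a worst-case batch would still finish before the deadline.
    while (call_user_func($clock) + $worstCaseSeconds <= $deadline) {
        if (call_user_func($doBatch) === false) {
            break; // nothing left to do, stop early
        }
        $batches++;
    }

    return $batches;
}
```

With the numbers from the question this would be called as run_with_budget('do_scrape', 600, 60), assuming do_scrape() is changed to return false once every URL has been scraped for the day.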

I am asking here because I can't find any information on the web about how to run this, and I am wary about running it daily, as PHP isn't really designed to run a single script for 7 hours.

I wrote it in vanilla PHP/MySQL, and it is running on a cut-down Debian VPS with only lighttpd, MySQL and PHP 5 installed. I have run it with a timeout of 6000 seconds (100 minutes) without any issue (the server didn't fall over).

Any advice on how to go about this task is appreciated. What should I be watching out for? Or am I going about executing this all wrong?

Thanks!

Michael asked Sep 29 '11 03:09


2 Answers

There's nothing wrong with running a well-written PHP script for long periods. I have some scripts that have literally been running continuously for months. Just watch your memory usage, and you should be fine.
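As a rough illustration of what "watching your memory usage" could look like in a long-running script, here is a minimal sketch built on PHP's memory_get_usage() and memory_get_peak_usage(). The helper names and the idea of a self-imposed ceiling are assumptions for this example, not something described in the answer.

```php
<?php
/**
 * Hypothetical helper: true while allocated memory is below $ceilingBytes.
 */
function under_memory_ceiling($ceilingBytes)
{
    // Passing true reports memory allocated from the system, not only what is in use.
    return memory_get_usage(true) < $ceilingBytes;
}

/**
 * Hypothetical helper: one-line memory summary suitable for a log file.
 */
function memory_report()
{
    return sprintf(
        'mem=%.1fMB peak=%.1fMB',
        memory_get_usage(true) / 1048576,
        memory_get_peak_usage(true) / 1048576
    );
}
```

One way to use this is to log memory_report() after every batch and exit cleanly when under_memory_ceiling() returns false, so that a steadily growing figure (usually a sign of data accumulating in arrays or static caches) is noticed before the process hits memory_limit.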

That said, your architecture is pretty basic, and is unlikely to scale very well.

You might consider moving from a big monolithic script to a divide-and-conquer strategy. For instance, it sounds like your script is making synchronous requests for every URL it scrapes. If that's true, then most of that 7-hour run time is spent idly waiting for a response from some remote server.
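To illustrate the cost of synchronous requests, the sketch below fetches a batch of URLs concurrently inside one PHP process using the curl_multi functions, so the waits overlap instead of adding up. Note that this is a different technique from the child-process design described in this answer; it assumes the curl extension is installed, and fetch_all() is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: fetch several URLs concurrently with curl_multi.
 * Returns an array of url => body as reported by curl_multi_getcontent().
 * Failed transfers typically yield an empty value; curl_multi_info_read()
 * would be needed for detailed error reporting.
 */
function fetch_all(array $urls, $timeoutSeconds = 30)
{
    $multi = curl_multi_init();
    $handles = array();

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
            CURLOPT_FOLLOWLOCATION => true,            // follow redirects
            CURLOPT_TIMEOUT        => $timeoutSeconds, // cap each transfer
        ));
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers at once; curl_multi_select() waits for socket activity.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(100000); // select can fail on some systems; avoid a busy loop
        }
    } while ($running > 0 && $status === CURLM_OK);

    $bodies = array();
    foreach ($handles as $url => $ch) {
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }

    // The handles are released when they go out of scope.
    return $bodies;
}
```

With 10 URLs per batch, as in the question, the batch would take roughly as long as its slowest response rather than the sum of all ten.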

In an ideal world, you wouldn't write this kind of thing in PHP. A language that handles threads and can easily do asynchronous HTTP requests with callbacks would be much better suited.

That said, if I were doing this in PHP, I'd be aiming at having a script that kicks off N children who grab data from URLs and stick the response data in some kind of work queue, and then another script that pretty much runs all the time, processing any work it finds in the queue.

Then you just cron your fetcher-script-manager to run once an hour. It manages some worker processes that fetch the data (in parallel, so latency doesn't kill you) and stick the work on the queue. Then the queue-cruncher sees the work on the queue and crunches it.

Depending on how you implement the queue, this could scale pretty well. You could have multiple boxes fetching remote data, and sticking it on some central queue box (with a queue implemented in mysql, or memcache, or whatever). You could even conceivably have multiple boxes taking work from the queue and doing the work.
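A queue implemented in MySQL could be sketched roughly as follows, using PDO. Everything named here is an assumption for illustration: a work_queue table with an auto-increment id plus url, body and status columns, a PDO connection configured with PDO::ERRMODE_EXCEPTION, and the helper names enqueue() and claim_next(). The conditional UPDATE is what lets several workers share the queue without processing the same row twice.

```php
<?php
/**
 * Hypothetical helper: store a fetched page for later processing.
 */
function enqueue(PDO $db, $url, $body)
{
    $stmt = $db->prepare("INSERT INTO work_queue (url, body, status) VALUES (?, ?, 'pending')");
    $stmt->execute(array($url, $body));
}

/**
 * Hypothetical helper: claim the oldest pending job, or return null if none.
 * Assumes PDO::ERRMODE_EXCEPTION so failed queries throw instead of returning false.
 */
function claim_next(PDO $db)
{
    while (true) {
        $row = $db->query(
            "SELECT id, url, body FROM work_queue WHERE status = 'pending' ORDER BY id LIMIT 1"
        )->fetch(PDO::FETCH_ASSOC);

        if ($row === false) {
            return null; // queue is empty
        }

        // Only one worker can flip the row from 'pending' to 'working'.
        $claim = $db->prepare("UPDATE work_queue SET status = 'working' WHERE id = ? AND status = 'pending'");
        $claim->execute(array($row['id']));

        if ($claim->rowCount() === 1) {
            return $row; // this worker won the race
        }
        // Another worker claimed the row first; look for the next pending one.
    }
}
```

The fetchers would call enqueue() for each response, and the queue-cruncher would loop on claim_next(), sleeping briefly whenever it returns null.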

Of course, the devil is in the details, but this design is generally more scalable and usually more robust than a single-threaded fetch-process-repeat script.

timdev answered Oct 17 '22 06:10


You shouldn't have a problem running it once a day to completion. That's the way I would do it. Timeouts are a big issue if PHP is being served through a web server, but since you are running the script directly with the php executable, that is not a concern here. I would advise you to use Python or something else that is more task-friendly, though.
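To make the web-server versus command-line distinction concrete: on the CLI, PHP's max_execution_time defaults to 0 (no limit), whereas a web request is normally subject to a time limit. A small guard at the top of the script can make sure it is only ever run from the command line. The function name require_cli() is hypothetical.

```php
<?php
/**
 * Hypothetical guard: refuse to run outside the command-line SAPI,
 * where web-server request timeouts would apply.
 */
function require_cli($sapi = PHP_SAPI)
{
    if ($sapi !== 'cli') {
        throw new RuntimeException("Run this script with the php CLI, not via $sapi");
    }
}
```

Calling require_cli() followed by set_time_limit(0) at the start of the scraper states the intent explicitly, even though the second call is usually redundant on the CLI.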

Matt Williamson answered Oct 17 '22 07:10