
How to set Robots.txt or Apache to allow crawlers only at certain hours?

As traffic is distributed unevenly over 24 hours, I would like to disallow crawlers during peak hours and allow them during off-peak hours.

Is there a method to achieve this?

Edit: thanks for all the good advice.

Here is another solution we found.

2bits.com has an article on setting up an iptables firewall to limit the number of connections from certain IP addresses:

the article

The iptables setting:

  • Using connlimit

In newer Linux kernels, there is a connlimit module for iptables. It can be used like this:

iptables -I INPUT -p tcp -m connlimit --connlimit-above 5 -j REJECT

This limits connections from each IP address to no more than 5 simultaneous connections. This effectively "rations" connections and prevents crawlers from hitting the site all at once.
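Since the original goal is to limit crawlers only at certain hours, one possible refinement (a sketch, not from the 2bits.com article) is to combine connlimit with iptables' time match so the cap applies only during assumed peak hours. The 09:00-18:00 window, the port 80 restriction, and the 5-connection cap below are illustrative assumptions; note that --timestart/--timestop are interpreted as UTC unless --kerneltz is added:

# Cap simultaneous connections per IP only during (assumed) peak hours
iptables -I INPUT -p tcp --dport 80 -m time --timestart 09:00 --timestop 18:00 -m connlimit --connlimit-above 5 -j REJECT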

asked Jan 18 '11 by Joel Box


3 Answers

You cannot determine what time the crawlers do their work; however, with Crawl-delay you may be able to reduce the frequency with which they request pages. This can be useful to prevent them from requesting pages in rapid succession.

For example:

User-agent: *
Crawl-delay: 5
answered Sep 28 '22 by UnkwnTech


You can't control that in the robots.txt file. It's possible that some crawlers might support something like that, but none of the big ones do (as far as I know).

Dynamically changing the robots.txt file is also a bad idea in a case like this. Most crawlers cache the robots.txt file for a certain time, and continue using it until they refresh the cache. If they cache it at the "right" time, they might crawl normally all day. If they cache it at the "wrong" time, they would stop crawling altogether (and perhaps even remove indexed URLs from their index). For instance, Google generally caches the robots.txt file for a day, meaning that changes during the course of a day would not be visible to Googlebot.

If crawling is causing too much load on your server, you can sometimes adjust the crawl rate for individual crawlers. For instance, for Googlebot you can do this in Google Webmaster Tools.

Additionally, when crawlers attempt to crawl during times of high load, you can always just serve them a 503 HTTP result code. This tells crawlers to check back at some later time (you can also specify a Retry-After HTTP header if you know when they should come back). While I'd try to avoid doing this strictly on a time-of-day basis (this can block many other features, such as Sitemaps, contextual ads, or website verification, and can slow down crawling in general), in exceptional cases it might make sense to do that. For the long run, I'd strongly recommend only doing this when your server load is really much too high to successfully return content to crawlers.
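As a rough sketch of how the 503 approach might look (assuming Apache 2.4 with mod_rewrite and mod_headers; the flag file path, bot list, and retry interval are illustrative, not part of this answer):

# Return 503 to common crawlers whenever an operator-created flag file exists
RewriteEngine On
RewriteCond /var/run/high-load.flag -f
RewriteCond %{HTTP_USER_AGENT} (Googlebot|bingbot|Baiduspider) [NC]
RewriteRule ^ - [R=503,L]

# Suggest when to come back (in seconds); attached only to 503 responses
Header always set Retry-After "3600" "expr=%{REQUEST_STATUS} == 503"

An operator (or a load-monitoring cron job) would create and remove /var/run/high-load.flag to switch the behavior on and off.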

answered Sep 28 '22 by John Mueller


This is not possible using robots.txt syntax: the feature simply isn't there.

You might be able to influence crawlers by actually altering the robots.txt file depending on the time of day. I expect Google will check the file immediately before crawling, for example. But obviously, there is a huge danger of scaring crawlers away for good that way; that risk is probably more problematic than whatever load you are dealing with right now.
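If you did want to experiment with that despite the caching caveat described above, a minimal sketch is a pair of cron entries that swap pre-written files into place; the paths and hours here are made up for illustration:

# /etc/cron.d/robots-swap (illustrative paths and hours)
0 9 * * * root cp /var/www/robots.peak.txt /var/www/html/robots.txt
0 18 * * * root cp /var/www/robots.offpeak.txt /var/www/html/robots.txt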

answered Sep 28 '22 by Pekka