Below is a sample robots.txt file that allows multiple user agents, with a separate crawl delay for each user agent. The Crawl-delay values are for illustration only and would be different in a real robots.txt file.
I have searched all over the web for a proper answer but could not find one. There are too many conflicting suggestions, and I do not know which method is correct.
Questions:
(1) Can each user agent have its own crawl-delay? (I assume yes)
(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow line?
(3) Does there have to be a blank line between each user-agent group?
References:
http://www.seopt.com/2013/01/robots-text-file/
http://help.yandex.com/webmaster/?id=1113851#1113858
Essentially, I am looking to find out how the final robots.txt file should look using the values in the sample below.
Thanks in advance.
# Allow only major search spiders
User-agent: Mediapartners-Google
Disallow:
Crawl-delay: 11
User-agent: Googlebot
Disallow:
Crawl-delay: 12
User-agent: Adsbot-Google
Disallow:
Crawl-delay: 13
User-agent: Googlebot-Image
Disallow:
Crawl-delay: 14
User-agent: Googlebot-Mobile
Disallow:
Crawl-delay: 15
User-agent: MSNBot
Disallow:
Crawl-delay: 16
User-agent: bingbot
Disallow:
Crawl-delay: 17
User-agent: Slurp
Disallow:
Crawl-delay: 18
User-agent: Yahoo! Slurp
Disallow:
Crawl-delay: 19
# Block all other spiders
User-agent: *
Disallow: /
# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/
(4) If I want to set all of the user agents to have a crawl delay of 10 seconds, would the following be correct?
# Allow only major search spiders
User-agent: *
Crawl-delay: 10
User-agent: Mediapartners-Google
Disallow:
User-agent: Googlebot
Disallow:
User-agent: Adsbot-Google
Disallow:
User-agent: Googlebot-Image
Disallow:
User-agent: Googlebot-Mobile
Disallow:
User-agent: MSNBot
Disallow:
User-agent: bingbot
Disallow:
User-agent: Slurp
Disallow:
User-agent: Yahoo! Slurp
Disallow:
# Block all other spiders
User-agent: *
Disallow: /
# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/
Crawl delay
A robots.txt file may specify a "crawl delay" directive for one or more user agents, which tells a bot how quickly it can request pages from a website. For example, a crawl delay of 10 specifies that a crawler should not request a new page more than once every 10 seconds.
Do take care when using the crawl-delay directive. By setting a crawl delay of ten seconds, you only allow these search engines to access 8,640 pages a day (86,400 seconds in a day divided by 10). That might seem like plenty for a small site, but it isn't very much on large sites.
Google doesn't support the crawl-delay directive, so its crawlers will simply ignore it. To limit Google's crawl rate, log into the old Google Search Console, choose the website you want to define the crawl rate for, and adjust the single available setting: Crawl rate, a slider where you can set the preferred rate.
A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page.
(1) Can each user agent have its own crawl-delay?
Yes. Each record, started by one or more User-agent lines, can have a Crawl-delay line. Note that Crawl-delay is not part of the original robots.txt specification, but it's no problem to include it for those parsers that understand it, as the spec states:
Unrecognised headers are ignored.
So older robots.txt parsers will simply ignore your Crawl-delay lines.
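For example, several user agents can even share one record (and therefore one Crawl-delay). A minimal sketch, with illustrative bot names and values:

User-agent: Googlebot
User-agent: bingbot
Disallow:
Crawl-delay: 12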
(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow line?
Doesn’t matter.
(3) Does there have to be a blank line between each user-agent group?
Yes. Records have to be separated by one or more blank lines. See the original spec:
The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL).
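In other words, each user-agent group ends where the next blank line begins. A minimal sketch (names and values are illustrative):

User-agent: Googlebot
Disallow:
Crawl-delay: 12

User-agent: *
Disallow: /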
(4) If I want to set all of the user agents to have a crawl delay of 10 seconds, would the following be correct?
No. Bots look for the record that matches their user agent. Only if they don't find a matching record will they fall back to the User-agent: * record. So in your example all the listed bots (like Googlebot, MSNBot, Yahoo! Slurp, etc.) will have no Crawl-delay.
Also note that you can't have several records with User-agent: *:
If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
So parsers might look (if no other record matched) for the first record with User-agent: * and ignore the following ones. For your first example, that would mean that URLs beginning with /ads/, /cgi-bin/, and /scripts/ are not blocked.
And even if you have only one record with User-agent: *, those Disallow lines only apply to bots that have no other matching record! As your comment # Block Directories for all spiders suggests, you want these URL paths to be blocked for all spiders, so you'd have to repeat the Disallow lines for every record.
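Putting that together, a corrected version of your first example might look like the sketch below. Only a few of the bots are shown (the others would follow the same pattern), and the Crawl-delay values are still placeholders:

# Allow only major search spiders, each with its own crawl delay
User-agent: Mediapartners-Google
Crawl-delay: 11
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

User-agent: Googlebot
Crawl-delay: 12
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

User-agent: bingbot
Crawl-delay: 17
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

# Block all other spiders
User-agent: *
Disallow: /

The directory Disallow lines are omitted from the User-agent: * record because Disallow: / already blocks everything for those bots.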