Robots.txt - What is the proper format for a Crawl Delay for multiple user agents?

Below is a sample robots.txt file that allows multiple user agents, with a separate crawl delay for each one. The Crawl-delay values are for illustration only and would be different in a real robots.txt file.

I have searched all over the web but could not find a proper answer. There are too many mixed suggestions and I do not know which method is correct.

Questions:

(1) Can each user agent have its own crawl-delay? (I assume yes.)

(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow line?

(3) Does there have to be a blank line between each user-agent group?

References:

http://www.seopt.com/2013/01/robots-text-file/

http://help.yandex.com/webmaster/?id=1113851#1113858

Essentially, I am looking to find out how the final robots.txt file should look using the values in the sample below.

Thanks in advance.

# Allow only major search spiders    
User-agent: Mediapartners-Google
Disallow:
Crawl-delay: 11

User-agent: Googlebot
Disallow:
Crawl-delay: 12

User-agent: Adsbot-Google
Disallow:
Crawl-delay: 13

User-agent: Googlebot-Image
Disallow:
Crawl-delay: 14

User-agent: Googlebot-Mobile
Disallow:
Crawl-delay: 15

User-agent: MSNBot
Disallow:
Crawl-delay: 16

User-agent: bingbot
Disallow:
Crawl-delay: 17

User-agent: Slurp
Disallow:
Crawl-delay: 18

User-agent: Yahoo! Slurp
Disallow:
Crawl-delay: 19

# Block all other spiders
User-agent: *
Disallow: /

# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

(4) If I want to set all of the user agents to have a crawl delay of 10 seconds, would the following be correct?

# Allow only major search spiders
User-agent: *
Crawl-delay: 10

User-agent: Mediapartners-Google
Disallow:

User-agent: Googlebot
Disallow:

User-agent: Adsbot-Google
Disallow:

User-agent: Googlebot-Image
Disallow:

User-agent: Googlebot-Mobile
Disallow:

User-agent: MSNBot
Disallow:

User-agent: bingbot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Yahoo! Slurp
Disallow:

# Block all other spiders
User-agent: *
Disallow: /

# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

Asked Jun 29 '13 by Sammy


People also ask

What is crawl delay in robots.txt?

A robots.txt file may specify a “crawl-delay” directive for one or more user agents, which tells a bot how quickly it may request pages from a website. For example, a crawl delay of 10 specifies that a crawler should not request a new page more than once every 10 seconds.
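
For instance, a minimal record that applies such a 10-second delay to a single bot (Googlebot is used here purely as an example) might look like this:

User-agent: Googlebot
Disallow:
Crawl-delay: 10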

What is a good crawl delay?

Take care when using the crawl-delay directive. With a crawl delay of ten seconds, a search engine can request at most 8,640 pages a day (86,400 seconds in a day divided by 10). That might seem plenty for a small site, but it is not very much for a large one.

How do you implement a crawl delay?

Google doesn't support the crawl-delay directive, so its crawlers will simply ignore it. To limit Google's crawl rate instead, log in to the old Google Search Console and choose the website you want to define the crawl rate for. There is only one setting you can tweak: Crawl rate, a slider where you set the preferred rate.

What is a robots.txt file and what is it used for?

A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page.


1 Answer

(1) Can each user agent have its own crawl-delay?

Yes. Each record, started by one or more User-agent lines, can have a Crawl-delay line. Note that Crawl-delay is not part of the original robots.txt specification, but it's no problem to include it for the parsers that understand it, as the spec states:

Unrecognised headers are ignored.

So older robots.txt parsers will simply ignore your Crawl-delay lines.
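
As a side note, because a record can start with more than one User-agent line, several bots can also share a single delay. A minimal sketch (the bot names and the value are placeholders taken from your sample):

User-agent: Mediapartners-Google
User-agent: Adsbot-Google
Disallow:
Crawl-delay: 11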


(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow line?

It doesn't matter; the Crawl-delay line can come before or after the Disallow lines within a record.
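
For example, this variant of one of your records (Crawl-delay placed before the Disallow line) is read the same way as the ordering in your sample:

User-agent: Googlebot
Crawl-delay: 12
Disallow: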


(3) Does there have to be a blank line between each user-agent group?

Yes. Records have to be separated by one or more blank lines. See the original spec:

The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL).


(4) If I want to set all of the user agents to have a crawl delay of 10 seconds, would the following be correct?

No. Bots look for the record that matches their user agent; only if they don't find one do they fall back to the User-agent: * record. So in your example all the listed bots (Googlebot, MSNBot, Yahoo! Slurp, etc.) would have no Crawl-delay at all; if you want each of them to get the 10-second delay, you have to repeat the Crawl-delay line in each record (see the sketch at the end of this answer).


Also note that you can’t have several records with User-agent: *:

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

So a parser that matches no other record might use only the first record with User-agent: * and ignore the following ones. For your first example that would mean that the URLs beginning with /ads/, /cgi-bin/ and /scripts/ are not blocked.

And even if you had only one record with User-agent: *, those Disallow lines would apply only to bots that match no other record! As your comment # Block Directories for all spiders suggests, you want these URL paths to be blocked for all spiders, so you'd have to repeat the Disallow lines in every record.
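
Putting that together, a sketch of how the final robots.txt could look, using the 10-second delay from your question, repeating both the Crawl-delay line and the directory Disallow lines in every record, and keeping a single User-agent: * record (only two of the named bots are shown; the remaining ones would follow the same pattern):

# Allow only major search spiders, each with a 10 second crawl delay
User-agent: Mediapartners-Google
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/
Crawl-delay: 10

User-agent: Googlebot
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/
Crawl-delay: 10

# ... one such record for each of the other named bots ...

# Block all other spiders
User-agent: *
Disallow: /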

Answered Sep 28 '22 by unor