Can I use the “Host” directive in robots.txt?

Tags: seo, robots.txt

While searching for specific information on robots.txt, I stumbled upon a Yandex help page‡ on this topic. It suggests that I could use the Host directive to tell crawlers my preferred mirror domain:

User-Agent: *
Disallow: /dir/
Host: www.example.com

Also, the Wikipedia article states that Google understands the Host directive too, but it gave no further information.

At robotstxt.org, I didn’t find anything on Host (nor on Crawl-delay, which Wikipedia also mentions).

1. Is it encouraged to use the Host directive at all?
2. Are there any resources from Google on this specific robots.txt directive?
3. How compatible is it with other crawlers?

‡ At least since the beginning of 2021, the linked entry no longer deals with the directive in question.


asked Feb 25 '14 10:02

dakab

1 Answer

The original robots.txt specification says:

Unrecognised headers are ignored.

The specification says "headers" without defining the term anywhere. But since the sentence appears in the section about the format, in the same paragraph as User-agent and Disallow, it seems safe to assume that "headers" means "field names".

So yes, you can use Host or any other field name.

- Robots.txt parsers that support such fields, well, support them.
- Robots.txt parsers that don’t support such fields must ignore them.

But keep in mind: since these fields are not specified by the robots.txt project, you can’t be sure that different parsers interpret them in the same way. So you’d have to check each supporting parser individually.
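To illustrate the "ignore what you don’t recognise" behaviour, here is a minimal sketch using Python’s standard urllib.robotparser module, which (as far as I can tell) has no special handling for Host and simply skips the line. The extract_field helper is hypothetical and not part of any library. It only shows how a parser that does want to read Host might pick the value out, and it is not how Yandex or any other crawler actually implements it.

```python
from urllib.robotparser import RobotFileParser

# The example file from the question.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /dir/
Host: www.example.com
"""


def extract_field(robots_txt, field_name):
    """Hypothetical helper: collect all values of one field, case-insensitively."""
    values = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        if name.strip().lower() == field_name.lower():
            values.append(value.strip())
    return values


# A parser without Host support: the unknown line is skipped,
# and the Disallow rule still applies as usual.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked = parser.can_fetch("*", "http://www.example.com/dir/page.html")
allowed = parser.can_fetch("*", "http://www.example.com/index.html")

# A parser that chooses to support Host would read the value itself.
hosts = extract_field(ROBOTS_TXT, "Host")

print(blocked, allowed, hosts)
```

The rules a non-supporting parser applies are unaffected by the extra line. What a supporting parser then does with the Host value is up to that parser.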


answered Oct 06 '22 17:10

unor

