
robots.txt to disallow all pages except one? Do they override and cascade?

Tags:

robots.txt

I want one page of my site to be crawled and no others.

Also, if it's any different from the above, I'd also like to know the syntax for disallowing everything but the root (index) of the website.

# robots.txt for http://example.com/

User-agent: *
Disallow: /style-guide
Disallow: /splash
Disallow: /etc
Disallow: /etc
Disallow: /etc
Disallow: /etc
Disallow: /etc

Or can I do it like this?

# robots.txt for http://example.com/

User-agent: *
Disallow: /
Allow: /under-construction

I should also mention that this is a WordPress install, and "under-construction," for example, is set as the front page, so in that case it acts as the index.

I think what I need is for http://example.com to be crawled, but no other pages.

# robots.txt for http://example.com/

User-agent: *
Disallow: /*

Would this mean disallow anything after the root?

asked Nov 08 '13 by nouveau



3 Answers

The easiest way to allow access to just one page would be:

User-agent: *
Allow: /under-construction
Disallow: /

The original robots.txt specification says that crawlers should read robots.txt from top to bottom, and use the first matching rule. If you put the Disallow first, then many bots will see it as saying they can't crawl anything. By putting the Allow first, those that apply the rules from top to bottom will see that they can access that page.
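To sanity-check that ordering locally, here is a rough sketch using Python's standard-library urllib.robotparser (my own illustration, not part of the answer): it also evaluates rules in the order they appear and does not understand wildcards, so it models the original-spec behavior described above.

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /under-construction
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
rp.modified()  # mark the rules as loaded; otherwise can_fetch() assumes the file was never read

print(rp.can_fetch("ExampleBot", "http://example.com/under-construction"))  # True
print(rp.can_fetch("ExampleBot", "http://example.com/some-other-page"))     # False
print(rp.can_fetch("ExampleBot", "http://example.com/"))                    # False - the root itself is still blocked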

The expression rules are simple: the expression Disallow: / says "disallow anything that starts with a slash." So that means everything on the site.

Your Disallow: /* means the same thing to Googlebot and Bingbot, but bots that don't support wildcards could see the /* and think that you meant a literal *. So they could assume that it was okay to crawl /*foo/bar.html.

If you just want to crawl http://example.com, but nothing else, you might try:

Allow: /$
Disallow: /

The $ means "end of string," just like in regular expressions. Again, that'll work for Google and Bing, but won't work for other crawlers if they don't support wildcards.
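For bots that do support wildcards, the matching is roughly "treat * as any sequence of characters and a trailing $ as the end of the URL path." Here is a small sketch of that idea in Python (my own illustration of the general technique, not Google's or Bing's actual implementation):

import re

def robots_pattern_to_regex(pattern):
    # Escape everything, then re-enable the two robots.txt wildcards:
    # '*' matches any sequence of characters, a trailing '$' anchors the end of the path.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile(regex)

allow_root_only = robots_pattern_to_regex("/$")  # from "Allow: /$"
disallow_all = robots_pattern_to_regex("/")      # from "Disallow: /"

print(bool(allow_root_only.match("/")))       # True  - only the homepage matches
print(bool(allow_root_only.match("/about")))  # False - falls through to Disallow: /
print(bool(disallow_all.match("/about")))     # True  - everything starting with / is disallowed

A bot with no wildcard support would instead treat /$ as a literal prefix, which is why the answer hedges about crawlers other than Google and Bing.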

answered Sep 28 '22 by Jim Mischel

If you log into Google Webmaster Tools, go to Crawl in the left panel, then to Fetch as Google. There you can test how Google will crawl each page.

In the case of blocking everything but the homepage:

User-agent: *
Allow: /$
Disallow: /

will work.

answered Sep 28 '22 by Kohjah Breese


You can use either of the two blocks below; both will work:

User-agent: *
Allow: /$
Disallow: /

or

User-agent: *
Allow: /index.php
Disallow: /

The Allow must come before the Disallow because the file is read from top to bottom.

Disallow: / says "disallow anything that starts with a slash." So that means everything on the site.

The $ means "end of string," like in regular expressions, so the result of Allow: /$ is that only your homepage (the index) is allowed.
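If you want to check the /index.php variant locally, here is a rough sketch with Python's standard-library urllib.robotparser (my assumption, not part of this answer; it handles plain Allow/Disallow prefixes but not the $ wildcard, so it cannot verify the /$ variant):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /index.php
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
rp.modified()  # mark the rules as loaded so can_fetch() evaluates them

print(rp.can_fetch("ExampleBot", "http://example.com/index.php"))  # True
print(rp.can_fetch("ExampleBot", "http://example.com/about"))      # False
print(rp.can_fetch("ExampleBot", "http://example.com/"))           # False - the bare / is not an /index.php prefix

Note that with the /index.php variant only that explicit URL is allowed; the bare / is still caught by Disallow: /, which is why the /$ form is usually the better fit for Google and Bing.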

answered Sep 28 '22 by Aominé