I can't seem to get this to work but it seems really basic.
I want the domain root to be crawled:

    http://www.example.com

But nothing else to be crawled, and all subdirectories are dynamic:

    http://www.example.com/*
I tried:

    User-agent: *
    Allow: /
    Disallow: /*/
But the Google webmaster test tool says all subdirectories are allowed.
Anyone have a solution for this? Thanks :)
You can tell search engines not to access certain files, pages, or sections of your website. This is done using the Disallow directive in robots.txt.
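For example, a minimal robots.txt that blocks crawlers from one hypothetical directory (/private/ is just a placeholder path) while leaving the rest of the site open:

    User-agent: *
    Disallow: /private/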
From the robots documentation for meta tags: you can use the following meta tag on all pages of your site to let bots know that those pages are not supposed to be indexed. For this to apply to your entire site, you have to add the meta tag to every page.
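The tag referred to here is presumably the standard robots noindex meta tag, which goes in each page's <head>:

    <meta name="robots" content="noindex">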
If you don't want your crawler to respect robots.txt, then just write it so it doesn't. You might be using a library that respects robots.txt automatically; if so, you will have to disable that (which will usually be an option you pass to the library when you call it).
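For instance, assuming you were using Scrapy (just one example of such a library), robots.txt handling is controlled by its ROBOTSTXT_OBEY setting:

    # settings.py -- Scrapy fetches and obeys robots.txt by default
    # in new projects; set this to False to disable that behaviour.
    ROBOTSTXT_OBEY = False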
The “User-agent: *” part means that the rule applies to all robots. The “Disallow: /” part means that it covers your entire website. In effect, this will tell all robots and web crawlers that they are not allowed to access or crawl any part of your site.
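Put together, that blanket block looks like this:

    User-agent: *
    Disallow: /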
According to the Backus-Naur Form (BNF) parsing definitions in Google's robots.txt documentation, the order of the Allow and Disallow directives doesn't matter, so changing the order really won't help you. Instead, use the $ operator to indicate the closing of your path. $ means 'the end of the line' (i.e. don't match anything from this point on).
Test this robots.txt. I'm certain it should work for you (I've also verified in Google Search Console):
    User-agent: *
    Allow: /$
    Disallow: /
This will allow http://www.example.com and http://www.example.com/ to be crawled, but everything else will be blocked.
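If you want to sanity-check the matching logic yourself, here is a minimal Python sketch of the rules as Google documents them (the longest matching pattern wins, Allow wins ties, and $ anchors the end of the URL path). The rule_matches and is_allowed helpers are hypothetical names for illustration, not part of any library:

    import re

    def rule_matches(pattern, path):
        # Translate a robots.txt path pattern into a regex:
        # '*' matches any character sequence; a trailing '$' anchors the end.
        regex = re.escape(pattern).replace(r"\*", ".*")
        if regex.endswith(r"\$"):
            regex = regex[:-2] + "$"
        return re.match(regex, path) is not None

    def is_allowed(rules, path):
        # Per Google's documentation, the matching rule with the longest
        # pattern wins; on a tie, Allow beats Disallow.
        verdict, best_len = True, -1  # no matching rule means allowed
        for directive, pattern in rules:
            if rule_matches(pattern, path):
                longer = len(pattern) > best_len
                tie = len(pattern) == best_len and directive == "allow"
                if longer or tie:
                    verdict, best_len = (directive == "allow"), len(pattern)
        return verdict

    rules = [("allow", "/$"), ("disallow", "/")]
    print(is_allowed(rules, "/"))     # True  -> the root is crawlable
    print(is_allowed(rules, "/foo"))  # False -> everything else is blocked

This is only a toy matcher for these two rules; real crawlers also normalize URLs and handle per-user-agent groups.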
Note: the Allow directive satisfies your particular use case, but if you have index.html or default.php, these URLs will not be crawled.
Side note: I'm only really familiar with Googlebot and bingbot behaviors. If there are any other engines you are targeting, they may or may not have specific rules on how the directives are listed out. So if you want to be "extra" sure, you can always swap the positions of the Allow and Disallow directive blocks; I just set them that way to debunk some of the comments.