 

robots.txt allow root only, disallow everything else?

Tags:

robots.txt

I can't seem to get this to work but it seems really basic.

I want the domain root to be crawled

http://www.example.com 

But nothing else to be crawled and all subdirectories are dynamic

http://www.example.com/* 

I tried

User-agent: *
Allow: /
Disallow: /*/

but the Google webmaster test tool says all subdirectories are allowed.

Anyone have a solution for this? Thanks :)

cotopaxi asked Aug 29 '11 05:08



1 Answer

According to the Backus-Naur Form (BNF) parsing definitions in Google's robots.txt documentation, the order of the Allow and Disallow directives doesn't matter. So changing the order really won't help you.

Instead, use the $ operator to mark the end of your path. $ means "the end of the URL path", i.e. nothing may follow it.

Test this robots.txt. I'm certain it should work for you (I've also verified in Google Search Console):

User-agent: *
Allow: /$
Disallow: /

This will allow http://www.example.com and http://www.example.com/ to be crawled but everything else blocked.
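To see why this works, here is a small sketch of Google-style robots.txt matching in Python: the longest matching pattern wins, and Allow beats Disallow on a tie. This is a simplified illustration, not a full parser (note that the standard library's urllib.robotparser follows the original spec and, to my knowledge, does not implement the $ anchor, which is why it is hand-rolled here):

```python
import re

def rule_matches(pattern, path):
    """Match a robots.txt pattern against a URL path.
    '*' matches any run of characters; a trailing '$' anchors the end."""
    regex = re.escape(pattern).replace(r'\*', '.*')
    if regex.endswith(r'\$'):
        regex = regex[:-2] + '$'
    return re.match(regex, path) is not None

def is_allowed(rules, path):
    """Apply Google's precedence: the longest matching pattern wins,
    and an Allow beats a Disallow of the same length."""
    best = None
    for directive, pattern in rules:
        if rule_matches(pattern, path):
            key = (len(pattern), directive == 'Allow')
            if best is None or key > best:
                best = key
    return True if best is None else best[1]

# The rules from the answer: allow the bare root, disallow everything else.
rules = [('Allow', '/$'), ('Disallow', '/')]
print(is_allowed(rules, '/'))      # True  -- the root is crawlable
print(is_allowed(rules, '/page'))  # False -- everything else is blocked
```

For the root path, `Allow: /$` (length 2) matches and outranks `Disallow: /` (length 1); for any longer path, only the Disallow matches, so it is blocked.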

Note: the Allow directive satisfies your particular use case, but if you have an index.html or default.php, those URLs will not be crawled.
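If you did need an index document crawled as well, one possible extension (assuming Google-style support for the $ anchor, and using index.html purely as an example filename) would be an explicit Allow rule for it, which wins over the Disallow because it is the longer match:

```
User-agent: *
Allow: /$
Allow: /index.html$
Disallow: /
```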

side note: I'm only really familiar with Googlebot and bingbot behaviors. If there are any other engines you are targeting, they may or may not have specific rules on how the directives are listed out. So if you want to be "extra" sure, you can always swap the positions of the Allow and Disallow directive blocks, I just set them that way to debunk some of the comments.

eywu answered Sep 29 '22 18:09