
Robots.txt, how to allow access only to domain root, and no deeper? [closed]

Tags:

robots.txt

I want to allow crawlers to access my domain's root directory (i.e. the index.html file), but nothing deeper (i.e. no subdirectories). I do not want to have to list and deny every subdirectory individually within the robots.txt file. Currently I have the following, but I think it is blocking everything, including stuff in the domain's root.

User-agent: *
Allow: /$
Disallow: /

How can I write my robots.txt to accomplish what I am trying for?

Thanks in advance!

Asked Mar 05 '11 by WASa2


1 Answer

There's nothing that will work for all crawlers. There are two options that might be useful to you.

Crawlers that support wildcards should accept something like:

Disallow: /*/

The major search engine crawlers understand the wildcards, but unfortunately most of the smaller ones don't.
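Purely as an illustration of how a wildcard-aware crawler (Googlebot-style matching) tends to interpret such a rule, here is a small Python sketch. The helper name is made up for this example and is not any crawler's actual code:

import re

def robots_rule_to_regex(rule):
    # Googlebot-style matching: a rule is matched as a prefix of the URL path,
    # '*' matches any run of characters, and '$' anchors the end of the path.
    pattern = re.escape(rule).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile(pattern)

blocked = robots_rule_to_regex("/*/")            # Disallow: /*/
print(bool(blocked.match("/subdir/page.html")))  # True  -> blocked
print(bool(blocked.match("/index.html")))        # False -> still crawlable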

If you have relatively few files in the root and you don't often add new files, you could use Allow to allow access to just those files, and then use Disallow: / to restrict everything else. That is:

User-agent: *
Allow: /index.html
Allow: /coolstuff.jpg
Allow: /morecoolstuff.html
Disallow: /

The order here is important. Crawlers are supposed to take the first match. So if your first rule was Disallow: /, a properly behaving crawler wouldn't get to the following Allow lines.

If a crawler doesn't support Allow, it will see the Disallow: / and not crawl anything on your site, provided, of course, that it ignores directives in robots.txt that it doesn't understand.

All the major search engine crawlers support Allow, and a lot of the smaller ones do, too. It's easy to implement.
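If you want to sanity-check the allow-list approach before deploying it, one option (a sketch, assuming Python is available; example.com is a placeholder domain) is the standard library's urllib.robotparser, which uses the same first-match ordering described above:

from urllib.robotparser import RobotFileParser

# The same allow-list rules as in the answer above.
rules = """\
User-agent: *
Allow: /index.html
Allow: /coolstuff.jpg
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "https://example.com/index.html"))        # True
print(parser.can_fetch("*", "https://example.com/coolstuff.jpg"))     # True
print(parser.can_fetch("*", "https://example.com/subdir/page.html"))  # False

Note that this parser does not implement the * wildcard, so it can only validate the allow-list variant, not the Disallow: /*/ approach.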

Answered Nov 07 '22 by Jim Mischel