
How to restrict the site from being indexed

I know this question has been asked many times, but I want to be more specific.

I have a development domain and moved the site there to a subfolder. Let's say from:

http://www.example.com/

To:

http://www.example.com/backup

So I want the subfolder not to be indexed by search engines at all. I've put a robots.txt file with the following contents in the subfolder. (Can it live in a subfolder, or does it always have to be at the root? I ask because I want the content at the root to stay visible to search engines.)

User-agent: *
Disallow: /

Maybe I need to replace it and put the following in the root instead:

User-agent: *
Disallow: /backup

The other thing is, I read somewhere that certain robots don't respect the robots.txt file, so would just putting an .htaccess file in the /backup folder do the job?

Order deny,allow
Deny from all

Any ideas?

Ilian Andreev asked May 26 '12 10:05




1 Answer

This would keep compliant crawlers out of that directory:

User-agent: *
Disallow: /backup/

Additionally, your robots.txt file must be placed in the root of your domain; crawlers do not look for it in subfolders. In this case, the file should be reachable in your browser at http://example.com/robots.txt
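
To illustrate why a copy inside /backup is not consulted, here is a minimal Python sketch of how a crawler typically works out which robots.txt governs a page: it keeps only the scheme and host and discards the path. The helper name `robots_url` is hypothetical, made up for this example, and real crawlers may differ in details.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL that would govern page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host (plus port); the page's path and query are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page inside the subfolder still maps to the robots.txt at the root of its host.
print(robots_url("http://www.example.com/backup/index.html"))
# A subdomain is a different host, so it gets its own robots.txt.
print(robots_url("http://dev.example.com/some/page"))
```

Because the lookup is per host, a dev subdomain would need its own robots.txt at its own root.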

As an aside, you may want to consider setting up a subdomain for your development site, something like http://dev.example.com. Doing so would allow you to completely separate the dev stuff from the production environment and would also ensure that your environments more closely match.

For instance, any absolute paths to JavaScript files, CSS, images or other resources may not work the same from dev to production, and this may cause some issues down the road.

For more information on how to configure this file, see the robotstxt.org site. Good luck!

As a final note, Google Webmaster Tools has a section where you can see what is blocked by the robots.txt file:

To see which URLs Google has been blocked from crawling, visit the Blocked URLs page of the Health section of Webmaster Tools.

I strongly suggest you use this tool, as an incorrectly configured robots.txt file could have a significant impact on the performance of your website.
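
If you also want a quick local sanity check before uploading the file, something like the following may help. It is only a sketch using Python's standard `urllib.robotparser` module; the function name `is_allowed` and the sample URLs are assumptions for illustration, and it shows how this one parser reads the rules, not how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# The rules suggested above, as they would appear in the root robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /backup/
"""


def is_allowed(url, user_agent="*"):
    """Return True if the rules above let user_agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network request is made.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The root of the site should stay crawlable; pages under /backup/ should not.
print(is_allowed("http://www.example.com/"))
print(is_allowed("http://www.example.com/backup/index.html"))
```

This only checks the rule syntax; it says nothing about robots that ignore robots.txt altogether.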

jmort253 answered Nov 02 '22 22:11
