
How to block search engines from indexing all URLs beginning with origin.domainname.com

I have www.domainname.com and origin.domainname.com pointing to the same codebase. Is there a way I can prevent all URLs under the hostname origin.domainname.com from getting indexed?

Is there some rule in robots.txt to do it? Both hostnames point to the same folder. I also tried redirecting origin.domainname.com to www.domainname.com in the .htaccess file, but it doesn't seem to work.

If anyone has had a similar problem and can help, I would be grateful.

Thanks

Asked Oct 05 '10 by Loveleen Kaur


People also ask

How do I stop search engines from indexing?

You can prevent a page or other resource from appearing in Google Search by including a noindex meta tag or header in the HTTP response. When Googlebot next crawls that page and sees the tag or header, Google will drop that page entirely from Google Search results, regardless of whether other sites link to it.
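For example, the meta-tag form of this looks like the following, and the equivalent HTTP header can be sent from Apache (a minimal sketch, assuming Apache with mod_headers enabled; adjust to your own setup):

<!-- in the <head> of each page that should not be indexed -->
<meta name="robots" content="noindex">

# or, in .htaccess, sent with every response served (requires mod_headers)
Header set X-Robots-Tag "noindex"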

How do I block a website from search engines?

Exclude sites from your search engine: click Add under "Sites to exclude", enter the URL you want to exclude, and select whether you want to include any pages that match or only that specific page. Click Save.

What does it mean to discourage search engines from indexing this site?

When you tick "Discourage search engines from indexing this site," WordPress modifies your robots.txt file (a file that gives instructions to spiders on how to crawl your site). It can also add a meta tag to your site's header that tells Google and other search engines not to index any content on your entire site.

Can I block a search engine?

Blocking search engines with meta tags: the robots meta tag allows programmers to set parameters for bots, or search engine spiders. These tags can be used to block bots from indexing and crawling an entire site or just parts of it.


1 Answer

You can rewrite robots.txt to another file (let's name it 'robots_no.txt') containing:

User-Agent: *
Disallow: /

(source: http://www.robotstxt.org/robotstxt.html)

The .htaccess file would look like this:

RewriteEngine On
# Any host other than www.example.com gets the blocking robots file
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^robots\.txt$ robots_no.txt [L]
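As an alternative sketch for the asker's setup (assuming Apache with mod_setenvif and mod_headers enabled, and using the hostname origin.domainname.com from the question), you could skip the robots.txt rewrite and instead send a noindex header for every response on the origin host:

# Mark requests whose Host header is origin.domainname.com
SetEnvIfNoCase Host ^origin\.domainname\.com$ NOINDEX_HOST
# Tell crawlers not to index anything served under that host (requires mod_headers)
Header set X-Robots-Tag "noindex, nofollow" env=NOINDEX_HOST

Unlike a Disallow rule in robots.txt, which only stops crawling, the noindex header also removes already-indexed pages from results once they are recrawled.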

Or use a customized robots.txt for each (sub)domain:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^sub\.example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^www\.example\.org$ [OR]
RewriteCond %{HTTP_HOST} ^example\.org$
# Rewrites robots.txt for the above (sub)domains <domain> to robots_<domain>.txt
# e.g. example.org -> robots_example.org.txt
RewriteRule ^robots\.txt$ robots_%{HTTP_HOST}.txt [L]
# in all other cases, serve the default 'robots.txt'
RewriteRule ^robots\.txt$ - [L]
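With that rule in place, each host serves its own file. For the setup in the question, the two per-host files might look like this (hypothetical filenames following the robots_<domain>.txt pattern above):

# robots_www.domainname.com.txt: allow crawling of everything
User-agent: *
Disallow:

# robots_origin.domainname.com.txt: block crawling of everything
User-agent: *
Disallow: /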

Instead of asking search engines to block all pages on hosts other than www.example.com, you can also use <link rel="canonical">.

If http://example.com/page.html and http://example.org/~example/page.html both point to http://www.example.com/page.html, put the following tag in the <head>:

<link rel="canonical" href="http://www.example.com/page.html">

See also Google's article about rel="canonical".

Answered Sep 22 '22 by Lekensteyn