How to prevent staging sites from being indexed by search engines

I would like my staging websites not to be indexed by search engines (Google first of all).

I have heard WordPress is good at this, but I would like to stay technology-agnostic.

Is robots.txt enough? We would like to keep anonymous access so that customers can see their website without having to log in.

Do I have to add nofollow to every page?

toutpt asked Aug 30 '12 13:08



2 Answers

I'm normally against exposing staging servers to the public web, but if that's the best solution for your workflow, here are a few things you can consider:

Minimal Approach

  • Create a new domain for the staging server (e.g. example-stage.com)
  • Add a robots.txt containing Disallow: /
  • Verify the domain in Google and Bing Webmaster Tools

The minimal approach is the bare minimum to make sure you don't shoot yourself in the foot with duplicate content everywhere. Registering a separate domain gives users a clean division between what is staging and what isn't. It is also a bit cleaner when you need to move environments around, but that's more of an operational concern. CNAMEs will work as well, but remember to register each CNAME with Google and Bing Webmaster Tools. This way you can use the domain removal tool if you need to.
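As a rough sanity check on the robots.txt step, the sketch below uses Python's standard urllib.robotparser to confirm that a Disallow: / rule blocks a crawler from every path. The User-agent: * line and the example-stage.com URL are assumptions added for illustration; the helper name is_crawlable is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed full form of the robots.txt from the list above:
# apply the rule to every crawler and block every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Check one staging URL against the rules above.
    print(is_crawlable(ROBOTS_TXT, "Googlebot", "https://example-stage.com/pricing"))
```

This only tells you how a well-behaved crawler should read the file; it says nothing about whether a URL is already in an index.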

Advised Approach

  • Add authentication (HTTP or otherwise) in front of requests
  • Respond with an appropriate response code if not permitted (e.g. 401 Unauthorized)
  • Everything else in the Minimal Approach above

Adding a robots.txt prevents search engines from accessing the content and indexing it. However, that doesn't mean they won't index the URL. If a search engine knows about a given URL, it may still add it to the search result index. You'll sometimes see these in the search results: the title tends to be the URL, with no description. To prevent this from happening, the search engines need to be told not to show the content or the URLs. Adding authentication in front and not responding with a 200 OK status code is a strong signal to the engines not to add these URLs to their index. In my experience, I have never seen a page with a 401 response code listed in a search engine index.
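To make the 401 idea concrete, here is a minimal sketch of HTTP Basic authentication as a WSGI wrapper, using only the Python standard library. The credentials, the realm name, and the staging_site app are hypothetical placeholders. In practice the web server (for example Apache) would usually handle this, so treat it as an illustration of the behaviour, not a recommended implementation.

```python
import base64
import hmac

# Hypothetical credentials for illustration; load real ones from configuration.
STAGING_USER = "client"
STAGING_PASSWORD = "change-me"


def _credentials_ok(auth_header: str) -> bool:
    """Check an 'Authorization: Basic ...' header against the expected pair."""
    scheme, _, encoded = auth_header.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    expected = f"{STAGING_USER}:{STAGING_PASSWORD}"
    # Constant-time comparison avoids leaking how much of the value matched.
    return hmac.compare_digest(decoded.encode("utf-8"), expected.encode("utf-8"))


def require_basic_auth(app):
    """Wrap a WSGI app so unauthenticated requests get 401 instead of 200."""
    def wrapped(environ, start_response):
        if _credentials_ok(environ.get("HTTP_AUTHORIZATION", "")):
            return app(environ, start_response)
        start_response("401 Unauthorized", [
            ("WWW-Authenticate", 'Basic realm="staging"'),
            ("Content-Type", "text/plain; charset=utf-8"),
        ])
        return [b"Authentication required\n"]
    return wrapped


def staging_site(environ, start_response):
    """Stand-in for the real staging application."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"staging content\n"]


application = require_basic_auth(staging_site)
```

A crawler that requests any URL without credentials gets the 401 response and never sees the page body.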

Preferred Approach

  • Put staging sites behind an IP filter such as iptables (e.g. accessible only from a given IP range)
  • Add a meta robots tag or X-Robots-Tag header to each page with a value of NOINDEX, NOFOLLOW
  • Everything else in the Advised Approach

Putting the staging sites behind an IP filter ensures that only your clients are able to access the site. This can be a problem if they want to access it from other computers, and it is sometimes a maintenance headache, but it's the best approach if you don't want your staging environment indexed. A word of caution: make sure that all other requests (e.g. from search engines and non-clients) get nothing back. They should receive a timeout and never a 200 OK. Serving them other content could be mistaken for cloaking, which you don't want.
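The range check behind an IP filter can be sketched with Python's ipaddress module. The networks below are hypothetical (documentation-reserved ranges), and is_allowed is an illustrative helper. Note that application code like this cannot produce the timeout described above; dropping packets so that other visitors time out has to happen at the firewall level.

```python
import ipaddress

# Hypothetical client ranges, using documentation-reserved addresses.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),    # e.g. a client's office range
    ipaddress.ip_network("198.51.100.17/32"),  # e.g. a single client machine
]


def is_allowed(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside one of the allowed networks."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Anything that is not a valid IP address is rejected.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)
```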

Additionally, to be extra safe, I would also add a meta robots tag or X-Robots-Tag header with NOINDEX, NOFOLLOW to each page, just in case the IP filter fails because of a misconfiguration or authentication ever fails. It's rare, but it happens when people touch the configuration for other reasons. As with the robots.txt file, you can really shoot yourself in the foot with these page-level robots commands if they ever get pushed out to production. So make sure your dev / staging environments have cleanly separated configuration; pushing a NOINDEX, NOFOLLOW or a Disallow: / to production would be disastrous for your production site.
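One way to keep the header out of production is to make it depend on an explicit environment setting. The sketch below is a WSGI wrapper that adds X-Robots-Tag only outside production. The APP_ENV variable name and the helper names are assumptions for illustration, not something the answer prescribes.

```python
import os

NOINDEX_HEADER = ("X-Robots-Tag", "noindex, nofollow")


def robots_headers(environment: str) -> list:
    """Return the extra response headers for the given environment name."""
    # Production never gets the header; every other environment does.
    if environment == "production":
        return []
    return [NOINDEX_HEADER]


def add_robots_header(app, environment=None):
    """Wrap a WSGI app so its responses carry the environment's robots header."""
    # APP_ENV is a hypothetical variable name. An unset variable is treated as
    # production, so a missing setting can never de-index the live site.
    env_name = environment or os.environ.get("APP_ENV", "production")
    extra = robots_headers(env_name)

    def wrapped(environ, start_response):
        def start_with_header(status, headers, exc_info=None):
            return start_response(status, list(headers) + extra, exc_info)
        return app(environ, start_with_header)

    return wrapped
```

Defaulting to production is a deliberate trade-off: a staging server with a missing setting loses this extra layer, which is why the answer treats the header as a backup to the IP filter and authentication, not a replacement.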

eywu answered Oct 25 '22 05:10


You can disable indexing server-wide by adding the setting below to the global Apache configuration, or use the same directive inside a vhost to disable it for that vhost only (the Header directive requires mod_headers).

Header set X-Robots-Tag "noindex, nofollow"

Once this is done, you can test it by checking the headers Apache returns.

curl -I staging.mywebsite.com
HTTP/1.1 302 Found
Date: Sat, 26 Nov 2016 22:36:33 GMT
Server: Apache/2.4.18 (Ubuntu)
Location: /pages/
X-Robots-Tag: noindex, nofollow
Content-Type: text/html; charset=UTF-8
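If you want to automate that check, the sketch below inspects a set of response headers for the noindex directive. The function name blocks_indexing is hypothetical, and the parsing is deliberately simple: it assumes a plain comma-separated value like the one in the curl output above and ignores the user-agent-scoped form of the header.

```python
def blocks_indexing(headers: dict) -> bool:
    """Return True if an X-Robots-Tag header contains the noindex directive."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-robots-tag":
            directives = {part.strip().lower() for part in value.split(",")}
            if "noindex" in directives:
                return True
    return False


# Headers copied from the curl output above.
SAMPLE_HEADERS = {
    "Date": "Sat, 26 Nov 2016 22:36:33 GMT",
    "Server": "Apache/2.4.18 (Ubuntu)",
    "Location": "/pages/",
    "X-Robots-Tag": "noindex, nofollow",
    "Content-Type": "text/html; charset=UTF-8",
}

if __name__ == "__main__":
    print(blocks_indexing(SAMPLE_HEADERS))
```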
nisamudeen97 answered Oct 25 '22 05:10