I would like my staging websites not to be indexed by search engines (Google first of all).
I have heard WordPress is good at doing this, but I would like to be technology agnostic.
Is robots.txt enough? We would like to keep anonymous access so that customers can see their website without having to log in.
Do I have to add nofollow to every page?
If a publisher is testing a new web design, they can create a subdomain and test the new website design there. Traditionally, the most common way to block Google from indexing a staging site was to create a robots.txt file that keeps Google from crawling the staging site.
You can prevent a page or other resource from appearing in Google Search by including a noindex meta tag or header in the HTTP response. When Googlebot next crawls that page and sees the tag or header, Google will drop that page entirely from Google Search results, regardless of whether other sites link to it.
I'm normally against exposing staging servers to the public web, but if that's the best solution for your workflow, here are a few things you can consider:
Minimal Approach
Disallow: /
The minimal approach covers the very basics, so that you don't shoot yourself in the foot with duplicate content everywhere. Registering a separate domain gives users a clean division between what is staging and what isn't. It is also a bit cleaner when you need to move environments around, but that's more operational. CNAMEs will work as well, but remember to register each CNAME with Google and Bing Webmaster Tools. This way you can use the domain removal tool if you need to.
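If you want to sanity-check that a staging robots.txt really blocks everything, a small script can help. This is only a sketch: it assumes the file consists of a "User-agent: *" group with "Disallow: /", and the function name and sample paths are made up for illustration. It uses Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Assumed staging robots.txt: block every crawler from every path.
STAGING_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_fully_blocked(robots_txt, sample_paths=("/", "/pages/", "/about.html")):
    """Return True if a crawler may not fetch any of the sample paths."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # "Googlebot" has no group of its own here, so the "*" group applies.
    return not any(parser.can_fetch("Googlebot", path) for path in sample_paths)


if __name__ == "__main__":
    print(is_fully_blocked(STAGING_ROBOTS_TXT))
```

Keep in mind this only confirms that crawling is disallowed. It says nothing about whether the bare URLs can still show up in an index.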
Advised Approach
Adding a robots.txt prevents search engines from accessing and indexing the content. However, that doesn't mean they won't index the URL. If a search engine knows about a given URL, it may still add it to the search result index. You'll sometimes see these in the search results: the title tends to be the URL, with no description. To prevent this from happening, the search engines need to be told not to show the content or the URLs. Putting authentication in front of the site, so that it does not respond with a 200 OK status code, is a strong signal to the engines not to add these URLs to their index. In my experience, I have never seen a page with a 401 response code listed in a search engine index.
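To illustrate the idea of answering with a 401 instead of a 200, here is a minimal, technology-neutral sketch as a Python WSGI wrapper. The credentials, the realm name, and the stand-in staging_site app are all hypothetical. In practice you would more likely configure Basic Auth in the web server itself.

```python
import base64
import hmac

# Hypothetical credentials for illustration only; load real ones from a secret store.
STAGING_USER = "client"
STAGING_PASSWORD = "change-me"


def require_basic_auth(app):
    """Wrap a WSGI app so unauthenticated requests get a 401 instead of a 200."""
    expected = "Basic " + base64.b64encode(
        f"{STAGING_USER}:{STAGING_PASSWORD}".encode()
    ).decode()

    def wrapped(environ, start_response):
        supplied = environ.get("HTTP_AUTHORIZATION", "")
        # compare_digest avoids leaking the match position through timing.
        if hmac.compare_digest(supplied.encode(), expected.encode()):
            return app(environ, start_response)
        start_response("401 Unauthorized", [
            ("WWW-Authenticate", 'Basic realm="staging"'),
            ("Content-Type", "text/plain; charset=utf-8"),
        ])
        return [b"Authentication required.\n"]

    return wrapped


def staging_site(environ, start_response):
    """Stand-in for the real staging application."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"staging content\n"]


application = require_basic_auth(staging_site)
```

A crawler that arrives without credentials only ever sees the 401 response, never the page content.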
Preferred Approach
Putting the staging sites behind an IP filter ensures that only your clients are able to access the site. This can be a problem if they want to access it from other computers, and it is sometimes a maintenance headache, but it's the best approach if you don't want your staging environment indexed. A word of caution: make sure that all other requests (e.g. from search engines and non-clients) are not served anything back. They should receive a timeout and never a 200 OK. Serving them different content could be mistaken for cloaking, which you want to avoid.
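The allowlist decision itself is simple. Below is a rough sketch using Python's standard ipaddress module. The networks are documentation-only placeholder ranges, not real client addresses, and the function name is made up. The actual dropping of traffic (so that others see a timeout) would normally happen in the firewall, not in application code.

```python
import ipaddress

# Hypothetical client networks (reserved documentation ranges, replace with real ones).
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_allowed(remote_addr):
    """Return True only if the address belongs to an allowed client network."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Anything that is not a valid IP address is rejected outright.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for candidate in ("203.0.113.7", "198.51.100.1"):
        print(candidate, is_allowed(candidate))
```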
Additionally, to be extra safe, I would also add a meta robots tag or an X-Robots-Tag header to each page with NOINDEX, NOFOLLOW, just in case IP tables fails because of a misconfiguration or authentication ever fails. It's rare, but it happens when people touch the configuration for other reasons. As with the robots.txt file, you can really shoot yourself in the foot with these page-level robots directives if they ever get pushed out to production. So make sure your dev / staging environments use a cleanly separated configuration. Otherwise, pushing out a NOINDEX, NOFOLLOW or a Disallow: / would be disastrous for your production site.
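One way to keep the directive from leaking into production is to make it depend on an explicit environment setting that defaults to "production". The sketch below does this as a Python WSGI wrapper. The APP_ENV variable name and the function name are assumptions for illustration, not part of any framework.

```python
import os


def noindex_on_staging(app, environment=None):
    """Add X-Robots-Tag to every response, but only outside production."""
    # APP_ENV is a hypothetical variable name. Defaulting to "production"
    # means a missing setting can never de-index the live site.
    env = environment or os.environ.get("APP_ENV", "production")
    if env == "production":
        return app

    def wrapped(environ, start_response):
        def start_with_header(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so ours is the only one sent.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", "noindex, nofollow"))
            return start_response(status, headers, exc_info)

        return app(environ, start_with_header)

    return wrapped
```

With this shape, production gets the untouched app back, so there is no code path that could attach the header there by accident.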
You can disable indexing server-wide by adding the setting below to the global Apache configuration, or use the same directive inside a vhost to disable it for that vhost only (mod_headers must be enabled).
Header set X-Robots-Tag "noindex, nofollow"
Once this is done, you can test it by checking the headers Apache returns.
curl -I staging.mywebsite.com
HTTP/1.1 302 Found
Date: Sat, 26 Nov 2016 22:36:33 GMT
Server: Apache/2.4.18 (Ubuntu)
Location: /pages/
X-Robots-Tag: noindex, nofollow
Content-Type: text/html; charset=UTF-8
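If you want to automate that check, for example in a deployment script, something along these lines could work. It is a sketch that scans raw response headers, such as the output of curl -I, for a noindex directive. The function name is made up, and it deliberately ignores the form where a directive is prefixed with a specific crawler name.

```python
# Sample headers taken from the curl output above.
SAMPLE_HEADERS = """\
HTTP/1.1 302 Found
Date: Sat, 26 Nov 2016 22:36:33 GMT
Server: Apache/2.4.18 (Ubuntu)
Location: /pages/
X-Robots-Tag: noindex, nofollow
Content-Type: text/html; charset=UTF-8
"""


def has_noindex(raw_headers):
    """Return True if any X-Robots-Tag header line carries 'noindex'."""
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        # Lines without a colon (such as the status line) are skipped.
        if sep and name.strip().lower() == "x-robots-tag":
            directives = {d.strip().lower() for d in value.split(",")}
            if "noindex" in directives:
                return True
    return False


if __name__ == "__main__":
    print(has_noindex(SAMPLE_HEADERS))
```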