How to stop search engines from crawling the whole website?

I want to stop search engines from crawling my whole website.

I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.

So I want to add another layer of security (in theory) to help prevent unauthorized access by removing all search engine bot/crawler access to it. Having Google index our site to make it searchable is pointless from a business perspective, and it just gives an attacker another way to find the website in the first place.

I know that in robots.txt you can tell search engines not to crawl certain directories.

Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?

Is this best done with robots.txt, or is it better handled by .htaccess or something else?

asked Feb 01 '12 by Iain Simpson


People also ask

How do I stop Google crawlers?

You can block access in the following ways: to prevent your site from appearing in Google News, block access to Googlebot-News using a robots.txt file. To prevent your site from appearing in both Google News and Google Search, block access to Googlebot using a robots.txt file.
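
For example, here is a minimal robots.txt sketch of that advice, assuming you only want to block Google's crawlers (the answer below shows the wildcard rule that blocks all well-behaved bots):

# Keep the site out of Google News only
User-agent: Googlebot-News
Disallow: /

# Keep the site out of Google Search and Google News
User-agent: Googlebot
Disallow: /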

How do you stop robots from looking at things on a website?

To prevent specific articles on your site from being indexed by all robots, use the following meta tag: <meta name="robots" content="noindex, nofollow">. To prevent robots from crawling images on a specific article, use the following meta tag: <meta name="robots" content="noimageindex">.
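
The meta tag approach works one page at a time. A site-wide equivalent is to send the same directives as an HTTP response header; a minimal sketch, assuming Apache with mod_headers enabled:

# Tell crawlers not to index or follow anything served from this directory
Header set X-Robots-Tag "noindex, nofollow"

Note that a crawler only sees this header (or the meta tag) if it is allowed to fetch the page, so it complements rather than replaces robots.txt rules.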

Does Google crawl all websites?

Google's crawlers are also programmed such that they try not to crawl the site too fast to avoid overloading it. This mechanism is based on the responses of the site (for example, HTTP 500 errors mean "slow down") and settings in Search Console. However, Googlebot doesn't crawl all the pages it discovered.

What does it mean when Google crawls a website?

Crawling is the process of finding new or updated pages to add to Google (Google crawled my website). One of the Google crawling engines crawls (requests) the page. The terms "crawl" and "index" are often used interchangeably, although they are different (but closely related) actions.


1 Answer

This is best handled with a robots.txt file, though it only works for bots that respect the file.

To block the whole site, add this to the robots.txt in the root directory of your site:

User-agent: *
Disallow: /

To restrict access to your site for everyone else, .htaccess is better, but you would need to define access rules, for example by IP address.

Below are .htaccess rules that allow only requests from your company's IP address and deny everyone else:

Order deny,allow
# Deny everyone by default
Deny from all
# Enter your company's IP address here (it overrides the Deny above)
Allow from 255.1.1.1
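
The directives above use Apache 2.2 syntax. On Apache 2.4 and later, Order/Allow/Deny are deprecated in favour of Require; a minimal sketch of the equivalent rule, assuming mod_authz_host is enabled (it is by default) and using the same placeholder IP:

# Apache 2.4 syntax: only your company's IP address may access the site
Require ip 255.1.1.1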
answered Oct 05 '22 by Ulrich Palha