 

Block all bots/crawlers/spiders for a special directory with htaccess

I'm trying to block all bots/crawlers/spiders for a specific directory. How can I do that with .htaccess? I searched a bit and found a solution based on blocking by user agent:

RewriteCond %{HTTP_USER_AGENT} googlebot

Now I would need more user agents (covering all known bots), and the rule should apply only to that separate directory. I already have a robots.txt, but not all crawlers respect it. Blocking by IP address is not an option. Are there other solutions? I know about password protection, but I would have to ask first whether that is an option. Regardless, I'm looking for a solution based on the user agent.

asked May 24 '12 by testing


2 Answers

You need to have mod_rewrite enabled. Place this in a .htaccess file in that folder. If it is placed elsewhere (e.g. a parent folder), the RewriteRule pattern needs to be modified slightly to include that folder name.

RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|Baiduspider) [NC]
RewriteRule .* - [R=403,L]
  1. I have entered only a few bots -- add any others yourself (letter case does not matter).
  2. This rule will respond with a "403 Access Forbidden" code for such requests. You can change this to another HTTP response code if you really want (403 is the most appropriate here given your requirements).
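Putting those pieces together, a fuller .htaccess sketch for the protected directory might look like the following. Note that [F] is mod_rewrite's shorthand for returning 403 Forbidden, and the extra bot names beyond the answer's list are only examples, not an exhaustive inventory:

```apache
# .htaccess inside the protected directory -- requires mod_rewrite.
# Bot names below are a sample only; extend the alternation as needed.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|Baiduspider|yandex|slurp|duckduckbot) [NC]
RewriteRule .* - [F,L]
```

Because the condition is a case-insensitive substring match ([NC]), "Googlebot/2.1" and "googlebot" are both caught by the same alternation.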
answered Sep 18 '22 by LazyOne


Why use .htaccess or mod_rewrite for a job that is specifically meant for robots.txt? Here is the robots.txt snippet you will need to block a specific set of directories:

User-agent: *
Disallow: /subdir1/
Disallow: /subdir2/
Disallow: /subdir3/

This will block all search bots in directories /subdir1/, /subdir2/ and /subdir3/.
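Since robots.txt rules are grouped per user agent, you can also single out a specific crawler while keeping a looser default for everyone else. A small sketch (AhrefsBot is just an example name); compliant bots obey only the group with the most specific matching User-agent line, so the two groups do not combine:

```
User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow: /subdir1/
```

Keep in mind robots.txt is purely advisory: well-behaved crawlers honor it, but misbehaving ones must still be stopped server-side, e.g. with the .htaccess rule from the other answer.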

For more explanation see here: http://www.robotstxt.org/orig.html

answered Sep 17 '22 by anubhava