
How to block bad unidentified bots crawling my website?

How can I stop bad, unidentified bots from crawling my website? Some bots whose names do not appear in the Apache statistics in cPanel are consuming a large share of my site's bandwidth.

I have tried robots.txt (batgap.com/robots.txt) and have also added blocking rules to .htaccess, but bandwidth usage has not improved. I don't know the IP addresses of these bots, so I cannot block them by IP. They consume so much bandwidth that I have had to increase the allowance on the server.

Sandeep Kumar asked Mar 30 '12 11:03



4 Answers

I'm from Incapsula and we deal with bad bots on a regular basis.

We recently released bot-related research that gives some insight into the scope of the problem ( http://www.incapsula.com/the-incapsula-blog/item/225-what-google-doesnt-show-you-31-of-website-traffic-can-harm-your-business ), and in light of this data I have to agree with @Leonard Challis: you simply cannot handle bot protection manually.

Having said that, there are bot protection solutions, including free ones (us included), that can help you with bad bots.

BTW, as you mentioned, one byproduct of bad bot visits is lost bandwidth. We recently became aware of just how surprisingly large bot-related bandwidth usage really is. This is an interesting topic in itself. We believe that by avoiding bad bot traffic, hosting providers can greatly improve their efficiency (hopefully using this to lower costs or improve services). Once you consider the social and business implications, you can see that the real scope of the bad bot problem goes well beyond the immediate damage done.

Igal Zeifman answered Oct 11 '22 08:10


I block 'bad bots' using PHP. I filter by IP address first, then by User-Agent. I make the 'bad bot' wait for up to 999 seconds, then return a very small web page. Usually (in my experience, always) the connection times out and zero bytes are returned. Best of all, I have delayed them for a few minutes before they get to the next victim. http://gelm.net/How-to-block-Baidu-with-PHP.htm
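A minimal sketch of this delay-then-tiny-response idea might look like the following. It is not the code from the linked page: the function name `is_blocked`, the two block lists, and their entries (a documentation-range IP prefix and the `Baiduspider` keyword) are assumptions made for illustration.

```php
<?php
// Hypothetical block lists; replace with entries taken from your own logs.
$blocked_ip_prefixes    = array('203.0.113.');   // assumed example prefix (documentation range)
$blocked_agent_keywords = array('Baiduspider');  // assumed example User-Agent keyword
$delay_seconds          = 999;                   // delay mentioned in the answer above

// Returns true when the IP starts with a blocked prefix,
// or the User-Agent contains a blocked keyword (case-insensitive).
function is_blocked(string $ip, string $agent, array $prefixes, array $keywords): bool
{
    foreach ($prefixes as $prefix) {
        if (strpos($ip, $prefix) === 0) {
            return true;
        }
    }
    foreach ($keywords as $keyword) {
        if (stripos($agent, $keyword) !== false) {
            return true;
        }
    }
    return false;
}

// Both values may be missing (for example when run from the command line).
$ip    = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (is_blocked($ip, $agent, $blocked_ip_prefixes, $blocked_agent_keywords)) {
    set_time_limit(0);          // do not let PHP's execution limit cut the wait short
    sleep($delay_seconds);      // hold the connection open; most clients give up first
    header('Content-Type: text/plain');
    echo 'Goodbye.';            // very small response for clients that waited
    exit;
}
```

One trade-off to keep in mind: every delayed request occupies a PHP/Apache worker for the whole wait, so a large number of simultaneous bot requests could exhaust the worker pool.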

Chuck Gelm answered Oct 11 '22 07:10


Unfortunately, robots.txt is sometimes ignored by these "bad bots", though if the problem is more a matter of genuine search engine spiders that you don't want to see, they ought to take it into account. I presume that with cPanel you can get to the web server (Apache) logs? In there you can look for two things: the IP and the User-Agent. You can find the culprits there and add them to your robots.txt and .htaccess. Note that .htaccess rules denying IP addresses are far better than just relying on robots.txt, because you are taking the choice out of the bot creator's hands.
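To find the culprits in the logs, a short script can total hits and bytes per IP and User-Agent. The sketch below assumes the log uses Apache's "combined" format; the function name `tally_log_lines` is hypothetical, and the simple pattern does not handle request lines that contain escaped quotes.

```php
<?php
// Tally hits and bytes per "IP | User-Agent" pair from Apache combined-format lines.
// Assumed line layout: IP ident user [time] "request" status bytes "referer" "user-agent"
function tally_log_lines(iterable $lines): array
{
    $pattern = '/^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-) "[^"]*" "([^"]*)"/';
    $totals  = array();

    foreach ($lines as $line) {
        if (!preg_match($pattern, $line, $m)) {
            continue; // skip blank or malformed lines
        }
        $key   = $m[1] . ' | ' . $m[3];
        $bytes = ($m[2] === '-') ? 0 : (int) $m[2]; // "-" means no body was sent

        if (!isset($totals[$key])) {
            $totals[$key] = array('hits' => 0, 'bytes' => 0);
        }
        $totals[$key]['hits']++;
        $totals[$key]['bytes'] += $bytes;
    }

    // Largest bandwidth consumers first.
    uasort($totals, function ($a, $b) {
        return $b['bytes'] <=> $a['bytes'];
    });

    return $totals;
}
```

It could be run over a log file with something like `tally_log_lines(new SplFileObject('/path/to/access.log'))`, where the path is a placeholder for wherever your host stores the Apache access log. The entries at the top of the result are the first candidates for .htaccess deny rules.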

If you know specific bots which are doing this you should be able to get IP addresses and user-agents from forums, but if it's a more general thing then really I'm afraid it's more of a manual job.

There are other methods that can be used with varying effect, such as mod_security (http://www.askapache.com/htaccess/modsecurity-htaccess-tricks.html), but this requires access to your web server configuration.

Finally, you can check the links that are pointing to your web site (using the link: option on google). Sometimes if you have links on spammy forums or the like this can increase the chances of bots coming to get you. Maybe you can look at the referer URL in the apache logs - but this is all based on a lot of presumptions and you'd probably be lucky if it had a great effect.

LeonardChallis answered Oct 11 '22 09:10


Block unwanted robot/spider visitors via PHP

Instructions:

Place the following PHP Code in the beginning of your index.php file.

The idea here is to place the code in the main site's PHP home page, the main entry point of the site.

If you have other PHP files that are accessed directly via a URL (as opposed to support files loaded through PHP include or require), place the code at the beginning of those files as well. For most PHP sites and PHP CMS sites, the root's index.php file is the main entry point of the site.

Keep in mind that your site statistics, i.e. AWStats, will still log the hits under Unknown robot (identified by 'bot' followed by a space or one of the following characters _+:,.;/-), but these bots will be blocked from accessing your site's content.

<?php
// ---------------------------------------------------------------------------------------------------------------

// Banned IP Addresses and Bots - Redirects banned visitors who make it past the .htaccess and/or robots.txt files to a URL.
// The $banned_ip_addresses array can contain both full and partial IP addresses, i.e. Full = 123.456.789.101, Partial = 123.456.789. or 123.456. or 123.
// Use a partial IP address to include all IP addresses that begin with it. A partial IP address must end with a period.
// The $banned_bots, $banned_unknown_bots, and $good_bots arrays should contain keyword strings found within the User Agent string.
// The $banned_unknown_bots array is used to identify unknown robots (identified by 'bot' followed by a space or one of the following characters _+:,.;/\-).
// The $good_bots array contains keyword strings used as exemptions when checking for $banned_unknown_bots. If you do not want to use the $good_bots array, e.g.
// $good_bots = array(), then you must remove the keyword strings 'bot.','bot/','bot-' from the $banned_unknown_bots array or else the good bots will also be banned.
   $banned_ip_addresses = array('41.','64.79.100.23','5.254.97.75','148.251.236.167','88.180.102.124','62.210.172.77','45.','195.206.253.146');
   $banned_bots = array('.ru','AhrefsBot','crawl','crawler','DotBot','linkdex','majestic','meanpath','PageAnalyzer','robot','rogerbot','semalt','SeznamBot','spider');
   $banned_unknown_bots = array('bot ','bot_','bot+','bot:','bot,','bot;','bot\\','bot.','bot/','bot-');
   $good_bots = array('Google','MSN','bing','Slurp','Yahoo','DuckDuck');
   $banned_redirect_url = 'http://english-1329329990.spampoison.com';

// Visitor's IP address and Browser (User Agent)
   $ip_address = $_SERVER['REMOTE_ADDR'];
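   // Some bots send no User-Agent header, so fall back to an empty string.
   $browser = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';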

// Declared Temporary Variables
   $ipfound = $piece = $botfound = $gbotfound = $ubotfound = '';

// Checks for Banned IP Addresses and Bots
   if($banned_redirect_url != ''){
     // Checks for Banned IP Address
        if(!empty($banned_ip_addresses)){
          if(in_array($ip_address, $banned_ip_addresses)){$ipfound = 'found';}
          if($ipfound != 'found'){
            $ip_pieces = explode('.', $ip_address);
            foreach ($ip_pieces as $value){
              $piece = $piece.$value.'.';
              if(in_array($piece, $banned_ip_addresses)){$ipfound = 'found'; break;}
            }
          }
          if($ipfound == 'found'){header("location: $banned_redirect_url"); exit();}
        }

     // Checks for Banned Bots
        if(!empty($banned_bots)){
          foreach ($banned_bots as $bbvalue){
            $pos1 = stripos($browser, $bbvalue);
            if($pos1 !== false){$botfound = 'found'; break;}
          }
          if($botfound == 'found'){header("location: $banned_redirect_url"); exit();}
        }

     // Checks for Banned Unknown Bots
        if(!empty($good_bots)){
          foreach ($good_bots as $gbvalue){
            $pos2 = stripos($browser, $gbvalue);
            if($pos2 !== false){$gbotfound = 'found'; break;}
          }
        }
        if($gbotfound != 'found'){
          if(!empty($banned_unknown_bots)){
            foreach ($banned_unknown_bots as $bubvalue){
              $pos3 = stripos($browser, $bubvalue);
              if($pos3 !== false){$ubotfound = 'found'; break;}
            }
            if($ubotfound == 'found'){header("location: $banned_redirect_url"); exit();}
          }
        }
   }

// ---------------------------------------------------------------------------------------------------------------
?>
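Before deploying lists like these, it can help to check how sample visitors would be classified. The sketch below repeats the same checks in the same order as the script above, but returns a label instead of redirecting. The function name `classify_visitor` and its return labels are hypothetical and not part of the original script.

```php
<?php
// Hypothetical helper: reports why a visitor would be banned, or 'allowed'.
function classify_visitor(string $ip, string $agent, array $banned_ips,
                          array $banned_bots, array $unknown_bots, array $good_bots): string
{
    // Full IP match first, then each dotted prefix such as "41." or "41.2."
    if (in_array($ip, $banned_ips, true)) {
        return 'banned_ip';
    }
    $prefix = '';
    foreach (explode('.', $ip) as $octet) {
        $prefix .= $octet . '.';
        if (in_array($prefix, $banned_ips, true)) {
            return 'banned_ip';
        }
    }

    // Known bad bots are banned even if a good-bot keyword is also present.
    foreach ($banned_bots as $keyword) {
        if (stripos($agent, $keyword) !== false) {
            return 'banned_bot';
        }
    }

    // Good-bot keywords exempt the visitor from the generic "bot" patterns.
    foreach ($good_bots as $keyword) {
        if (stripos($agent, $keyword) !== false) {
            return 'allowed';
        }
    }

    foreach ($unknown_bots as $keyword) {
        if (stripos($agent, $keyword) !== false) {
            return 'banned_unknown_bot';
        }
    }

    return 'allowed';
}
```

For example, calling it with the four arrays from the script and a User-Agent copied from your logs shows which rule, if any, would fire. Because the good-bot exemption is a plain keyword match, any client that puts "Google" in its User-Agent string is exempted from the generic patterns as well.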

Sammy answered Oct 11 '22 08:10