How to ban the 360Spider crawler with robots.txt or .htaccess?

I've got a problem because of 360Spider: this bot makes too many requests per second to my VPS and slows it down (CPU usage climbs to 10-70%, while it is usually 1-2%). I looked into the httpd logs and saw lines like these:

182.118.25.209 - - [06/Sep/2012:19:39:08 +0300] "GET /slovar/znachenie-slova/42957-polovity.html HTTP/1.1" 200 96809 "http://www.hrinchenko.com/slovar/znachenie-slova/42957-polovity.html" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.0.11) Gecko/20070312 Firefox/1.5.0.11; 360Spider"
182.118.25.208 - - [06/Sep/2012:19:39:08 +0300] "GET /slovar/znachenie-slova/52614-rospryskaty.html HTTP/1.1" 200 100239 "http://www.hrinchenko.com/slovar/znachenie-slova/52614-rospryskaty.html" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.0.11) Gecko/20070312 Firefox/1.5.0.11; 360Spider"

etc.

How can I block this spider completely via robots.txt? My robots.txt currently looks like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

User-agent: YoudaoBot
Disallow: /

User-agent: sogou spider
Disallow: /

I've added these lines:

User-agent: 360Spider
Disallow: /

but that does not seem to work. How can I block this angry bot?

If you suggest blocking it via .htaccess, note that it currently looks like this:

# Turn on URL rewriting
RewriteEngine On

# Installation directory
RewriteBase /

SetEnvIfNoCase Referer ^360Spider$ block_them
Deny from env=block_them

# Protect hidden files from being viewed
<Files .*>
    Order Deny,Allow
    Deny From All
</Files>

# Protect application and system files from being viewed
RewriteRule ^(?:application|modules|system)\b.* index.php/$0 [L]

# Allow any files or directories that exist to be displayed directly
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d

# Rewrite all other URLs to index.php/URL
RewriteRule .* index.php/$0 [PT]

And, despite the presence of

SetEnvIfNoCase Referer ^360Spider$ block_them
Deny from env=block_them

this bot still tries to kill my VPS and still shows up in the access logs.

asked Sep 06 '12 17:09 by kovpack



2 Answers

In your .htaccess file simply add the following:

RewriteCond %{REMOTE_ADDR} ^(182\.118\.2)

RewriteRule ^.*$ http://182.118.25.209/take_a_hike_moron [R=301,L]

This will catch ALL the bots being launched from the 182.118.2xx.xxx range and send them back to themselves...

The crappy 360 bot is being fired from servers in China... so as long as you don't mind saying bye bye to crappy Chinese traffic from that IP range, this is guaranteed to keep those puppies from reaching any files on your web site.

The following two lines in your .htaccess file will also pick it off, simply because it is stupid enough to proudly put 360Spider in its user agent string. This could be handy for when they use other IP ranges than 182.118.2xx.xxx:

RewriteCond %{HTTP_USER_AGENT} .*(360Spider) [NC]

RewriteRule ^.*$ http://182.118.25.209/take_a_hike_moron [R=301,L]
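
If you'd rather not redirect at all, a combined variant of the two rules above might look like this (a sketch; the [F] flag answers with 403 Forbidden and implies [L]):

RewriteCond %{REMOTE_ADDR} ^182\.118\.2 [OR]
RewriteCond %{HTTP_USER_AGENT} 360Spider [NC]
RewriteRule .* - [F]

Incidentally, the SetEnvIfNoCase line in the question tests the Referer header against ^360Spider$, while the bot announces itself in the User-Agent header, which is why that rule never matched.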

And yes... I hate them too!

answered Nov 04 '22 20:11 by Sloth

Your robots.txt seems right. Some bots just ignore it (malicious bots crawl from any IP address, from botnets of hundreds to millions of infected devices all around the globe); in that case you can limit the number of requests per second using the mod_security module for Apache 2.x.

Config example here: http://blog.cherouvim.com/simple-dos-protection-with-mod_security/
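
For instance, a per-IP rate limit along the lines of that post might look like this (a rough sketch, assuming ModSecurity 2.x; the SecDataDir path and the 25-requests-per-3-seconds threshold are placeholders to adapt):

# Persistent storage for collections (path is a placeholder)
SecDataDir /var/cache/modsecurity
SecRuleEngine On

# Start a per-client-IP collection
SecAction "phase:1,initcol:ip=%{REMOTE_ADDR},pass,nolog,id:1000"

# Deny clients whose request counter has passed the threshold
SecRule IP:REQUESTS "@gt 25" "phase:1,deny,status:403,nolog,id:1001"

# Count this request; decay the counter by 25 every 3 seconds
SecAction "phase:1,setvar:ip.requests=+1,deprecatevar:ip.requests=25/3,pass,nolog,id:1002"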

[EDIT] On Linux, iptables also allows restricting tcp:port connections per (x) second(s) per IP, provided conntrack capabilities are enabled in your kernel. See: https://serverfault.com/questions/378357/iptables-dos-limit-for-all-ports
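
For example, using the recent match (a sketch; the 20-connections-per-10-seconds threshold is arbitrary, and --hitcount is capped at 20 by default):

# Record every new connection to port 80 per source IP
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --set --name HTTP
# Drop sources that opened 20+ new connections within the last 10 seconds
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --update --seconds 10 --hitcount 20 --name HTTP -j DROP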

answered Nov 04 '22 19:11 by NotGaeL