 

robots.txt: disallow all but a select few, why not? [closed]

I've been thinking a while about disallowing every crawler except Ask, Google, Microsoft, and Yahoo! from my site.

The reasoning behind this is that I've never seen any traffic being generated by any of the other web-crawlers out there.

My questions are:

  1. Is there any reason not to?
  2. Has anybody done this?
  3. Did you notice any negative effects?
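For concreteness, a whitelist-style robots.txt along those lines might look like the sketch below. The user-agent tokens are my assumptions for that era (Slurp for Yahoo!, msnbot for Microsoft, Teoma for Ask; Microsoft's crawler is bingbot nowadays). A compliant crawler obeys the group that names it and ignores the rest, so everything not listed falls through to the `*` group:

```text
# Whitelist the four major crawlers; disallow everyone else.
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: Teoma
Disallow:

# All other crawlers: keep out.
User-agent: *
Disallow: /
```

An empty `Disallow:` line means "nothing is disallowed" for that group, which is the standard way to spell "allow everything".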

Update:
Up until now I have used the blacklist approach: if I don't like a crawler, I add it to the disallow list.
I'm no fan of blacklisting, however, as it is a never-ending story: there are always more crawlers out there.

I'm not so much worried about the really ugly, misbehaving crawlers; they are detected and blocked automatically (and they typically don't ask for robots.txt anyhow :)

However, many crawlers are not misbehaving in any way; they just don't seem to generate any value for me or my customers.
There are, for example, a couple of crawlers powering websites that claim they will be The Next Google; Only Better. I've never seen any traffic coming from them, and I'm quite sceptical about them becoming better than any of the four search engines mentioned above.

Update 2:
I've been analysing the traffic to several sites for some time now, and it seems that reasonably small sites get around 100 unique human visitors a day (i.e., visitors that I cannot identify as being non-human). About 52% of the generated traffic comes from automated processes.

60% of all automated visitors do not read robots.txt; the other 40% (21% of total traffic) do request robots.txt. (This group includes Ask, Google, Microsoft, and Yahoo!)

So my thinking is: if I block all the well-behaved crawlers that don't seem to generate any value for me, I could reduce bandwidth use and server load by around 12% to 17%.
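As a sanity check on those figures, here is the back-of-the-envelope arithmetic. The 4-to-9-point share for the four major engines is my own assumption, chosen only to show how the 12% to 17% estimate could follow from the stated numbers:

```python
# Rough bandwidth estimate from the traffic figures above.
total = 100.0                      # treat total traffic as 100 percentage points
automated = 0.52 * total           # ~52% of traffic is automated
reads_robots = 0.40 * automated    # ~40% of bots fetch robots.txt -> ~21% of total

# Assumption (not from the question): the four major engines account for
# roughly 4 to 9 of those ~21 points, so blocking the remaining
# well-behaved bots would save the rest.
savings_low = reads_robots - 9
savings_high = reads_robots - 4

print(round(reads_robots, 1))   # ~20.8, i.e. the ~21% quoted above
print(round(savings_low, 1), round(savings_high, 1))
```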

Jacco, asked Jan 28 '09

People also ask

How do I bypass robots.txt disallow?

If you don't want your crawler to respect robots.txt, then just write it so it doesn't. You might be using a library that respects robots.txt automatically; if so, you will have to disable that (which will usually be an option you pass to the library when you call it).
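To illustrate the flip side, this is roughly what the robots.txt check looks like when a polite crawler does it by hand with Python's stdlib `urllib.robotparser`; a crawler that skips this check simply ignores robots.txt. The rules and URLs are made-up examples:

```python
# Sketch of a polite crawler consulting robots.txt before fetching.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Normally you'd point it at a live file with rp.set_url(...) and rp.read();
# here we parse inline rules to stay self-contained.
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

print(rp.can_fetch("MyBot", "http://example.com/public/page"))   # True
print(rp.can_fetch("MyBot", "http://example.com/private/page"))  # False
```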

What should you block in a robots.txt file?

Use a robots.txt file to manage crawl traffic, and also to prevent image, video, and audio files from appearing in Google search results. This won't prevent other pages or users from linking to your image, video, or audio file.

What is the size limit of a robots.txt file?

Google currently enforces a robots.txt file size limit of 500 kibibytes (KiB). Content which is after the maximum file size is ignored.


2 Answers

The internet is a publishing mechanism. If you want to whitelist your site, you're against the grain, but that's fine.

Do you want to whitelist your site?

Bear in mind that badly behaved bots which ignore robots.txt aren't affected anyway (obviously), and well-behaved bots are probably there for a good reason; it's just that the reason is opaque to you.

annakata, answered Sep 28 '22


While other sites that crawl your site might not be sending any traffic your way, it's possible that they themselves are indexed by Google et al., and so add to your PageRank; blocking them from your site might affect this.

Sam Cogan, answered Sep 28 '22