How can I configure my site to allow crawling by well-known robots such as Google, Bing, Yahoo, and Alexa, while stopping harmful spammers and other robots? Should I block particular IP addresses? Please discuss any pros and cons. Is there anything to be done in web.config or IIS? Can I do it server-wide if I have a VPS with root access? Thanks.

how to allow known web crawlers and block spammers and harmful robots from scanning asp.net website

2 Answers

I'd recommend that you take a look at the answer I posted to a similar question: How to identify web-crawler?

Robots.txt
The robots.txt file is useful for polite bots, but spammers are generally not polite, so they tend to ignore it. It's still worth having, since it helps the polite bots. However, be careful not to disallow the wrong path, as that can stop the good bots from crawling content that you actually want them to crawl.
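
One way to guard against disallowing the wrong path is to test the rules before deploying them. The following is a minimal sketch in Python using the standard `urllib.robotparser` module; the robots.txt content, the `/bot-trap/` directory, and the sample paths are assumptions made up for illustration, not part of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /bot-trap/ is an assumed off-limits directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bot-trap/
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# Assumed sample paths: the first two should stay crawlable,
# the last one sits inside the off-limits directory.
for path in ["/", "/products/widget", "/bot-trap/index.html"]:
    print(path, is_allowed(ROBOTS_TXT, "Googlebot", path))
```

Running a check like this against the list of pages you want indexed makes an accidental `Disallow` visible before a polite crawler ever sees it.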

User-Agent
Blocking by user-agent is not foolproof either, because spammers often impersonate browsers and other popular user agents (such as the Google bots). As a matter of fact, spoofing the user agent is one of the easiest things that a spammer can do.

Bot Traps
This is probably the best way to protect yourself from bots that are not polite and that don't correctly identify themselves with the User-Agent. There are at least two types of traps:

The robots.txt trap (which only works if the bot reads the robots.txt): dedicate an off-limits directory in the robots.txt and set up your server to block the IP address of any entity which tries to visit that directory.
Create "hidden" links in your web pages that also lead to the forbidden directory. Any bot that crawls those links AND doesn't abide by your robots.txt will step into the trap and get its IP blocked.

A hidden link is one which is not visible to a person, such as an anchor tag with no text: <a href="http://www.mysite.com/path/to/bot/trap"></a>. Alternatively, you can put text in the anchor tag but make the font really small and set the text color to match the background color, so that humans can't see the link. On its own, the hidden link trap catches any non-human visitor, including well-behaved crawlers, so I'd recommend combining it with the robots.txt trap so that you only catch bad bots.
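
The trap logic itself is small. Here is a language-neutral sketch in Python (not ASP.NET or IIS configuration) of the idea: any client that requests a path under the forbidden directory has its IP added to a block list, and blocked IPs are refused from then on. The `BotTrap` class, the `/bot-trap/` prefix, and the in-memory set are all hypothetical; a real deployment would more likely persist the list and enforce it at the server or firewall level.

```python
# Hypothetical trap directory; it must match the Disallow rule in robots.txt.
TRAP_PREFIX = "/bot-trap/"


class BotTrap:
    """Tracks IPs that have stepped into the trap directory."""

    def __init__(self, trap_prefix: str = TRAP_PREFIX) -> None:
        self.trap_prefix = trap_prefix
        self.blocked_ips = set()

    def handle(self, ip: str, path: str) -> int:
        """Return the HTTP status code to send for this request."""
        if ip in self.blocked_ips:
            return 403  # already caught earlier
        if path.startswith(self.trap_prefix):
            self.blocked_ips.add(ip)  # stepped into the trap just now
            return 403
        return 200  # normal request


trap = BotTrap()
print(trap.handle("203.0.113.7", "/index.html"))     # normal page
print(trap.handle("203.0.113.7", "/bot-trap/page"))  # follows the hidden link
print(trap.handle("203.0.113.7", "/index.html"))     # now refused
```

Because polite crawlers never request the disallowed directory, only clients that ignore robots.txt end up in `blocked_ips`.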

Verifying Bots
The above steps will probably help you get rid of 99.9% of the spammers, but there might be a handful of bad bots that impersonate a popular bot (such as Googlebot) AND abide by your robots.txt. Those bots can eat up the number of requests you've allocated for Googlebot and may cause you to temporarily disallow Google from crawling your website. In that case you have one more option: verify the identity of the bot. Most major crawlers (the ones you'd want to be crawled by) provide a way to identify their bots. Here is Google's recommendation for verifying their bot: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html

Any bot that impersonates another major bot and fails verification can be blocked by IP. That should probably get you closer to preventing 99.99% of the bad bots from crawling your site.
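
The verification Google describes is a reverse DNS lookup on the requesting IP followed by a forward lookup on the returned host name. A minimal Python sketch of that check follows. The function name `verify_crawler`, the injectable lookup parameters (there so the logic can be exercised without network access), and the exact suffix list are my assumptions; check the crawler's current documentation for the domains it actually uses. Note that `socket.gethostbyname_ex` only resolves IPv4 addresses.

```python
import socket

# Assumed host name suffixes for Googlebot; confirm against Google's docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def verify_crawler(ip, allowed_suffixes=GOOGLE_SUFFIXES,
                   reverse_lookup=None, forward_lookup=None):
    """Return True only if ip reverse-resolves to an allowed host name
    and that host name resolves forward to the same ip."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
        if not host.endswith(allowed_suffixes):
            return False  # claims to be the bot, but the wrong domain
        return ip in forward_lookup(host)  # forward-confirm the name
    except OSError:
        return False  # lookup failed; treat as unverified
```

A request whose User-Agent claims to be Googlebot but for which `verify_crawler(ip)` returns False is a candidate for the IP block described above. DNS lookups are slow, so in practice you would probably cache the result per IP.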

answered Sep 28 '22 07:09

Kiril

Blocking by IP can be useful, but the method I use is blocking by user-agent. That way you can trap many different IPs using apps that you don't want, especially site grabbers. I won't provide our list, as you need to concentrate on those that affect you. For our use, we have identified more than 130 applications that are neither web browsers nor search engines and that we don't want accessing our website. You can start with a web search on user-agents for site grabbing.
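
As a rough illustration of user-agent blocking, here is a Python sketch of a case-insensitive substring match against a block list. The three tokens are only illustrative examples of well-known download tools, not the list mentioned above, and `is_blocked_agent` is a hypothetical helper name. Keep in mind that the User-Agent header is trivially spoofed, so this only stops tools that identify themselves honestly.

```python
# Illustrative tokens only; build your own list from your server logs.
BLOCKED_AGENT_TOKENS = ("httrack", "wget", "webcopier")


def is_blocked_agent(user_agent, tokens=BLOCKED_AGENT_TOKENS):
    """Return True if the User-Agent header contains a blocked token."""
    ua = (user_agent or "").lower()  # a missing header is treated as empty
    return any(token in ua for token in tokens)


print(is_blocked_agent("Mozilla/5.0 (compatible; HTTrack 3.0x; Windows 98)"))
print(is_blocked_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```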

answered Sep 28 '22 08:09

RogerB

Tags: asp.net, block, web-crawler
