
How to prevent unauthorized spidering

Tags:

asp.net

iis

I want to prevent automated HTML scraping from one of our sites while not affecting legitimate spidering (Googlebot, etc.). Is there something that already exists to accomplish this? Am I even using the correct terminology?

EDIT: I'm mainly looking to prevent people who would be doing this maliciously, i.e. they aren't going to abide by robots.txt.

EDIT2: What about limiting access by "rate of use", i.e. presenting a CAPTCHA to continue browsing if automation is detected and the traffic isn't from a legitimate (Google, Yahoo, MSN, etc.) IP?
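To make the "rate of use" idea concrete, here is a minimal sketch of an ASP.NET IHttpModule that counts requests per IP in the application cache and redirects to a CAPTCHA page once a threshold is exceeded. The threshold, the /captcha.aspx path, and the user-agent whitelist are all assumptions, not part of the original question, and the module would still need to be registered in web.config.

```csharp
// Hypothetical rate-limiting module: count requests per IP in the application
// cache and send suspected bots to a CAPTCHA page. Threshold and paths are
// placeholder assumptions.
using System;
using System.Web;
using System.Web.Caching;

public class RateLimitModule : IHttpModule
{
    private const int MaxRequestsPerMinute = 60;   // assumed threshold

    public void Init(HttpApplication app)
    {
        app.BeginRequest += OnBeginRequest;
    }

    private void OnBeginRequest(object sender, EventArgs e)
    {
        HttpContext context = ((HttpApplication)sender).Context;
        string ua = context.Request.UserAgent ?? string.Empty;

        // Skip known search engines. A real system should verify these by
        // reverse DNS, since the User-Agent header is trivially spoofed.
        if (ua.Contains("Googlebot") || ua.Contains("Slurp") || ua.Contains("msnbot"))
            return;

        string key = "rate_" + context.Request.UserHostAddress;
        int count = (context.Cache[key] as int?) ?? 0;

        if (count > MaxRequestsPerMinute)
        {
            // Suspected automation: require a CAPTCHA before browsing continues.
            context.Response.Redirect("/captcha.aspx", true);
            return;
        }

        // Coarse one-minute window; each write pushes the expiration forward.
        context.Cache.Insert(key, count + 1, null,
            DateTime.UtcNow.AddMinutes(1), Cache.NoSlidingExpiration);
    }

    public void Dispose() { }
}
```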

Kyle West asked Jan 16 '09


2 Answers

This is difficult, if not impossible, to accomplish. Many "rogue" spiders/crawlers do not identify themselves via the user-agent string, so it is hard to spot them. You can try to block them by IP address, but it is difficult to keep the block list up to date as new addresses appear, and blocking by IP can also shut out legitimate users, since proxies make many different clients appear as a single IP address.
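As a rough illustration of why the blocklist approach is fragile, here is a small Global.asax-style sketch that rejects requests from a hand-maintained IP set. The addresses are placeholders; the weak point is the list itself, which goes stale quickly and can catch legitimate users behind shared proxies.

```csharp
// Global.asax sketch (assumed): reject requests from a manually curated IP
// blocklist. The entries below are placeholders.
using System;
using System.Collections.Generic;
using System.Web;

public class Global : HttpApplication
{
    private static readonly HashSet<string> BlockedIps = new HashSet<string>
    {
        "203.0.113.7",
        "198.51.100.23"
    };

    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        if (BlockedIps.Contains(Request.UserHostAddress))
        {
            Response.StatusCode = 403;   // Forbidden
            Response.End();
        }
    }
}
```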

The problem with using robots.txt in this situation is that the spider can just choose to ignore it.

EDIT: Rate limiting is a possibility, but it suffers from some of the same problems around identifying (and keeping track of) "good" and "bad" user agents/IPs. In a system we wrote to do some internal page-view/session counting, we eliminate sessions based on page-view rate, and we don't worry about also eliminating "good" spiders, since we don't want them counted in the data either. We don't do anything to prevent any client from actually viewing the pages.
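A minimal sketch of that kind of filtering, assuming a hypothetical per-session summary record, might look like this: sessions whose page-view rate is implausibly high for a human reader are dropped from the statistics rather than blocked from the site.

```csharp
// Post-processing sketch (assumed data model): exclude sessions with an
// implausibly high page-view rate before computing traffic statistics.
using System;
using System.Collections.Generic;
using System.Linq;

public class SessionStats
{
    public string SessionId;
    public int PageViews;
    public TimeSpan Duration;
}

public static class SessionFilter
{
    // Anything above this rate is treated as automated (placeholder value).
    private const double MaxPagesPerMinute = 20.0;

    public static IEnumerable<SessionStats> HumanSessions(IEnumerable<SessionStats> sessions)
    {
        return sessions.Where(s =>
        {
            // Guard against zero-length sessions.
            double minutes = Math.Max(s.Duration.TotalMinutes, 1.0 / 60.0);
            return s.PageViews / minutes <= MaxPagesPerMinute;
        });
    }
}
```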

Sean Carpenter answered Oct 06 '22


One approach is to set up an HTTP tar pit: embed a link that will only be visible to automated crawlers. The link should go to a page stuffed with random text and links back into the pit under varying URLs (/tarpit/foo.html, /tarpit/bar.html, /tarpit/baz.html), with a single script at /tarpit/ handling every request and returning a 200 result.
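A minimal sketch of such a script as an ASP.NET IHttpHandler is below; it returns a 200 page of random filler and more links into the pit, and 302-redirects known search engines back to the home page as described in the next paragraph. The URL names and the user-agent check are assumptions, and the handler would still need to be mapped to the /tarpit/ path in IIS or web.config.

```csharp
// Hypothetical /tarpit/ handler: every request gets a 200 page of random text
// plus links deeper into the pit; known search engines are redirected home.
using System;
using System.Text;
using System.Web;

public class TarpitHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        string ua = context.Request.UserAgent ?? string.Empty;

        // Keep the good guys out of the pit (User-Agent check is spoofable).
        if (ua.Contains("Googlebot") || ua.Contains("Slurp"))
        {
            context.Response.Redirect("/", false);   // 302 to the home page
            return;
        }

        var html = new StringBuilder("<html><body>");
        for (int i = 0; i < 50; i++)
            html.AppendFormat("<p>{0}</p>", Guid.NewGuid());            // random filler text
        for (int i = 0; i < 20; i++)
            html.AppendFormat("<a href=\"/tarpit/{0}.html\">more</a> ",
                Guid.NewGuid().ToString("N"));                          // links back into the pit
        html.Append("</body></html>");

        context.Response.StatusCode = 200;
        context.Response.ContentType = "text/html";
        context.Response.Write(html.ToString());
    }

    public bool IsReusable { get { return true; } }
}
```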

To keep the good guys out of the pit, generate a 302 redirect to your home page if the user agent is Google or Yahoo.

It isn't perfect, but it will at least slow down the naive ones.

EDIT: As suggested by Constantin, you could mark the tar pit as off-limits in robots.txt. Good guys using web spiders that honor that protocol will stay out of the tar pit, which would probably remove the need to generate redirects for known good user agents.
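Assuming the tar pit lives under /tarpit/ as in the example above, the robots.txt entry would be along these lines:

```
User-agent: *
Disallow: /tarpit/
```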

Tim Howland answered Oct 06 '22