
Protecting website content from crawlers

The content of a commerce website (ASP.NET MVC) is regularly crawled by competitors. They are programmers and use sophisticated methods to crawl the site, so identifying them by IP address is not possible. Unfortunately, replacing values with images is not an option because the site must remain readable by screen readers (JAWS).

My own idea is to use robots.txt: prohibit crawlers from accessing one common URL on the page. The URL could be disguised as a normal item detail link but hidden from normal users. Valid URL: http://example.com?itemId=1234. Prohibited: http://example.com?itemId=123 (IDs under 128). If a client at some IP address requests the prohibited link, show it a CAPTCHA. A normal user would never follow such a link because it is not visible, and Google does not need to crawl it because it is bogus. The problem is that a screen reader still reads the link, and I doubt the approach would be effective enough to be worth implementing.
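A minimal sketch of the trap-link check described above, written in Python for brevity rather than ASP.NET MVC. The names `TRAP_ID_LIMIT`, `flagged_ips`, `is_trap_url`, and `handle_request` are hypothetical, and the in-memory set stands in for whatever storage the real site would use.

```python
from urllib.parse import urlparse, parse_qs

# Assumption: item IDs below this value are never linked visibly to real users.
TRAP_ID_LIMIT = 128

# IP addresses that requested a trap link and must now pass a CAPTCHA.
flagged_ips = set()


def is_trap_url(url: str) -> bool:
    """Return True if the URL carries an itemId reserved as a trap."""
    values = parse_qs(urlparse(url).query).get("itemId", [])
    if not values:
        return False
    try:
        return int(values[0]) < TRAP_ID_LIMIT
    except ValueError:
        # A non-numeric itemId is not a trap; normal validation handles it.
        return False


def handle_request(ip: str, url: str) -> str:
    """Decide what to serve for this request: 'captcha' or 'content'."""
    if is_trap_url(url):
        flagged_ips.add(ip)
    if ip in flagged_ips:
        return "captcha"
    return "content"
```

Once an address has requested a trap URL, every later request from it gets the CAPTCHA until the flag is cleared. Because the flag is keyed by IP address, a crawler that changes its address for each request is never challenged.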

Germstorm asked Nov 13 '22 20:11



1 Answer

Your idea might work against a few basic crawlers, but it would be very easy to work around. They would just need to use proxies and issue a GET for each link from a new IP address.

If you allow anonymous access to your website, you can never fully protect your data. Even if you manage to block crawlers after a lot of time and effort, they could simply have a human browse the site and capture the content with something like Fiddler. The best way to keep your data from your competitors is not to put it on a public part of your website.

Forcing users to log in might help: at least then you could see which accounts are crawling your site and ban them.
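As a rough illustration of spotting a crawling account once logins are required, here is a sliding-window request counter in Python. The class name `AccountMonitor` and both thresholds are hypothetical and would need tuning against real traffic. It is only a sketch of one possible heuristic, not a complete defence.

```python
from collections import defaultdict, deque

# Hypothetical thresholds; tune them against real traffic.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 100


class AccountMonitor:
    """Tracks request timestamps per account and bans heavy requesters."""

    def __init__(self):
        self._hits = defaultdict(deque)  # account -> recent request times
        self.banned = set()

    def record(self, account: str, now: float) -> bool:
        """Record one request; return True if the account may proceed."""
        if account in self.banned:
            return False
        hits = self._hits[account]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and hits[0] <= now - WINDOW_SECONDS:
            hits.popleft()
        if len(hits) > MAX_REQUESTS_PER_WINDOW:
            self.banned.add(account)
            return False
        return True
```

In a real application `now` would come from something like `time.monotonic()`, and the ban would be stored with the user record rather than in memory. A patient crawler can stay under any fixed rate, so this only raises the cost of crawling.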

Tom Squires answered Dec 18 '22 08:12
