 

Prevent site data from being crawled and ripped


I'm looking into building a content site with possibly thousands of different entries, accessible by index and by search.

What measures can I take to prevent malicious crawlers from ripping off all the data from my site? I'm less worried about SEO, although I wouldn't want to block legitimate crawlers altogether.

For example, I thought about randomly changing small bits of the HTML structure used to display my data, but I guess it wouldn't really be effective.
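(As a toy illustration of that idea, assuming a Python backend; the class-name scheme below is made up, and matching CSS would have to be generated per render as well:)

    # Toy sketch of randomizing markup per render, so scrapers can't
    # rely on fixed CSS selectors. Class names are hypothetical.
    import secrets

    def render_entry(title, body):
        cls = "c" + secrets.token_hex(4)  # fresh class names every render
        return (f'<div class="{cls}">'
                f'<h2 class="{cls}-t">{title}</h2>'
                f'<p class="{cls}-b">{body}</p></div>')

    print(render_entry("Example entry", "Entry text..."))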

asked Oct 07 '08 by yoavf


People also ask

Which file can prevent web spidering of a website?

The robots.txt file. You can use a robots.txt file for web pages (HTML, PDF, or other non-media formats that Google can read) to manage crawling traffic if you think your server will be overwhelmed by requests from Google's crawler, or to avoid crawling unimportant or similar pages on your site.
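For illustration, a minimal robots.txt might look like the following (the paths are hypothetical). Note that it is purely advisory: the malicious rippers the question asks about will simply ignore it.

    # Ask polite crawlers to skip private areas and throttle themselves.
    User-agent: *
    Disallow: /private/
    Crawl-delay: 10

    # Explicitly allow a known legitimate crawler everywhere else.
    User-agent: Googlebot
    Allow: /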


2 Answers

Any site that is visible to human eyes is, in theory, potentially rippable. If you're going to even try to be accessible, then this, by definition, must be the case (how else would speaking browsers deliver your content if it weren't machine-readable?).

Your best bet is to look into watermarking your content, so that at least if it does get ripped you can point to the watermarks and claim ownership.
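As a rough sketch of one way to do that, the snippet below hides an owner ID in served text using zero-width characters; the encoding and function names are made up for illustration, and it assumes such characters survive your HTML pipeline:

    # Hide an owner ID in article text using zero-width characters.
    # Hypothetical scheme: '\u200c' encodes a 1 bit, '\u200b' a 0 bit.
    ZW0, ZW1 = "\u200b", "\u200c"

    def watermark(text, owner_id):
        bits = format(owner_id, "016b")
        mark = "".join(ZW1 if b == "1" else ZW0 for b in bits)
        head, _, tail = text.partition(" ")
        return head + mark + " " + tail  # marker hidden after the first word

    def extract(text):
        bits = "".join("1" if c == ZW1 else "0" for c in text
                       if c in (ZW0, ZW1))
        return int(bits, 2) if bits else None

    stamped = watermark("Some article paragraph worth protecting.", 42)
    assert extract(stamped) == 42

If a ripped copy turns up elsewhere, extracting the ID ties it back to the copy you served.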

answered Oct 12 '22 by Unsliced


Between this:

What are the measures I can take to prevent malicious crawlers from ripping

and this:

I wouldn't want to block legitimate crawlers altogether.

you're asking for a lot. The fact is, if you try to block malicious scrapers, you're going to end up blocking all the "good" crawlers too.

You have to remember that if people want to scrape your content, they're going to put in far more manual effort than a search engine bot will... So get your priorities right. You have two choices:

  1. Let the peasants of the internet steal your content. Keep an eye out for it (search Google for some of your more unique phrases) and send take-down requests to ISPs. This choice has barely any impact on you apart from the time it takes.
  2. Use AJAX and rolling encryption to request all your content from the server. You'll need to keep the method changing, or even randomize it, so each page load carries a different encryption scheme (a rough sketch follows this list). But even this will be cracked if somebody wants to crack it, and you'll also drop off the face of the search engines and therefore take a hit in traffic from real users.
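A stdlib-only sketch of what option 2 might look like server-side; everything here (the XOR obfuscation, the field names) is a hypothetical illustration rather than a real security scheme, since the key ships with the payload and any client that runs your JavaScript recovers the plaintext:

    # Rolling per-request obfuscation: a fresh key on every page load,
    # so a scraper can't hard-code one decoding step. Purely illustrative.
    import base64
    import json
    import os

    def xor_bytes(data, key):
        # Repeat the key across the data; trivially reversible client-side.
        return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

    def build_payload(article):
        key = os.urandom(16)  # new key per page load
        blob = xor_bytes(article.encode("utf-8"), key)
        return json.dumps({"key": base64.b64encode(key).decode(),
                           "data": base64.b64encode(blob).decode()})

    def decode_payload(payload):
        # What the client-side JavaScript would have to do.
        p = json.loads(payload)
        key = base64.b64decode(p["key"])
        return xor_bytes(base64.b64decode(p["data"]), key).decode("utf-8")

    assert decode_payload(build_payload("Protected entry.")) == "Protected entry."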
answered Oct 12 '22 by Oli