 

Prevent site data from being crawled and ripped


I'm looking into building a content site with possibly thousands of different entries, accessible by index and by search.

What measures can I take to prevent malicious crawlers from ripping off all the data from my site? I'm less worried about SEO, although I wouldn't want to block legitimate crawlers altogether.

For example, I thought about randomly changing small bits of the HTML structure used to display my data, but I guess it wouldn't really be effective.
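(As a toy illustration of that idea, assuming a Python backend; the class-name scheme below is made up, and matching CSS would have to be generated per render as well:)

    # Toy sketch of randomizing markup per render, so scrapers can't
    # rely on fixed CSS selectors. Class names are hypothetical.
    import secrets

    def render_entry(title, body):
        cls = "c" + secrets.token_hex(4)  # fresh class names every render
        return (f'<div class="{cls}">'
                f'<h2 class="{cls}-t">{title}</h2>'
                f'<p class="{cls}-b">{body}</p></div>')

    print(render_entry("Example entry", "Entry text..."))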

asked Oct 07 '08 by yoavf


People also ask

Which file can prevent web spidering of a website?

The robots.txt file. You can use a robots.txt file for web pages (HTML, PDF, or other non-media formats that Google can read) to manage crawling traffic if you think your server will be overwhelmed by requests from Google's crawler, or to avoid crawling unimportant or similar pages on your site.
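For illustration, a minimal robots.txt might look like the following (the paths are hypothetical). Note that it is purely advisory: the malicious rippers the question asks about will simply ignore it.

    # Ask polite crawlers to skip private areas and throttle themselves.
    User-agent: *
    Disallow: /private/
    Crawl-delay: 10

    # Explicitly allow a known legitimate crawler everywhere else.
    User-agent: Googlebot
    Allow: /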


2 Answers

Any site that is visible to human eyes is, in theory, potentially rippable. If you're going to even try to be accessible, then this, by definition, must be the case (how else would speaking browsers deliver your content if it weren't machine-readable?).

Your best bet is to look into watermarking your content, so that at least if it does get ripped you can point to the watermarks and claim ownership.
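As a rough sketch of one way to do that, the snippet below hides an owner ID in served text using zero-width characters; the encoding and function names are made up for illustration, and it assumes such characters survive your HTML pipeline:

    # Hide an owner ID in article text using zero-width characters.
    # Hypothetical scheme: '\u200c' encodes a 1 bit, '\u200b' a 0 bit.
    ZW0, ZW1 = "\u200b", "\u200c"

    def watermark(text, owner_id):
        bits = format(owner_id, "016b")
        mark = "".join(ZW1 if b == "1" else ZW0 for b in bits)
        head, _, tail = text.partition(" ")
        return head + mark + " " + tail  # marker hidden after the first word

    def extract(text):
        bits = "".join("1" if c == ZW1 else "0" for c in text
                       if c in (ZW0, ZW1))
        return int(bits, 2) if bits else None

    stamped = watermark("Some article paragraph worth protecting.", 42)
    assert extract(stamped) == 42

If a ripped copy turns up elsewhere, extracting the ID ties it back to the copy you served.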

answered Oct 12 '22 by Unsliced


Between this:

What are the measures I can take to prevent malicious crawlers from ripping

and this:

I wouldn't want to block legitimate crawlers altogether.

you're asking for a lot. The fact is, if you try to block malicious scrapers, you're going to end up blocking all the "good" crawlers too.

You have to remember that if people want to scrape your content, they're going to put in far more manual effort than a search engine bot will... So get your priorities right. You have two choices:

  1. Let the peasants of the internet steal your content. Keep an eye out for it (search Google for some of your more unique phrases) and send take-down requests to ISPs. This choice has barely any impact on you apart from the time it takes.
  2. Use AJAX and rolling encryption to request all your content from the server. You'll need to keep the method changing, or even randomize it, so each page load carries a different encryption scheme (a rough sketch follows this list). But even this will be cracked if somebody wants to crack it, and you'll also drop off the face of the search engines and therefore take a hit in traffic from real users.
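A stdlib-only sketch of what option 2 might look like server-side; everything here (the XOR obfuscation, the field names) is a hypothetical illustration rather than a real security scheme, since the key ships with the payload and any client that runs your JavaScript recovers the plaintext:

    # Rolling per-request obfuscation: a fresh key on every page load,
    # so a scraper can't hard-code one decoding step. Purely illustrative.
    import base64
    import json
    import os

    def xor_bytes(data, key):
        # Repeat the key across the data; trivially reversible client-side.
        return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

    def build_payload(article):
        key = os.urandom(16)  # new key per page load
        blob = xor_bytes(article.encode("utf-8"), key)
        return json.dumps({"key": base64.b64encode(key).decode(),
                           "data": base64.b64encode(blob).decode()})

    def decode_payload(payload):
        # What the client-side JavaScript would have to do.
        p = json.loads(payload)
        key = base64.b64decode(p["key"])
        return xor_bytes(base64.b64decode(p["data"]), key).decode("utf-8")

    assert decode_payload(build_payload("Protected entry.")) == "Protected entry."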
answered Oct 12 '22 by Oli