Top techniques to avoid 'data scraping' from a website database

I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably, my client is very keen to prevent anyone from being able to make a copy of the data in the database, yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.

Whilst I have put everything in place to prevent attacks such as SQL injection, there is nothing to prevent anyone from viewing all the records as HTML and running some sort of script to parse this data back into another database. Even if I were to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile them into a new database, essentially pinching all the information.
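(For reference, the injection side is handled with parameterised queries along these lines - a minimal sketch, with table and column names simplified purely for illustration:)

```php
<?php
// Minimal sketch of a parameterised query with PDO. The connection
// details and the "records" table/"id" column are illustrative only.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

// User input is bound as a parameter, never concatenated into the SQL,
// so it cannot change the structure of the query.
$stmt = $pdo->prepare('SELECT * FROM records WHERE id = :id');
$stmt->execute([':id' => $_GET['id']]);
$record = $stmt->fetch(PDO::FETCH_ASSOC);
```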

Does anyone have any good tactics for preventing, or even just deterring, this that they could share?

Asked by Addsy

2 Answers

While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:

  • Rate limit by user account, IP address, user agent, etc. - restrict the amount of data a particular user or group can download in a given period of time, and shut down the account or IP address if you detect a large amount of data being transferred (a minimal per-IP throttle is sketched just after this list).

  • Require JavaScript - to ensure the client has some semblance of an interactive browser, rather than a barebones spider.

  • RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.

  • Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers - though it isn't foolproof, of course (a GD-based sketch follows this list).

  • robots.txt - deny obvious web spiders and known robot user agents:

    User-agent: *

    Disallow: /

  • Use robots meta tags. These stop conforming spiders; the following, for instance, will prevent Google from indexing and archiving the page:

    <meta name="robots" content="noindex,follow,noarchive">
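To make the rate-limiting idea concrete, here is a minimal sketch of a per-IP throttle in PHP backed by MySQL. The `request_log` table, its schema, and the thresholds are assumptions chosen for illustration, not something your site necessarily has.

```php
<?php
// Minimal per-IP rate-limit sketch. Assumes a table like:
//   CREATE TABLE request_log (ip VARCHAR(45), requested_at DATETIME);
// Table name, schema and thresholds are illustrative assumptions.
function isRateLimited(PDO $pdo, string $ip, int $maxRequests = 100, int $windowSeconds = 3600): bool
{
    // Only count requests made inside the time window.
    $cutoff = date('Y-m-d H:i:s', time() - $windowSeconds);

    $stmt = $pdo->prepare(
        'SELECT COUNT(*) FROM request_log WHERE ip = :ip AND requested_at > :cutoff'
    );
    $stmt->execute([':ip' => $ip, ':cutoff' => $cutoff]);
    if ((int) $stmt->fetchColumn() >= $maxRequests) {
        return true; // over the limit for this window
    }

    // Log the current request so future checks see it.
    $pdo->prepare('INSERT INTO request_log (ip, requested_at) VALUES (:ip, NOW())')
        ->execute([':ip' => $ip]);

    return false;
}

// Usage: refuse the request once the limit is exceeded.
if (isRateLimited($pdo, $_SERVER['REMOTE_ADDR'])) {
    http_response_code(429); // Too Many Requests
    exit('Too many requests - please slow down.');
}
```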
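And for the image-encoding idea, a rough sketch using PHP's GD extension to render a single database value as a PNG. The standalone `value-image.php` endpoint and the hard-coded value are hypothetical; in practice you would look the value up by an id passed in the query string.

```php
<?php
// value-image.php - render a text value as a PNG so it can't be
// copied by a plain HTML/text parser (requires the GD extension).
$text = 'some value pulled from the database'; // e.g. a phone number

$width  = imagefontwidth(5) * strlen($text) + 10;
$height = imagefontheight(5) + 10;

$img = imagecreatetruecolor($width, $height);
$bg  = imagecolorallocate($img, 255, 255, 255);
$fg  = imagecolorallocate($img, 0, 0, 0);

// White background, black built-in font.
imagefilledrectangle($img, 0, 0, $width, $height, $bg);
imagestring($img, 5, 5, 5, $text, $fg);

header('Content-Type: image/png');
imagepng($img);
imagedestroy($img);
```

The page would then embed the value with something like `<img src="value-image.php?id=123">` instead of printing the text directly.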

There are different levels of deterrence here, and the first option is probably the least intrusive.

Answered by jspcal

If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.

You can't have it both ways. You can make the data visible only to people with an account, and people will make accounts to slurp it. You can make the data visible only from approved IP addresses, and people will go through the steps to acquire approval before slurping it.

Yes, you can make it hard to get, but if you want it to be convenient for typical users, you need to make it convenient for malicious ones as well.

Answered by Welbog