I am currently part of a team developing an application which includes a front-end client.
Through this client we send user data; each user has a user ID, and the client talks to our server through a RESTful API, asking the server for data.
For example, let's say we have a database of books, and the user can get the last 3 books an author wrote. We value our users' time and we would like users to be able to start using the product without explicit registration.
We value our database: we use our own proprietary software to populate it and would like to protect it as much as we can.
So basically the question is:
What can we do to protect ourselves from web scraping?
I would very much like to learn about some techniques to protect our data; in particular, we would like to prevent users from typing every single author name into the author search panel and fetching the top three books each author wrote.
Any suggested reading would be appreciated.
I'd just like to mention that we're aware of captchas and would like to avoid them as much as possible.
If you send repetitive requests from the same IP, the website owners can detect your footprint and may block your web scrapers by checking the server log files. To avoid this, you can use rotating proxies. A rotating proxy is a proxy server that allocates a new IP address from a set of proxies stored in the proxy pool.
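To make the scraper-side view above concrete, here is a minimal sketch of round-robin proxy rotation. The proxy addresses are placeholders, not real servers, and the actual HTTP call is left commented out; the point is only the rotation logic that makes consecutive requests leave from different IPs.

```python
import itertools

# Hypothetical pool of proxy addresses (placeholders, not real servers).
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

# itertools.cycle hands out proxies round-robin, forever.
_proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxy() -> dict:
    """Return a proxies mapping in the shape requests expects."""
    proxy = next(_proxy_cycle)
    return {"http": proxy, "https": proxy}

# Usage (network call omitted in this sketch):
# requests.get("https://example.com/books", proxies=next_proxy())
```

A real rotating-proxy service would also drop proxies that get blocked and add fresh ones to the pool.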
Google does not allow it. If you scrape at a rate higher than 8 (updated from 15) keyword requests per hour you risk detection; higher than 10/h (updated from 20) will, in my experience, get you blocked.
Good news for archivists, academics, researchers and journalists: Scraping publicly accessible data is legal, according to a U.S. appeals court ruling.
The main strategies for preventing this are:
Note that you can use captchas very flexibly.
For example: first book for each IP every day is non-captcha protected. But in order to access a second book, a captcha needs to be solved.
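The per-IP, per-day rule above can be sketched as a small counter. This is an illustrative in-memory version with hypothetical names (`needs_captcha`, `FREE_BOOKS_PER_DAY`); a production service would keep the counts in a shared store such as Redis with a daily expiry instead of a process-local dict.

```python
from collections import defaultdict
from datetime import date
from typing import Optional

# Hypothetical in-memory counter keyed by (ip, day).
_requests_today = defaultdict(int)
FREE_BOOKS_PER_DAY = 1  # first book per IP per day is captcha-free

def needs_captcha(ip: str, today: Optional[date] = None) -> bool:
    """Record one book request from `ip` and report whether a
    captcha must be solved before serving it."""
    key = (ip, today or date.today())
    _requests_today[key] += 1
    return _requests_today[key] > FREE_BOOKS_PER_DAY
```

With this rule, the first request from an IP each day passes without a captcha, and every further request that day requires one.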
Since you found that many of the items listed by Anony-Mousse don't solve your problem, I wanted to come in and suggest an alternative. Have you explored third-party platforms that offer web scraping protection as a service? I'm going to list some of the solutions available on the market and try to group them. For full disclosure, I am one of the co-founders of Distil Networks, one of the companies that I am listing.
Web Scraping protection as a core competency:
Web Scraping protection as a feature in a larger product suite:
My opinion is that companies that try to solve the bot problem as a feature don't do it well. It's just not their core competency, and many loopholes exist.
It might also be helpful to talk about some of the pitfalls of the points mentioned: