I am currently part of a team developing an application which includes a front-end client.
Through this client we send user data; each user has a user ID, and the client talks to our server through a RESTful API, asking the server for data.
For example, let's say we have a database of books, and the user can get the last 3 books an author wrote. We value our users' time and we would like users to be able to start using the product without explicit registration.
We value our database: we use our own proprietary software to populate it and would like to protect it as much as we can.
So basically the question is:
What can we do to protect ourselves from web scraping?
I would very much like to learn about some techniques to protect our data; in particular, we would like to prevent users from typing every single author name into the author search panel and fetching the top three books each author wrote.
Any suggested reading would be appreciated.
I'd just like to mention that we're aware of captchas and would like to avoid them as much as possible.
If you send repetitive requests from the same IP, the website owners can detect your footprint and may block your web scrapers by checking the server log files. To avoid this, you can use rotating proxies. A rotating proxy is a proxy server that allocates a new IP address from a set of proxies stored in the proxy pool.
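To make the scraper-side view above concrete, here is a minimal sketch of round-robin proxy rotation. The proxy addresses are placeholders, not real servers, and the actual HTTP call is left commented out; the point is only the rotation logic that makes consecutive requests leave from different IPs.

```python
import itertools

# Hypothetical pool of proxy addresses (placeholders, not real servers).
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

# itertools.cycle hands out proxies round-robin, forever.
_proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxy() -> dict:
    """Return a proxies mapping in the shape requests expects."""
    proxy = next(_proxy_cycle)
    return {"http": proxy, "https": proxy}

# Usage (network call omitted in this sketch):
# requests.get("https://example.com/books", proxies=next_proxy())
```

A real rotating-proxy service would also drop proxies that get blocked and add fresh ones to the pool.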
Google does not allow it. If you scrape at a rate higher than 8 (updated from 15) keyword requests per hour you risk detection; higher than 10/h (updated from 20) will, in my experience, get you blocked.
Good news for archivists, academics, researchers and journalists: Scraping publicly accessible data is legal, according to a U.S. appeals court ruling.
The main strategies for preventing this are:
Note that you can use captchas very flexibly.
For example: first book for each IP every day is non-captcha protected. But in order to access a second book, a captcha needs to be solved.
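The per-IP, per-day rule above can be sketched as a small counter. This is an illustrative in-memory version with hypothetical names (`needs_captcha`, `FREE_BOOKS_PER_DAY`); a production service would keep the counts in a shared store such as Redis with a daily expiry instead of a process-local dict.

```python
from collections import defaultdict
from datetime import date
from typing import Optional

# Hypothetical in-memory counter keyed by (ip, day).
_requests_today = defaultdict(int)
FREE_BOOKS_PER_DAY = 1  # first book per IP per day is captcha-free

def needs_captcha(ip: str, today: Optional[date] = None) -> bool:
    """Record one book request from `ip` and report whether a
    captcha must be solved before serving it."""
    key = (ip, today or date.today())
    _requests_today[key] += 1
    return _requests_today[key] > FREE_BOOKS_PER_DAY
```

With this rule, the first request from an IP each day passes without a captcha, and every further request that day requires one.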
Since you found that many of the items listed by Anony-Mousse don't solve your problem, I wanted to come in and suggest an alternative. Have you explored third-party platforms that offer web scraping protection as a service? I'm going to list some of the solutions available on the market and try to group them. For full disclosure, I am one of the co-founders of Distil Networks, one of the companies that I am listing.
Web Scraping protection as a core competency:
Web Scraping protection as a feature in a larger product suite:
My opinion is that companies that try to solve the bot problem as a feature don't do it well. It's just not their core competency, and many loopholes exist.
It might also be helpful to talk about some of the pitfalls of the points mentioned: