
How to find out my site is being scraped?

I have a few indicators in mind:

  1. Network bandwidth consumption that causes throughput problems (matches if a proxy is used).
  2. When querying a search engine for keywords, new references appear to other, similar sites with the same content (matches if a proxy is used).
  3. Multiple requests from the same IP.
  4. A high request rate from a single IP. (By the way: what is a normal rate?)
  5. A headless or unusual user agent (matches if a proxy is used).
  6. Requests at predictable (equal) intervals from the same IP.
  7. Certain support files are never requested, e.g. favicon.ico and various CSS and JavaScript files (matches if a proxy is used).
  8. The client's request sequence, e.g. the client accesses pages that are not directly accessible (matches if a proxy is used).
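Points 3, 4, 6 and 7 can be checked mechanically once the access log is parsed. Below is a minimal Python sketch, not a definitive detector. It assumes the log has already been reduced to `(ip, unix_timestamp, path)` tuples; the function names, the `ASSET_SUFFIXES` list and the thresholds (`min_requests=20`, `max_gap_cv=0.1`) are hypothetical and would need tuning against real traffic, since no "normal" rate is established here.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Assumed list of "support file" extensions; adjust to the site.
ASSET_SUFFIXES = (".css", ".js", ".ico", ".png", ".jpg", ".gif")


def summarize_clients(requests):
    """Group (ip, unix_timestamp, path) tuples by IP and compute simple stats."""
    by_ip = defaultdict(list)
    for ip, ts, path in requests:
        by_ip[ip].append((ts, path))

    summary = {}
    for ip, hits in by_ip.items():
        hits.sort()  # order by timestamp
        times = [ts for ts, _ in hits]
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        avg_gap = mean(gaps) if gaps else None

        # Coefficient of variation of the gaps: values near 0 mean the
        # requests arrive at almost perfectly equal intervals (point 6).
        if len(gaps) >= 2 and avg_gap > 0:
            gap_cv = pstdev(gaps) / avg_gap
        else:
            gap_cv = None

        # Count requests for support files, ignoring any query string (point 7).
        asset_requests = sum(
            1 for _, path in hits
            if path.split("?")[0].lower().endswith(ASSET_SUFFIXES)
        )

        summary[ip] = {
            "requests": len(hits),          # points 3 and 4
            "avg_gap_seconds": avg_gap,
            "gap_cv": gap_cv,
            "asset_requests": asset_requests,
        }
    return summary


def looks_automated(stats, min_requests=20, max_gap_cv=0.1):
    """Flag a client that is busy AND either very regular or never loads assets."""
    regular = stats["gap_cv"] is not None and stats["gap_cv"] <= max_gap_cv
    no_assets = stats["asset_requests"] == 0
    return stats["requests"] >= min_requests and (regular or no_assets)
```

Because everything here is keyed on the IP address, a scraper that rotates proxies spreads its requests over many keys and will mostly slip past this kind of check. That is consistent with points 3, 4 and 6 not being marked as proxy-resistant in the list.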

Would you add more to this list?

Which of these points would still match if a scraper uses proxies?

Igor Savinkin asked Oct 19 '22 07:10


1 Answer

As a first note: consider whether it is worthwhile to provide an API for bots in the future. If another company is crawling you, and it is information you want to provide to them anyway, that makes your website valuable to them. An API would likely reduce your server load substantially and give you much clearer visibility into who is pulling your data.

Second, from personal experience (I wrote web crawlers for quite a while), you can generally tell right away by tracking which browser (user agent) accessed your website. If the client is one of the automated tools, or the default HTTP client of a programming language, its user agent will be noticeably different from your average user's. You can also watch the log file and update your .htaccess to ban them (if that's what you are looking to do).
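As a rough illustration of the user-agent check, here is a small Python sketch. The function name and the marker list are assumptions for the example: the markers are substrings that commonly appear in the default user agents of well-known tools and HTTP libraries, and the list is certainly not complete.

```python
# Substrings (lowercase) seen in default user agents of common tools
# and libraries. Illustrative, not exhaustive.
AUTOMATION_MARKERS = (
    "python-requests",
    "scrapy",
    "curl/",
    "wget/",
    "go-http-client",
    "java/",
    "libwww-perl",
    "headlesschrome",
    "phantomjs",
)


def classify_user_agent(user_agent):
    """Return 'missing', 'automation' or 'unflagged' for a User-Agent value."""
    if user_agent is None or not user_agent.strip():
        # Real browsers always send a User-Agent header.
        return "missing"
    ua = user_agent.lower()
    if any(marker in ua for marker in AUTOMATION_MARKERS):
        return "automation"
    # Not flagged does not mean human: the header is trivial to spoof.
    return "unflagged"
```

This only catches clients that do not bother to change their default header, so it works best combined with behavioral signals such as request timing.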

Other than that, it's usually fairly easy to spot: repeated, very consistent opening of pages.

Check out this other post for more on how you might want to deal with them, and for some thoughts on how to identify them:

How to block bad unidentified bots crawling my website?

Sh4d0wsPlyr answered Nov 15 '22 09:11