 

Block Website Scraping by Google Docs

I run a website that provides various pieces of data in chart/tabular format for people to read. Recently I've noticed an increase in the requests to the website that originate from Google Docs. Looking at the IPs and User Agent, it does appear to be originating from Google servers - example IP lookup here.

The number of hits is in the region of 2,500 to 10,000 requests per day.

I assume that someone has created one or more Google Sheets that scrape data from my website (possibly using the IMPORTHTML function or similar). I would prefer that this did not happen (as I cannot know if the data is being attributed properly).

Is there a preferred way to block this traffic that Google supports/approves?

I would rather not block based on IP addresses, as blocking Google servers feels wrong, may lead to future problems, and the IPs could change. At the moment I am blocking (returning a 403 status) based on the User-Agent containing GoogleDocs or docs.google.com.

Traffic is mostly coming from 66.249.89.221 and 66.249.89.223 at present, always with the user agent Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; http://docs.google.com)
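
For context, the block I have in place is roughly equivalent to this (a minimal Flask sketch for illustration only, not my actual server setup; the header substrings are the ones mentioned above):

```python
# Minimal sketch of the User-Agent block described above.
# Hypothetical Flask app, not the actual server configuration.
from flask import Flask, abort, request

app = Flask(__name__)

BLOCKED_UA_SUBSTRINGS = ("GoogleDocs", "docs.google.com")

@app.before_request
def block_google_docs():
    user_agent = request.headers.get("User-Agent", "")
    # Return 403 for any request whose User-Agent mentions Google Docs.
    if any(token in user_agent for token in BLOCKED_UA_SUBSTRINGS):
        abort(403)
```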

As a secondary question: Is there a way to trace the document or its account owner? I have access to the URLs that they are accessing, but little else to go on, as the requests appear to be proxied through the Google Docs servers (no Referer, cookies or other such data in the HTTP logs).

Thank you.

asked Jan 24 '17 by Peter



1 Answer

Blocking on User-Agent is a great solution because there doesn't appear to be a way to set a different User-Agent and still use the IMPORTHTML function -- and since you're happy to ban 'all' usage from Docs/Sheets, that's perfect.

Some additional thoughts, though, if a full-on ban seems unpleasant:

  1. Rate limit it: as you say, the traffic is mostly coming from two IPs and always with the same user agent, so just slow down your response. As long as the requests are serial, you can still provide the data, but at a pace that may be enough to discourage scraping. Delay your responses to suspected scrapers by 20 or 30 seconds (see the sketch after this list).

  2. Redirect to a "You're blocked" screen, or to a screen with "default" data (i.e., scrapable, but not with the current data). That's better than a basic 403 because it tells a human that the site is not meant for scraping, and you can then direct them to purchasing access (or at least to requesting a key from you). The sketch after this list shows one way to combine this with the delay.
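
Here's a rough sketch of both ideas together (a hypothetical Flask app for illustration; the route name, delay value and messages are assumptions, not a drop-in implementation):

```python
# Sketch of options 1 and 2 together: throttle suspected Docs/Sheets scrapers
# and send them to a placeholder page instead of a bare 403.
import time

from flask import Flask, redirect, request

app = Flask(__name__)

SCRAPER_UA_SUBSTRINGS = ("GoogleDocs", "docs.google.com")
DELAY_SECONDS = 20  # slow enough to discourage serial IMPORTHTML refreshes


def looks_like_docs_scraper() -> bool:
    user_agent = request.headers.get("User-Agent", "")
    return any(token in user_agent for token in SCRAPER_UA_SUBSTRINGS)


@app.before_request
def throttle_or_redirect():
    # Don't intercept the placeholder page itself, or we'd redirect forever.
    if request.path == "/scraping-not-allowed":
        return None
    if looks_like_docs_scraper():
        # Option 1: delay the response so scraping crawls along.
        time.sleep(DELAY_SECONDS)
        # Option 2: send the requester to an explanation page (which could
        # also serve stale "default" data) rather than a bare 403.
        return redirect("/scraping-not-allowed", code=302)
    return None


@app.route("/scraping-not-allowed")
def scraping_not_allowed():
    return (
        "Automated scraping of this data is not permitted. "
        "Contact the site owner for an API key or licensed access.",
        200,
    )
```

Note that time.sleep ties up a worker for the whole delay, so in production you would more likely enforce the throttling at the reverse proxy or a rate limiter in front of the application.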

answered Nov 20 '22 by pbuck