 

Block Website Scraping by Google Docs

I run a website that provides various pieces of data in chart/tabular format for people to read. Recently I've noticed an increase in the requests to the website that originate from Google Docs. Looking at the IPs and User Agent, it does appear to be originating from Google servers - example IP lookup here.

The number of hits is in the region of 2,500 to 10,000 requests per day.

I assume that someone has created one or more Google Sheets that scrape data from my website (possibly using the IMPORTHTML function or similar). I would prefer that this did not happen (as I cannot know if the data is being attributed properly).

Is there a preferred way to block this traffic that Google supports/approves?

I would rather not block based on IP addresses, as blocking Google servers feels wrong, may lead to future problems, and the IPs could change. At the moment I am blocking (returning a 403 status) based on the User-Agent containing GoogleDocs or docs.google.com.

Traffic is mostly coming from 66.249.89.221 and 66.249.89.223 at present, always with the user agent Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; http://docs.google.com)
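
For context, the block I have in place is roughly equivalent to this (a minimal Flask sketch for illustration only, not my actual server setup; the header substrings are the ones mentioned above):

```python
# Minimal sketch of the User-Agent block described above.
# Hypothetical Flask app, not the actual server configuration.
from flask import Flask, abort, request

app = Flask(__name__)

BLOCKED_UA_SUBSTRINGS = ("GoogleDocs", "docs.google.com")

@app.before_request
def block_google_docs():
    user_agent = request.headers.get("User-Agent", "")
    # Return 403 for any request whose User-Agent mentions Google Docs.
    if any(token in user_agent for token in BLOCKED_UA_SUBSTRINGS):
        abort(403)
```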

As a secondary question: Is there a way to trace the document or its account owner? I have access to the URLs that they are accessing, but little else to go on, as the requests appear to be proxied through the Google Docs servers (no Referer, cookies or other such data in the HTTP logs).

Thank you.

asked Jan 24 '17 by Peter



1 Answer

Blocking on User-Agent is a great solution because there doesn't appear to be a way to set a different User-Agent and still use the IMPORTHTML function -- and since you're happy to ban 'all' usage from Docs/Sheets, that's perfect.

Some additional thoughts, though, if a full-on ban seems unpleasant:

  1. Rate limit it: as you say, the traffic is mostly coming from two IPs and always with the same user agent, so just slow down your response. As long as the requests are serial, you can still provide the data, but at a pace that may be enough to discourage scraping. Delay your responses to suspected scrapers by 20 or 30 seconds (see the sketch after this list).

  2. Redirect to a "You're blocked" screen, or to a screen with "default" data (i.e., scrapable, but not with the current data). That's better than a basic 403 because it tells a human that the site is not meant for scraping, and you can then direct them to purchasing access (or at least to requesting a key from you). The sketch after this list shows one way to combine this with the delay.
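
Here's a rough sketch of both ideas together (a hypothetical Flask app for illustration; the route name, delay value and messages are assumptions, not a drop-in implementation):

```python
# Sketch of options 1 and 2 together: throttle suspected Docs/Sheets scrapers
# and send them to a placeholder page instead of a bare 403.
import time

from flask import Flask, redirect, request

app = Flask(__name__)

SCRAPER_UA_SUBSTRINGS = ("GoogleDocs", "docs.google.com")
DELAY_SECONDS = 20  # slow enough to discourage serial IMPORTHTML refreshes


def looks_like_docs_scraper() -> bool:
    user_agent = request.headers.get("User-Agent", "")
    return any(token in user_agent for token in SCRAPER_UA_SUBSTRINGS)


@app.before_request
def throttle_or_redirect():
    # Don't intercept the placeholder page itself, or we'd redirect forever.
    if request.path == "/scraping-not-allowed":
        return None
    if looks_like_docs_scraper():
        # Option 1: delay the response so scraping crawls along.
        time.sleep(DELAY_SECONDS)
        # Option 2: send the requester to an explanation page (which could
        # also serve stale "default" data) rather than a bare 403.
        return redirect("/scraping-not-allowed", code=302)
    return None


@app.route("/scraping-not-allowed")
def scraping_not_allowed():
    return (
        "Automated scraping of this data is not permitted. "
        "Contact the site owner for an API key or licensed access.",
        200,
    )
```

Note that time.sleep ties up a worker for the whole delay, so in production you would more likely enforce the throttling at the reverse proxy or a rate limiter in front of the application.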

answered Nov 20 '22 by pbuck