Using Scrapy on a Google cache of a website

Tags:

scrapy

Under the heading "Avoiding getting banned", the Scrapy documentation advises:

if possible, use Google cache to fetch pages, instead of hitting the sites directly

It refers to http://www.googleguide.com/cached_pages.html, which was last updated in 2011.

I'm attempting to do that to scrape a website that requires captchas I cannot get around. However, Google then creates the same problem.

I keep the spider on the Google cache version of each link with this downloader middleware:

class GoogleCacheMiddleware(object):
    def process_request(self, request, spider):
        # Rewrite any request that is not already a cache URL to Google's cached copy.
        if spider.use_google_cache and 'googleusercontent' not in request.url:
            new_url = 'https://webcache.googleusercontent.com/search?q=cache:' + request.url
            # Returning a new Request makes Scrapy reschedule it in place of the original.
            return request.replace(url=new_url)

In the spider itself, I crawl politely with settings such as:

custom_settings = {
    'AUTOTHROTTLE_ENABLED': True,  # the Scrapy setting name ends in "ENABLED"
    'CONCURRENT_REQUESTS': 2,  # or 1
    'DOWNLOAD_DELAY': 8,  # increased this to as much as 10
}

I've also tried using Selenium on both the original site and the Google cached version of the site. This sometimes succeeds in crawling for a few minutes and returning data, but it eventually lands on https://support.google.com/websearch/answer/86640, which states that Google detects "Unusual traffic" from your computer network and requires a captcha to proceed.

It appears the Scrapy documentation simply conflicts with Google's terms of use. Am I correct? Either way, is there a recommended way to either get around captchas or scrape from a Google cache of a site in spite of this limitation?

UPDATE, 7-9-18:

When this spider runs several times over a week or more, it eventually yields complete or fuller results, evidently because the initially scraped URLs change on each crawl and succeed before the captcha kicks in. Still interested if anyone knows a solution consistent with the documentation or a specific workaround.

asked May 27 '18 19:05

NFB

1 Answer

I am not well versed in Scrapy, but it seems the website must be blocking the cache view. Have you tried checking the cache with https://www.seoweather.com/google-cache-search/?

You can get around Google's blocking by using proxies, preferably back-connect proxies, since you'll need a lot of them when scraping Google.
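
To illustrate the proxy suggestion: Scrapy's built-in HttpProxyMiddleware sends a request through whatever proxy is set in request.meta['proxy'], so a small downloader middleware can assign a proxy to each request. The following is only a minimal sketch under assumptions: the class name RotatingProxyMiddleware and the PROXY_LIST setting are hypothetical names invented for this example, not part of Scrapy or of the question's code.

```python
import itertools


class RotatingProxyMiddleware(object):
    """Assign proxies to outgoing requests in round-robin order.

    Hypothetical sketch: PROXY_LIST is an assumed custom setting,
    not a built-in Scrapy setting.
    """

    def __init__(self, proxies):
        if not proxies:
            raise ValueError('at least one proxy URL is required')
        # cycle() repeats the list forever, giving simple round-robin rotation.
        self._proxies = itertools.cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the middleware from the project settings.
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        # The request is downloaded through the proxy named in meta['proxy'].
        request.meta['proxy'] = next(self._proxies)
        # Returning None lets the request continue through the remaining middlewares.
        return None
```

If used, such a class would be registered in DOWNLOADER_MIDDLEWARES, typically with a priority lower than HttpProxyMiddleware's default of 750 so that it runs first. With a back-connect proxy service, the list may contain a single gateway URL that rotates exit addresses on its own.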

Another option might be to scrape the https://archive.org/web/ version of a page. They even have an API you might be able to use: https://archive.org/help/wayback_api.php
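
As a rough sketch of the Wayback Machine route: the availability API at https://archive.org/wayback/available takes a url parameter (and an optional timestamp) and returns JSON describing the closest archived snapshot. The helper names availability_url and closest_snapshot are hypothetical, and the sample response below is an illustration of the documented response shape rather than live output.

```python
import json
from urllib.parse import urlencode

WAYBACK_AVAILABLE = 'https://archive.org/wayback/available'


def availability_url(page_url, timestamp=None):
    """Build the availability-API query URL for page_url."""
    params = {'url': page_url}
    if timestamp:
        # Optional YYYYMMDDhhmmss value (or a prefix of it) to look near a given date.
        params['timestamp'] = timestamp
    return WAYBACK_AVAILABLE + '?' + urlencode(params)


def closest_snapshot(response_text):
    """Return the closest snapshot URL from the API's JSON, or None if there is none."""
    data = json.loads(response_text)
    closest = data.get('archived_snapshots', {}).get('closest')
    if closest and closest.get('available'):
        return closest['url']
    return None


# Illustrative sample in the documented response shape (not live data).
sample = json.dumps({
    'archived_snapshots': {
        'closest': {
            'available': True,
            'url': 'http://web.archive.org/web/20130919044612/http://example.com/',
            'timestamp': '20130919044612',
            'status': '200',
        }
    }
})

print(availability_url('example.com'))
print(closest_snapshot(sample))
```

In a spider, one callback could request availability_url(target) and a second could follow the URL returned by closest_snapshot(response.text), skipping pages for which it returns None.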

answered Oct 21 '22 10:10

joker91
