Under the heading "Avoiding getting banned", the Scrapy documentation advises:
if possible, use Google cache to fetch pages, instead of hitting the sites directly
It refers to http://www.googleguide.com/cached_pages.html, which was last updated in 2011.
I'm attempting to do that to scrape a website that requires captchas I cannot get around. However, Google then presents the same problem.
I keep the spider on the Google cache version of each link with this downloader middleware:
class GoogleCacheMiddleware(object):
    def process_request(self, request, spider):
        # Rewrite outgoing requests to their Google cache equivalents,
        # skipping URLs that already point at the cache.
        if getattr(spider, 'use_google_cache', False) and 'googleusercontent' not in request.url:
            new_url = 'https://webcache.googleusercontent.com/search?q=cache:' + request.url
            return request.replace(url=new_url)
        # Implicitly returning None lets non-matching requests proceed unchanged;
        # returning the original request here would make Scrapy reschedule it indefinitely.
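For the middleware to take effect it also has to be registered in the project settings. A minimal sketch, assuming the class lives in a hypothetical myproject/middlewares.py (adjust the module path to wherever it is actually defined):

# settings.py -- 'myproject.middlewares' is a placeholder module path
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.GoogleCacheMiddleware': 543,
}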
In the spider itself, I crawl politely with settings such as:
custom_settings = {
    'AUTOTHROTTLE_ENABLED': True,
    'CONCURRENT_REQUESTS': 2,  # or 1
    'DOWNLOAD_DELAY': 8,       # increased this to as much as 10
}
I've also tried using Selenium on both the original site and the Google cached version. This sometimes succeeds in crawling for a few minutes and returning data, but eventually lands at https://support.google.com/websearch/answer/86640, which says Google has detected "unusual traffic" from your computer network and requires a captcha to proceed.
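A minimal sketch of that Selenium approach (the target URL is a placeholder; it assumes the selenium package and a matching ChromeDriver are installed):

import time
from selenium import webdriver

driver = webdriver.Chrome()
try:
    # Fetch the cached copy rather than the site itself.
    driver.get('https://webcache.googleusercontent.com/search?q=cache:http://example.com/')
    time.sleep(8)  # crude politeness delay, mirroring DOWNLOAD_DELAY above
    html = driver.page_source
finally:
    driver.quit()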
It appears the Scrapy documentation is simply in conflict with Google's terms of use; is that correct? Either way, is there a recommended way to get around captchas, or to scrape the Google cache of a site in spite of this limitation?
UPDATE, 7-9-18:
When this spider runs several times over a week or more, it eventually yields complete or fuller results, evidently because the initially scraped URLs change on each crawl and succeed before the captcha kicks in. I'm still interested if anyone knows a solution consistent with the documentation, or a specific workaround.
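One way to make those repeated runs cumulative rather than starting from scratch is Scrapy's built-in crawl persistence: pointing JOBDIR at a directory persists the scheduler queue and the duplicate filter between runs, so each run resumes where the last one stopped. A minimal sketch ('crawls/myspider' is just an example path):

custom_settings = {
    'JOBDIR': 'crawls/myspider',  # persists pending requests and seen-URL state across runs
}

The same can be set per run from the command line with scrapy crawl myspider -s JOBDIR=crawls/myspider.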
I am not well versed with Scrapy, but it seems the website must be blocking the cache view. Have you tried checking whether a cached copy exists with the lookup tool at https://www.seoweather.com/google-cache-search/ ?
You can get around the Google blocking, though, by using proxies, preferably back-connect (rotating) proxies, as you'll need a lot of them when scraping Google; one way to wire a proxy pool into Scrapy is sketched below.
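A minimal sketch using Scrapy's standard per-request request.meta['proxy'] mechanism (the PROXIES list is a placeholder; a back-connect provider typically gives you a single rotating endpoint instead):

import random

# Placeholder pool -- replace with real proxy endpoints.
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours meta['proxy'].
        request.meta['proxy'] = random.choice(PROXIES)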
Another option might be to scrape the https://archive.org/web/ version of a page. They even have an API you might be able to use, https://archive.org/help/wayback_api.php, as in the sketch below.
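For example, the Wayback Machine's availability endpoint returns the closest archived snapshot of a URL as JSON; a minimal sketch using the requests library ('http://example.com/' is a placeholder):

import requests

resp = requests.get('https://archive.org/wayback/available',
                    params={'url': 'http://example.com/'})
snapshot = resp.json().get('archived_snapshots', {}).get('closest')
if snapshot and snapshot.get('available'):
    print(snapshot['url'])  # an archived copy you could feed to the spider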