
How to bypass cloudflare bot/ddos protection in Scrapy?

I used to scrape an e-commerce site occasionally to get product price information. I had not used the scraper, which is built with Scrapy, in a while, and when I tried to use it yesterday I ran into a problem with bot protection.

The site uses CloudFlare's DDoS protection, which relies on JavaScript evaluation to filter out browsers (and therefore scrapers) with JS disabled. Once the function is evaluated, a response containing the calculated number is sent back. In return, the service sends two authentication cookies which, when attached to each request, allow the site to be crawled normally. Here's the description of how it works.
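To make that last step concrete, here is a minimal sketch of what attaching the cookies to each request amounts to, using only the Python standard library. The cookie names and values are placeholders assumed for illustration (they are not named above), and `build_headers` is a hypothetical helper, not part of Scrapy or any other library.

```python
def build_headers(cookies, user_agent):
    """Return HTTP headers carrying the clearance cookies and a fixed User-Agent."""
    # A Cookie request header is a "; "-separated list of name=value pairs.
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return {"Cookie": cookie_header, "User-Agent": user_agent}


# Placeholder cookie names and values, assumed for illustration only.
headers = build_headers(
    {"cookie_a": "abc123", "cookie_b": "def456"},
    "Mozilla/5.0 (example)",
)
print(headers["Cookie"])
```

The User-Agent is included because, as far as I understand it, the cookies are only honoured when later requests use the same user agent that passed the check.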

I have also found the cloudflare-scrape Python module, which uses an external JS evaluation engine to calculate the number and send the request back to the server. I'm not sure how to integrate it into Scrapy, though. Or maybe there's a smarter way that doesn't require JS execution? In the end, it's a form...

I'd appreciate any help.

Kulbi asked Oct 20 '15 22:10




2 Answers

So I executed the JavaScript from Python with the help of cloudflare-scrape.

To your scraper, you need to add the following code:

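    import cfscrape
    from scrapy import Request

    def start_requests(self):  # method of your spider class
        for url in self.start_urls:
            # The user agent argument is optional.
            token, agent = cfscrape.get_tokens(url, 'Your preferable user agent, _optional_')
            yield Request(url=url, cookies=token, headers={'User-Agent': agent})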

alongside parsing functions. And that's it!

Of course, you need to install cloudflare-scrape first and import it into your spider. You also need a JS execution engine installed. I already had Node.js, and it worked without complaints.
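If you are not sure whether a JS engine is available, a quick check before starting the spider can save some confusion. This is only a sketch: it assumes Node.js is the engine in use and that it is looked up on your PATH, and `find_node` is a hypothetical helper name.

```python
import shutil


def find_node():
    """Return the path of a Node.js binary on PATH, or None if there is none."""
    # Some distributions have shipped the binary as "nodejs" instead of "node".
    for name in ("node", "nodejs"):
        path = shutil.which(name)
        if path:
            return path
    return None


node_path = find_node()
if node_path is None:
    print("No Node.js binary found on PATH; install it before running the spider.")
else:
    print(f"Using Node.js at {node_path}")
```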

Kulbi answered Sep 19 '22 23:09


Obviously, the best way to do this would be to whitelist your IP in CloudFlare; if that isn't an option, let me recommend the cloudflare-scrape library. You can use it to get the cookie token, then provide this cookie token in your Scrapy requests back to the server.
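A rough sketch of that idea: fetch the cookie token once, then reuse it for every request. It assumes `cfscrape.get_tokens` returns a cookie dict together with the user agent it used; `fetch_tokens` and `request_kwargs` are hypothetical helper names, and the URL and values in the demonstration are placeholders.

```python
def fetch_tokens(url, user_agent=None):
    """Solve the challenge once; returns (cookies_dict, user_agent_string)."""
    # Third-party module, imported lazily so request_kwargs works without it.
    import cfscrape
    return cfscrape.get_tokens(url, user_agent=user_agent)


def request_kwargs(url, tokens, user_agent):
    """Keyword arguments for scrapy.Request that reuse the cookies and User-Agent."""
    return {
        "url": url,
        "cookies": dict(tokens),
        "headers": {"User-Agent": user_agent},
    }


# Offline demonstration with placeholder values (no network access needed).
example = request_kwargs(
    "http://example.com/page",
    {"cookie_name": "cookie_value"},
    "Mozilla/5.0 (example)",
)
print(example)
```

In a spider you would call `fetch_tokens(self.start_urls[0])` once in `start_requests` and then yield `scrapy.Request(**request_kwargs(url, tokens, agent))` for each URL, rather than solving the challenge again for every page.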

mjsa answered Sep 18 '22 23:09