Scrapy: 503 Error when scraping site using CloudFlare

When using Scrapy to scrape a site, I received a 503 Service Unavailable error right away (I could not even start scraping any items). After finding this thread:

How to bypass cloudflare bot/ddos protection in Scrapy?

I assumed the problem was Cloudflare, so I added the following code, which uses cfscrape (from one of the answers), to my spider:

import cfscrape
import scrapy

def start_requests(self):
    cf_requests = []
    for url in self.start_urls:
        # Solve the Cloudflare JS challenge and get the resulting cookies
        token, agent = cfscrape.get_tokens(url, USER_AGENT)
        cf_requests.append(scrapy.Request(
            url=url,
            cookies={'__cfduid': token['__cfduid']},
            headers={'User-Agent': agent}))
        print("useragent in cfrequest: ", agent)
        print("token in cfrequest: ", token)
    return cf_requests

Looking at the output, this workaround does seem to execute the JavaScript challenge that Cloudflare uses for DDoS protection, but I still get a 503 error afterwards. Here is the debug output:

2015-11-04 23:07:12 [scrapy] INFO: Scrapy 1.0.3 started (bot: forumscrape)
2015-11-04 23:07:12 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-11-04 23:07:12 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'forumscrape.spiders', 'CONCURRENT_REQUESTS_PER_DOMAIN': 1, 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['forumscrape.spiders'], 'CONCURRENT_REQUESTS_PER_IP': 1, 'BOT_NAME': 'forumscrape', 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 1}
2015-11-04 23:07:12 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-04 23:07:13 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-04 23:07:13 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-04 23:07:13 [scrapy] INFO: Enabled item pipelines:
2015-11-04 23:07:13 [requests.packages.urllib3.connectionpool] INFO: Starting new HTTP connection (1): sampleforum.com
2015-11-04 23:07:13 [requests.packages.urllib3.connectionpool] DEBUG: "GET /forumdisplay.php?29-Chat HTTP/1.1" 503 None
2015-11-04 23:07:18 [requests.packages.urllib3.connectionpool] INFO: Starting new HTTP connection (1): sampleforum.com
2015-11-04 23:07:18 [requests.packages.urllib3.connectionpool] DEBUG: "GET /cdn-cgi/l/chk_jschl?jschl_answer=-8011&jschl_vc=6b1abd999393b114b8eea35ff2be9e55&pass=1446696428.397-J92apQZ8k3 HTTP/1.1" 302 165
2015-11-04 23:07:18 [requests.packages.urllib3.connectionpool] DEBUG: "GET /forumdisplay.php?29-Chat HTTP/1.1" 200 21403
useragent in cfrequest:  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0
token in cfrequest:  {'cf_clearance': '037ab6c531be7e6fa6c3d0a98c988f57d17fd781-1446696429-604800', '__cfduid': 'd2635be16360da698f9dd07e4929690ed1446696424'}
2015-11-04 23:07:18 [scrapy] INFO: Spider opened
2015-11-04 23:07:18 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-04 23:07:18 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-04 23:07:18 [scrapy] DEBUG: Retrying <GET http://sampleforum.com/forumdisplay.php?29-Chat> (failed 1 times): 503 Service Unavailable
2015-11-04 23:07:20 [scrapy] DEBUG: Retrying <GET http://sampleforum.com/forumdisplay.php?29-Chat> (failed 2 times): 503 Service Unavailable
2015-11-04 23:07:21 [scrapy] DEBUG: Gave up retrying <GET http://sampleforum.com/forumdisplay.php?29-Chat> (failed 3 times): 503 Service Unavailable
2015-11-04 23:07:21 [scrapy] DEBUG: Crawled (503) <GET http://sampleforum.com/forumdisplay.php?29-Chat> (referer: None)
2015-11-04 23:07:21 [scrapy] DEBUG: Ignoring response <503 http://sampleforum.com/forumdisplay.php?29-Chat>: HTTP status code is not handled or not allowed
2015-11-04 23:07:21 [scrapy] INFO: Closing spider (finished)
2015-11-04 23:07:21 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 828,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 14276,
 'downloader/response_count': 3,
 'downloader/response_status_count/503': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 11, 5, 4, 7, 21, 363000),
 'log_count/DEBUG': 9,
 'log_count/INFO': 9,
 'response_received_count': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 11, 5, 4, 7, 18, 755000)}
2015-11-04 23:07:21 [scrapy] INFO: Spider closed (finished)

The site loads fine in my browser (with the same user agent). Other sites that I'm scraping in a similar way (just picking up some text) work. Is there another reason I'm getting a 503? Any help would be appreciated.

I believe this line:

DEBUG: "GET /cdn-cgi/l/chk_jschl?jschl_answer=23986&jschl_vc=2e88b65c8bcf26f39b980de3d5b198ea&pass=1446698472.387-F9OS39Peei HTTP/1.1" 302 165

shows that the Cloudflare JavaScript check is passing, so perhaps something besides the challenge itself is causing the 503?
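One detail worth noting from the settings dump above: COOKIES_ENABLED is False, which disables Scrapy's CookiesMiddleware, so the cookies= argument on each Request is silently dropped. The code is also forwarding only __cfduid, while the cfscrape token dict in the output contains a cf_clearance cookie as well. A minimal sketch of carrying the full token set (the helper name is hypothetical):

```python
def cf_request_kwargs(tokens, agent):
    """Build scrapy.Request keyword args that forward the FULL Cloudflare
    token set (cf_clearance as well as __cfduid), not just __cfduid."""
    return {
        'cookies': dict(tokens),           # both cookies returned by cfscrape
        'headers': {'User-Agent': agent},  # must match the UA that solved the challenge
    }

# In the spider, with COOKIES_ENABLED=True so the cookies are actually sent:
#   tokens, agent = cfscrape.get_tokens(url, USER_AGENT)
#   yield scrapy.Request(url, **cf_request_kwargs(tokens, agent))
```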

asked Nov 05 '15 by ddnm

1 Answer

You can try using Splash to get past Cloudflare.
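A minimal sketch of what that could look like, assuming a Splash instance running locally on its default port 8050 and using its render.html HTTP endpoint; the challenge JavaScript then runs in Splash's headless browser before Scrapy sees the page. (The scrapy-splash plugin wraps this more cleanly with SplashRequest.)

```python
from urllib.parse import urlencode

SPLASH_URL = 'http://localhost:8050/render.html'  # assumed local Splash instance

def splash_render_url(target_url, wait=5):
    """Wrap a target URL in a Splash render.html call, waiting a few
    seconds so the Cloudflare JS challenge can finish executing."""
    return SPLASH_URL + '?' + urlencode({'url': target_url, 'wait': wait})

# In the spider, request the wrapped URL instead of the original:
#   yield scrapy.Request(splash_render_url(url), callback=self.parse)
```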

answered Oct 19 '22 by Verz1Lka