Scrapy: 503 Error when scraping site using CloudFlare

When using Scrapy to scrape a site, I received a 503 Service Unavailable error right away (I could not even start scraping any items). After finding this thread:

How to bypass cloudflare bot/ddos protection in Scrapy?

I assumed the problem was Cloudflare, so I added the following code, which uses cfscrape (from one of the answers), to my spider:

import cfscrape
import scrapy

def start_requests(self):
    cf_requests = []
    for url in self.start_urls:
        # Solve the Cloudflare JS challenge and get the resulting cookies
        token, agent = cfscrape.get_tokens(url, USER_AGENT)
        cf_requests.append(scrapy.Request(
            url=url,
            cookies={'__cfduid': token['__cfduid']},
            headers={'User-Agent': agent}))
        print("useragent in cfrequest: ", agent)
        print("token in cfrequest: ", token)
    return cf_requests

Looking at the output, this workaround does seem to execute the JavaScript challenge that Cloudflare uses for DDoS protection, but I still get a 503 error afterwards. Here is the debug output:

2015-11-04 23:07:12 [scrapy] INFO: Scrapy 1.0.3 started (bot: forumscrape)
2015-11-04 23:07:12 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-11-04 23:07:12 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'forumscrape.spiders', 'CONCURRENT_REQUESTS_PER_DOMAIN': 1, 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['forumscrape.spiders'], 'CONCURRENT_REQUESTS_PER_IP': 1, 'BOT_NAME': 'forumscrape', 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 1}
2015-11-04 23:07:12 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-04 23:07:13 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-04 23:07:13 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-04 23:07:13 [scrapy] INFO: Enabled item pipelines:
2015-11-04 23:07:13 [requests.packages.urllib3.connectionpool] INFO: Starting new HTTP connection (1): sampleforum.com
2015-11-04 23:07:13 [requests.packages.urllib3.connectionpool] DEBUG: "GET /forumdisplay.php?29-Chat HTTP/1.1" 503 None
2015-11-04 23:07:18 [requests.packages.urllib3.connectionpool] INFO: Starting new HTTP connection (1): sampleforum.com
2015-11-04 23:07:18 [requests.packages.urllib3.connectionpool] DEBUG: "GET /cdn-cgi/l/chk_jschl?jschl_answer=-8011&jschl_vc=6b1abd999393b114b8eea35ff2be9e55&pass=1446696428.397-J92apQZ8k3 HTTP/1.1" 302 165
2015-11-04 23:07:18 [requests.packages.urllib3.connectionpool] DEBUG: "GET /forumdisplay.php?29-Chat HTTP/1.1" 200 21403
useragent in cfrequest:  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0
token in cfrequest:  {'cf_clearance': '037ab6c531be7e6fa6c3d0a98c988f57d17fd781-1446696429-604800', '__cfduid': 'd2635be16360da698f9dd07e4929690ed1446696424'}
2015-11-04 23:07:18 [scrapy] INFO: Spider opened
2015-11-04 23:07:18 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-04 23:07:18 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-04 23:07:18 [scrapy] DEBUG: Retrying <GET http://sampleforum.com/forumdisplay.php?29-Chat> (failed 1 times): 503 Service Unavailable
2015-11-04 23:07:20 [scrapy] DEBUG: Retrying <GET http://sampleforum.com/forumdisplay.php?29-Chat> (failed 2 times): 503 Service Unavailable
2015-11-04 23:07:21 [scrapy] DEBUG: Gave up retrying <GET http://sampleforum.com/forumdisplay.php?29-Chat> (failed 3 times): 503 Service Unavailable
2015-11-04 23:07:21 [scrapy] DEBUG: Crawled (503) <GET http://sampleforum.com/forumdisplay.php?29-Chat> (referer: None)
2015-11-04 23:07:21 [scrapy] DEBUG: Ignoring response <503 http://sampleforum.com/forumdisplay.php?29-Chat>: HTTP status code is not handled or not allowed
2015-11-04 23:07:21 [scrapy] INFO: Closing spider (finished)
2015-11-04 23:07:21 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 828,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 14276,
 'downloader/response_count': 3,
 'downloader/response_status_count/503': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 11, 5, 4, 7, 21, 363000),
 'log_count/DEBUG': 9,
 'log_count/INFO': 9,
 'response_received_count': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 11, 5, 4, 7, 18, 755000)}
2015-11-04 23:07:21 [scrapy] INFO: Spider closed (finished)

The site loads fine in my browser (with the same user agent). Other sites that I'm scraping in a similar way (just picking up some text) work. Is there another reason I'm getting a 503? Any help would be appreciated.

I believe this line:

DEBUG: "GET /cdn-cgi/l/chk_jschl?jschl_answer=23986&jschl_vc=2e88b65c8bcf26f39b980de3d5b198ea&pass=1446698472.387-F9OS39Peei HTTP/1.1" 302 165

shows that the Cloudflare JavaScript check is passing, so perhaps something besides the challenge itself is causing the 503?
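One detail worth noting from the settings dump above: COOKIES_ENABLED is False, which disables Scrapy's CookiesMiddleware, so the cookies= argument on each Request is silently dropped. The code is also forwarding only __cfduid, while the cfscrape token dict in the output contains a cf_clearance cookie as well. A minimal sketch of carrying the full token set (the helper name is hypothetical):

```python
def cf_request_kwargs(tokens, agent):
    """Build scrapy.Request keyword args that forward the FULL Cloudflare
    token set (cf_clearance as well as __cfduid), not just __cfduid."""
    return {
        'cookies': dict(tokens),           # both cookies returned by cfscrape
        'headers': {'User-Agent': agent},  # must match the UA that solved the challenge
    }

# In the spider, with COOKIES_ENABLED=True so the cookies are actually sent:
#   tokens, agent = cfscrape.get_tokens(url, USER_AGENT)
#   yield scrapy.Request(url, **cf_request_kwargs(tokens, agent))
```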

asked Nov 05 '15 by ddnm

1 Answer

You can try using Splash to get past Cloudflare.
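A minimal sketch of what that could look like, assuming a Splash instance running locally on its default port 8050 and using its render.html HTTP endpoint; the challenge JavaScript then runs in Splash's headless browser before Scrapy sees the page. (The scrapy-splash plugin wraps this more cleanly with SplashRequest.)

```python
from urllib.parse import urlencode

SPLASH_URL = 'http://localhost:8050/render.html'  # assumed local Splash instance

def splash_render_url(target_url, wait=5):
    """Wrap a target URL in a Splash render.html call, waiting a few
    seconds so the Cloudflare JS challenge can finish executing."""
    return SPLASH_URL + '?' + urlencode({'url': target_url, 'wait': wait})

# In the spider, request the wrapped URL instead of the original:
#   yield scrapy.Request(splash_render_url(url), callback=self.parse)
```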

answered Oct 19 '22 by Verz1Lka