
How to connect to https site with Scrapy via Polipo over TOR?





Not entirely sure what the problem is here.

I'm running Python 2.7.3 and Scrapy 0.16.5.

I've created a very simple Scrapy spider to test connecting to my local Polipo proxy, so that I can send requests out via TOR. The basic code of my spider is as follows:

from scrapy.spider import BaseSpider

class TorSpider(BaseSpider):
    name = "tor"
    allowed_domains = ["check.torproject.org"]
    start_urls = ["https://check.torproject.org"]

    def parse(self, response):
        print response.body

For my proxy middleware, I've defined:

from scrapy.conf import settings  # project settings object in Scrapy 0.16

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = settings.get('HTTP_PROXY')

My HTTP_PROXY in my settings file is defined as HTTP_PROXY = 'http://localhost:8123'.
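To make the setup concrete, here is a minimal sketch of what that middleware does to each request, runnable without Scrapy installed. `StubRequest` and `proxied_meta` are hypothetical helpers that stand in for Scrapy's request object, and the proxy URL is passed to the constructor instead of being read from the settings file. Under those assumptions, it shows that http:// and https:// requests leave the middleware with the same `meta['proxy']` value.

```python
class StubRequest(object):
    """Hypothetical stand-in for a Scrapy request: just a URL and a meta dict."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


class ProxyMiddleware(object):
    """Same logic as the middleware above, with the proxy URL passed in explicitly."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Tag the request with the proxy; returning None lets processing continue.
        request.meta['proxy'] = self.proxy_url


def proxied_meta(url, proxy_url='http://localhost:8123'):
    """Run one stub request through the middleware and return its meta dict."""
    request = StubRequest(url)
    ProxyMiddleware(proxy_url).process_request(request, spider=None)
    return request.meta
```

Since the proxy setting is identical for both schemes, any difference between the two cases would have to come from a later step, when the request is actually sent to the proxy.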

Now, if I change my start URL to http://check.torproject.org, everything works fine, no problems.

If I attempt to run against https://check.torproject.org, I get a 400 Bad Request error every time (I've also tried different https:// sites, and all of them have the same problem):

2013-07-23 21:36:18+0100 [scrapy] INFO: Scrapy 0.16.5 started (bot: arachnid)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, ProxyMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-23 21:36:18+0100 [tor] INFO: Spider opened
2013-07-23 21:36:18+0100 [tor] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Telnet console listening on
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Web service listening on
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 1 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 2 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Gave up retrying <GET https://check.torproject.org> (failed 3 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Crawled (400) <GET https://check.torproject.org> (referer: None)
2013-07-23 21:36:18+0100 [tor] INFO: Closing spider (finished)

And just to double-check that nothing is wrong with my TOR/Polipo setup, I can run the following curl command in a terminal and connect fine:

curl --proxy localhost:8123 https://check.torproject.org/
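The same check can be scripted outside Scrapy. The sketch below is a standard-library approximation of the curl command, written for Python 3 (not the Python 2.7 used above). `build_proxy_opener` and `fetch_via_proxy` are hypothetical helper names, and the proxy address is the Polipo address from the question. For https:// URLs, `urllib.request` asks the HTTP proxy to open a tunnel (a CONNECT request) before starting TLS, which is also how curl handles https:// URLs through an HTTP proxy.

```python
import urllib.request

PROXY_URL = "http://localhost:8123"  # Polipo address from the question


def build_proxy_opener(proxy_url=PROXY_URL):
    """Return an opener that sends http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, proxy_url=PROXY_URL, timeout=30):
    """Fetch url through the proxy and return (status_code, body_bytes)."""
    opener = build_proxy_opener(proxy_url)
    response = opener.open(url, timeout=timeout)
    try:
        return response.getcode(), response.read()
    finally:
        response.close()
```

With Polipo and TOR running, calling `fetch_via_proxy("https://check.torproject.org/")` makes the same kind of request as the curl command. If that succeeds while the spider still gets a 400, the proxy itself is probably fine and the issue is more likely in how the request reaches it.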

Any suggestions as to what's wrong here?

Asked by Craig Sefton, Jul 23 '13 at 20:07

1 Answer


rq.meta['proxy'] = ''

In my case, it works.

Answered by Verz1Lka, Sep 26 '22 at 09:09
