Not entirely sure what the problem is here.
I'm running Python 2.7.3 and Scrapy 0.16.5.
I've created a very simple Scrapy spider to test connecting to my local Polipo proxy so I can send requests out via Tor. The basic code of my spider is as follows:
from scrapy.spider import BaseSpider

class TorSpider(BaseSpider):
    name = "tor"
    allowed_domains = ["check.torproject.org"]
    start_urls = [
        "https://check.torproject.org"
    ]

    def parse(self, response):
        print response.body
For my proxy middleware, I've defined:
from scrapy.conf import settings  # settings singleton, as used in Scrapy 0.16

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Point every outgoing request at the proxy defined in settings
        request.meta['proxy'] = settings.get('HTTP_PROXY')
My HTTP_PROXY in my settings file is defined as HTTP_PROXY = 'http://localhost:8123'.
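For completeness, the middleware is enabled in settings.py along these lines (the module path and priority below are illustrative rather than copied verbatim from my project; the enabled-middlewares line in the log below shows it is being picked up):

DOWNLOADER_MIDDLEWARES = {
    # Illustrative path/priority -- point this at wherever ProxyMiddleware actually lives
    'arachnid.middlewares.ProxyMiddleware': 410,
}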
Now, if I change my start URL to http://check.torproject.org, everything works fine with no problems.
If I attempt to run against https://check.torproject.org, I get a 400 Bad Request error every time (I've also tried other https:// sites, and they all have the same problem):
2013-07-23 21:36:18+0100 [scrapy] INFO: Scrapy 0.16.5 started (bot: arachnid)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, ProxyMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled item pipelines:
2013-07-23 21:36:18+0100 [tor] INFO: Spider opened
2013-07-23 21:36:18+0100 [tor] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 1 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 2 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Gave up retrying <GET https://check.torproject.org> (failed 3 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Crawled (400) <GET https://check.torproject.org> (referer: None)
2013-07-23 21:36:18+0100 [tor] INFO: Closing spider (finished)
And just to double-check that it isn't something wrong with my Tor/Polipo setup, I'm able to run the following curl command in a terminal and it connects fine:

curl --proxy localhost:8123 https://check.torproject.org/
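(For anyone who wants to reproduce that check from plain Python rather than curl, a rough equivalent would be something like the snippet below; the proxy address is hardcoded here, and urllib2 tunnels the https:// request through the proxy with CONNECT.)

import urllib2

# Route both plain HTTP and HTTPS requests through the local Polipo proxy.
proxy = urllib2.ProxyHandler({
    'http': 'http://127.0.0.1:8123',
    'https': 'http://127.0.0.1:8123',
})
opener = urllib2.build_opener(proxy)
print opener.open('https://check.torproject.org/').read()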
Any suggestions as to what's wrong here?
Try

request.meta['proxy'] = 'http://127.0.0.1:8123'

In my case it works.
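Applied to the middleware from the question, that would look something like this (hardcoding the address rather than pulling it from settings, just to rule the settings lookup out):

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Explicit IPv4 address instead of 'localhost', and no settings lookup,
        # to keep the debugging surface as small as possible.
        request.meta['proxy'] = 'http://127.0.0.1:8123'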