I use Tor to crawl web pages. I started the tor and polipo services and added this downloader middleware:
class ProxyMiddleware(object):
    # Override process_request so every request is routed through the proxy
    def process_request(self, request, spider):
        # Set the location of the proxy (Polipo listens on port 8123 by default)
        request.meta['proxy'] = "http://127.0.0.1:8123"
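For the middleware to run at all, it also has to be enabled in settings.py; a minimal sketch, assuming the class above lives in a hypothetical myproject/middlewares.py (the module path is an assumption for illustration):

# settings.py -- module path is an assumption, adjust to your project layout
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 100,
}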
Now, how can I make sure that Scrapy uses a different IP address for its requests?
You can yield a first request that checks your public IP, and compare it to the IP you see when you visit http://checkip.dyndns.org/ without Tor/VPN. If they are not the same, Scrapy is obviously using a different IP.
from scrapy import Request

def start_requests(self):
    # Check the public IP first, before crawling anything else
    yield Request('http://checkip.dyndns.org/', callback=self.check_ip)
    # yield other requests from start_urls here if needed

def check_ip(self, response):
    # The page body contains something like "Current IP Address: 1.2.3.4"
    pub_ip = response.xpath('//body/text()').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')[0]
    print("My public IP is: " + pub_ip)
    # yield other requests here if needed
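If you want the spider to stop immediately when the proxy is not in use, you could compare the result against your real IP and abort; a minimal sketch, where REAL_IP is a hypothetical value you looked up yourself beforehand without Tor/VPN:

from scrapy.exceptions import CloseSpider

REAL_IP = '203.0.113.7'  # hypothetical: your real public IP, checked without Tor/VPN

def check_ip(self, response):
    pub_ip = response.xpath('//body/text()').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')[0]
    if pub_ip == REAL_IP:
        # The request did not go through the proxy -- stop crawling
        raise CloseSpider('proxy is not being used')
    self.logger.info('Requests are going out via %s', pub_ip)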
The fastest option would be to use the scrapy shell and check that the meta contains the proxy.
Start it from the project root:
$ scrapy shell http://google.com
>>> request.meta
{'handle_httpstatus_all': True, 'redirect_ttl': 20, 'download_timeout': 180, 'proxy': 'http://127.0.0.1:8123', 'download_latency': 0.4804518222808838, 'download_slot': 'google.com'}
>>> response.meta
{'download_timeout': 180, 'handle_httpstatus_all': True, 'redirect_ttl': 18, 'redirect_times': 2, 'redirect_urls': ['http://google.com', 'http://www.google.com/'], 'depth': 0, 'proxy': 'http://127.0.0.1:8123', 'download_latency': 1.5814828872680664, 'download_slot': 'google.com'}
This way you can check that the middleware is configured correctly and that requests are going through the proxy.
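The same check can be scripted without an interactive session, assuming your Scrapy version supports the shell's -c option (which evaluates an expression and prints the result):

$ scrapy shell http://google.com -c 'request.meta.get("proxy")'
http://127.0.0.1:8123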