How do you use proxy support with the Python web-scraping framework Scrapy?
Scrapy has built-in support for proxies, and most scraping projects will need to make use of them sooner or later.
1. Via Request Parameters

Simply include the proxy connection details in the meta field of every request within your spider. Scrapy's HttpProxyMiddleware, which is enabled by default, will then route the request through the proxy you defined.
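For example, here is a minimal sketch of this approach; the spider name, target URL, and proxy address are placeholders, not values from the original answer:

import scrapy


class ProxySpider(scrapy.Spider):
    # Hypothetical spider: the name, URL, and proxy address are placeholders.
    name = "proxy_example"

    def start_requests(self):
        yield scrapy.Request(
            url="http://example.com",
            # HttpProxyMiddleware routes the request through the proxy
            # given in this meta key.
            meta={"proxy": "http://127.0.0.1:8080"},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Fetched %s via proxy", response.url)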
HttpProxyMiddleware is a downloader middleware: one of the hooks in Scrapy's request/response processing pipeline where requests can be modified before they are sent, which is how the proxy meta key gets applied.
From the Scrapy FAQ:

Does Scrapy work with HTTP proxies?

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.
The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell.
C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port
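Because HttpProxyMiddleware reads these variables when the crawler starts, you can also set them from Python before launching the crawl. A minimal sketch, assuming a placeholder proxy address:

import os

# Set the proxy before the crawler process starts, since
# HttpProxyMiddleware reads proxy settings from the environment.
os.environ["http_proxy"] = "http://proxy:port"

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
# process.crawl(MySpider)  # your spider class
# process.start()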
If you want to proxy HTTPS requests to HTTPS sites, set the https_proxy environment variable instead:
C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port
Single Proxy
Enable HttpProxyMiddleware in your settings.py, like this (it ships enabled by default, so this step is only needed if you have changed the default middleware configuration):
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
}
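If you would rather keep this configuration with one spider instead of the project-wide settings.py, the same mapping can go in the spider's custom_settings attribute. A minimal sketch (the spider name is a placeholder):

import scrapy


class SettingsSpider(scrapy.Spider):
    # Hypothetical spider; only the middleware configuration matters here.
    name = "settings_example"

    # custom_settings overrides project settings for this spider only.
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 1,
        }
    }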
Then pass the proxy to the request via request.meta:
request = Request(url="http://example.com")
request.meta['proxy'] = "http://host:port"  # include the scheme in the proxy URL
yield request
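If the proxy requires authentication, credentials can be embedded in the proxy URL itself, and HttpProxyMiddleware converts them into a Proxy-Authorization header. A sketch with placeholder credentials:

request = Request(url="http://example.com")
# "user" and "pass" are placeholders; HttpProxyMiddleware turns them
# into a Proxy-Authorization header for the proxy server.
request.meta['proxy'] = "http://user:pass@host:port"
yield request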
Multiple Proxies

You can also choose a proxy address at random if you have a pool of addresses, like this:
import random

from scrapy import Request, Spider


class MySpider(Spider):
    name = "my_spider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Pool of proxy addresses to pick from at random.
        self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']

    def parse(self, response):
        # ...parse code...
        if something:  # placeholder condition from the original example
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            # Rotate proxies by choosing one at random per request.
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req
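A downloader-middleware variant of the same idea keeps the rotation logic out of the spider entirely. A sketch, assuming a hypothetical RandomProxyMiddleware with placeholder pool entries:

import random


class RandomProxyMiddleware:
    # Hypothetical middleware; the pool entries are placeholders.
    PROXY_POOL = [
        'http://proxy_address1:port',
        'http://proxy_address2:port',
    ]

    def process_request(self, request, spider):
        # Assign a random proxy unless the spider already chose one.
        request.meta.setdefault('proxy', random.choice(self.PROXY_POOL))

Enable it in DOWNLOADER_MIDDLEWARES like any other downloader middleware; the priority value you give it (for example 100) is an arbitrary choice here.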