 

Scrapy and proxies

Tags: python, scrapy

How do you utilize proxy support with the python web-scraping framework Scrapy?

asked Jan 17 '11 by no1

People also ask

Does Scrapy use proxy?

As a web scraping tool, Scrapy has support for proxies, and you will most likely make use of proxies in your scraping project.

How do you integrate a proxy in Scrapy?

Via request parameters: simply include the proxy connection details in the meta field of every request within your spider. Scrapy's HttpProxyMiddleware, which is enabled by default, will then route the request through the proxy you defined.
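For instance, a minimal sketch (the URL and proxy address are placeholders):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"

        def start_requests(self):
            # HttpProxyMiddleware reads the "proxy" key from request.meta.
            yield scrapy.Request(
                "http://example.com",
                meta={"proxy": "http://proxy:port"},
            )

        def parse(self, response):
            pass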

What is middleware in Scrapy?

The spider middleware is a framework of hooks into Scrapy's spider processing mechanism where you can plug custom functionality to process the responses that are sent to Spiders for processing and to process the requests and items that are generated from spiders.
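As an illustration, a skeleton spider middleware might look like this (a minimal sketch; the class name is hypothetical, the hook signatures are Scrapy's):

    class MySpiderMiddleware:
        # Called for each response before it is handed to the spider.
        def process_spider_input(self, response, spider):
            return None  # None means "continue processing"

        # Called with the items and requests the spider yields.
        def process_spider_output(self, response, result, spider):
            for item_or_request in result:
                yield item_or_request

Note that proxy support itself is implemented as a downloader middleware (HttpProxyMiddleware), not a spider middleware.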


2 Answers

From the Scrapy FAQ,

Does Scrapy work with HTTP proxies?

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.

The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell.

    C:\>set http_proxy=http://proxy:port
    csh% setenv http_proxy http://proxy:port
    sh$ export http_proxy=http://proxy:port

If you want to use an HTTPS proxy for visiting HTTPS sites, set the https_proxy environment variable instead:

    C:\>set https_proxy=https://proxy:port
    csh% setenv https_proxy https://proxy:port
    sh$ export https_proxy=https://proxy:port
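Scrapy's HttpProxyMiddleware picks these variables up through the standard library's proxy detection, so a quick sanity check is:

    from urllib.request import getproxies

    # Prints the proxies detected from the environment, e.g.
    # {'http': 'http://proxy:port', 'https': 'https://proxy:port'}
    print(getproxies())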
answered Oct 22 '22 by ephemient

Single Proxy

  1. Enable HttpProxyMiddleware in your settings.py, like this:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1
    }
  2. Pass the proxy to the request via request.meta:

    from scrapy import Request

    request = Request(url="http://example.com")
    request.meta['proxy'] = "http://host:port"  # include the scheme
    yield request
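If the proxy requires authentication, the credentials can be embedded in the proxy URL and HttpProxyMiddleware turns them into a Proxy-Authorization header (user and pass below are placeholders):

    request.meta['proxy'] = "http://user:pass@host:port"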

Multiple Proxies

You can also choose a proxy address at random if you have an address pool, like this:

import random

from scrapy.http import Request
from scrapy.spider import BaseSpider  # in modern Scrapy: from scrapy import Spider, Request


class MySpider(BaseSpider):
    name = "my_spider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']

    def parse(self, response):
        # ... parse code ...
        if something:
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req
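An alternative to picking the proxy inside the spider is to assign it once in a custom downloader middleware, so every request gets a proxy automatically. A minimal sketch, assuming a custom PROXY_POOL list in settings.py (the setting name and class name are hypothetical):

    import random

    class RandomProxyMiddleware:
        """Assign a random proxy from PROXY_POOL to each outgoing request."""

        def __init__(self, proxy_pool):
            self.proxy_pool = proxy_pool

        @classmethod
        def from_crawler(cls, crawler):
            # e.g. PROXY_POOL = ['http://proxy1:port', 'http://proxy2:port']
            return cls(crawler.settings.getlist('PROXY_POOL'))

        def process_request(self, request, spider):
            if self.proxy_pool and 'proxy' not in request.meta:
                request.meta['proxy'] = random.choice(self.proxy_pool)

Enable it in DOWNLOADER_MIDDLEWARES with a priority lower than HttpProxyMiddleware's default of 750 so it runs first, e.g. {'myproject.middlewares.RandomProxyMiddleware': 350} (the module path is hypothetical).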
answered Oct 22 '22 by Amom