
How to use Downloader Middleware in Scrapy

I am using Scrapy to scrape some web pages. I wrote a custom ProxyMiddleware class and implemented my requirement in its process_request(self, request, spider) method. Here is my code (copied):

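import random

from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware


class ProxyMiddleware(HttpProxyMiddleware):
    # Class attribute, so self.proxy_list resolves inside process_request.
    proxy_list = [list of proxies]

    def __init__(self, proxy_ip=''):
        self.proxy_ip = proxy_ip

    def process_request(self, request, spider):
        # Pick a proxy at random for each outgoing request.
        ip = random.choice(self.proxy_list)
        if ip:
            request.meta['proxy'] = ip
        # Return None so Scrapy keeps processing this request;
        # returning the request itself would reschedule it.
        return None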

What I don't understand is how Scrapy will use my implementation instead of the default class. After some searching, my understanding is that I need to change settings.py:

DOWNLOADER_MIDDLEWARES = {
    'IPProxy.middlewares.MyCustomDownloaderMiddleware': 543,
    'IPProxy.IPProxy.spiders.RandomProxy': 600
}

To help readers understand my project structure, I added the second entry to the dictionary with an arbitrary value. My project structure is:

[Image: project structure screenshot, not included]

My questions are:

  • How do I use DOWNLOADER_MIDDLEWARES in settings.py correctly?
  • How do I assign values to the entries in DOWNLOADER_MIDDLEWARES?
  • How do I make Scrapy call my custom code instead of the default?
Jack Daniel asked Nov 09 '22 03:11



1 Answer

Assuming you want to disable the built-in HttpProxyMiddleware downloader middleware, set its value in DOWNLOADER_MIDDLEWARES to None:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'IPProxy.middlewares.MyCustomDownloaderMiddleware': 543,
    'IPProxy.IPProxy.spiders.RandomProxy': 600
}
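
To illustrate what the numbers and the None value do, here is a minimal pure-Python sketch of how the project setting is combined with Scrapy's built-in defaults (DOWNLOADER_MIDDLEWARES_BASE). The helper effective_middlewares is hypothetical and only mimics the merge; it is not Scrapy's own code, and the defaults dictionary is an abbreviated excerpt.

```python
# Abbreviated excerpt of Scrapy's built-in defaults (DOWNLOADER_MIDDLEWARES_BASE);
# the real setting lists more middlewares than these three.
DOWNLOADER_MIDDLEWARES_BASE = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}

# Project-level setting from settings.py, trimmed to one custom entry.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'IPProxy.middlewares.MyCustomDownloaderMiddleware': 543,
}


def effective_middlewares(base, custom):
    """Hypothetical helper: merge both dicts, drop None entries, sort by order."""
    merged = {**base, **custom}  # project values override the defaults
    enabled = {path: order for path, order in merged.items() if order is not None}
    return sorted(enabled, key=enabled.get)


order = effective_middlewares(DOWNLOADER_MIDDLEWARES_BASE, DOWNLOADER_MIDDLEWARES)
for path in order:
    print(path)
```

In this sketch the built-in HttpProxyMiddleware disappears from the result, and the custom middleware at 543 lands between the entries at 500 and 550. Scrapy calls process_request in increasing order of these values and process_response in decreasing order, so the number decides where your code runs relative to the built-in middlewares.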
alecxe answered Nov 14 '22 22:11
