How to set different scrapy-settings for different spiders?

I want to enable an HTTP proxy for some spiders and disable it for the others.

Can I do something like this?

# settings.py
proxy_spiders = ['a1', 'b2']

if spider in proxy_spiders: # how to get spider name ???
    HTTP_PROXY = 'http://127.0.0.1:8123'
    DOWNLOADER_MIDDLEWARES = {
         'myproject.middlewares.RandomUserAgentMiddleware': 400,
         'myproject.middlewares.ProxyMiddleware': 410,
         'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }
else:
    DOWNLOADER_MIDDLEWARES = {
         'myproject.middlewares.RandomUserAgentMiddleware': 400,
         'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }

If the code above doesn't work, is there another way to do this?

Michael Nguyen asked Oct 11 '13 21:10


2 Answers

A bit late, but since release 1.0.0 Scrapy lets you override settings per spider through the custom_settings class attribute, like this:

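import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    # HTTP_PROXY is a project-specific setting, presumably read by the custom ProxyMiddleware.
    custom_settings = {
        "HTTP_PROXY": "http://127.0.0.1:8123",
        "DOWNLOADER_MIDDLEWARES": {
            "myproject.middlewares.RandomUserAgentMiddleware": 400,
            "myproject.middlewares.ProxyMiddleware": 410,
            # Disable the built-in user-agent middleware (module path as of Scrapy 1.0;
            # the older scrapy.contrib.downloadermiddleware path is deprecated).
            "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
        },
    }


class MySpider2(scrapy.Spider):
    name = "my_spider2"
    # No proxy here: only the user-agent middlewares are configured.
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "myproject.middlewares.RandomUserAgentMiddleware": 400,
            "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
        },
    }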
user4055746 answered Nov 03 '22 19:11
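
Neither the question nor the answer shows what myproject.middlewares.ProxyMiddleware looks like, so here is a minimal sketch of one possible version. It is an assumption, not the asker's actual code: it treats HTTP_PROXY as a project-defined setting (as far as I know it is not a built-in Scrapy setting) and reads it through crawler.settings, which already includes the spider's custom_settings. Setting request.meta["proxy"] is the documented way to hand a proxy URL to Scrapy's built-in HttpProxyMiddleware.

```python
class ProxyMiddleware:
    """Hypothetical downloader middleware: applies HTTP_PROXY when it is set."""

    def __init__(self, proxy_url=None):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings reflects settings.py merged with the spider's
        # custom_settings, so each spider gets its own value (or None).
        return cls(proxy_url=crawler.settings.get("HTTP_PROXY"))

    def process_request(self, request, spider):
        # Only route through the proxy when a URL was configured.
        if self.proxy_url:
            request.meta["proxy"] = self.proxy_url
        # Returning None lets Scrapy continue processing the request.
        return None
```

With a guard like this, a spider that does not define HTTP_PROXY is unaffected even if the middleware stays enabled for it.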


There is a newer and easier way to do this: set custom_settings on the spider class.

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }

I use Scrapy 1.3.1

Aminah Nuraini answered Nov 03 '22 21:11
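
If many spiders share most of their middleware configuration, repeating the whole dict in every custom_settings gets tedious. One way to avoid that, sketched below, is a small helper that builds the dict. The name build_custom_settings and its parameters are hypothetical; the middleware paths and proxy URL are taken from the question, except that the built-in middleware uses the scrapy.downloadermiddlewares path from Scrapy 1.0 onward.

```python
# Middlewares that every spider in the project uses.
BASE_MIDDLEWARES = {
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}


def build_custom_settings(use_proxy=False, proxy_url="http://127.0.0.1:8123"):
    """Return a custom_settings dict, optionally enabling the proxy middleware."""
    # Copy so that one spider's changes never leak into another's settings.
    middlewares = dict(BASE_MIDDLEWARES)
    settings = {"DOWNLOADER_MIDDLEWARES": middlewares}
    if use_proxy:
        middlewares["myproject.middlewares.ProxyMiddleware"] = 410
        # HTTP_PROXY is a project-defined setting, not a Scrapy built-in.
        settings["HTTP_PROXY"] = proxy_url
    return settings


# Intended use inside a spider class body:
#     custom_settings = build_custom_settings(use_proxy=True)
```

Because custom_settings is a class attribute, the helper is evaluated once when the spider class is defined, which is early enough for Scrapy to pick it up.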