Is there any way to use a separate Scrapy pipeline for each spider?

I want to fetch web pages from different domains, which means I have to use different spiders with the command "scrapy crawl myspider". However, since the content of the pages differs, I have to use different pipeline logic to put the data into the database. But every spider goes through all of the pipelines defined in settings.py. Is there an elegant way to use separate pipelines for each spider?

asked Dec 03 '22 by uuball

1 Answer

The ITEM_PIPELINES setting is defined globally for all spiders in the project when the engine starts. It cannot be changed per spider on the fly once crawling has begun.
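
Note that newer Scrapy versions (1.0 and later) let each spider override project settings, including ITEM_PIPELINES, through a custom_settings class attribute. A minimal sketch, where the spider and the pipeline path are placeholder names:

    import scrapy

    class Spider1(scrapy.Spider):
        name = 'spider1'

        # Overrides the project-wide ITEM_PIPELINES for this spider only;
        # 'myproject.pipelines.Spider1Pipeline' is a placeholder path.
        custom_settings = {
            'ITEM_PIPELINES': {
                'myproject.pipelines.Spider1Pipeline': 300,
            },
        }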

If you are on an older version, here are some options to consider:

  • Change the code of your pipelines. In the process_item method, skip or process an item depending on which spider it came from, e.g.:

        def process_item(self, item, spider):
            # Pass items from spiders this pipeline does not handle through untouched.
            if spider.name not in ['spider1', 'spider2']:
                return item

            # ... process the item (e.g. write it to the database) ...
            return item

  • Change the way you start crawling. Do it from a script, pass the spider name as a parameter, and override your ITEM_PIPELINES setting before calling crawler.configure() (see the sketch after this list).
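
A minimal sketch of that second option, written against the current CrawlerProcess API rather than the older manual crawler.configure() sequence; the spider names and pipeline paths are placeholders:

    import sys

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Hypothetical mapping from spider name to the pipelines it should use.
    PIPELINES_PER_SPIDER = {
        'spider1': {'myproject.pipelines.Spider1Pipeline': 300},
        'spider2': {'myproject.pipelines.Spider2Pipeline': 300},
    }

    spider_name = sys.argv[1]  # e.g. "spider1"

    settings = get_project_settings()
    # Override the global ITEM_PIPELINES before the engine starts.
    settings.set('ITEM_PIPELINES', PIPELINES_PER_SPIDER[spider_name])

    process = CrawlerProcess(settings)
    process.crawl(spider_name)  # the name is resolved via the project's spider loader
    process.start()  # blocks until the crawl finishes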

See also:

  • Scrapy. How to change spider settings after start crawling?
  • Can I use spider-specific settings?
  • Using one Scrapy spider for several websites
  • related answer

Hope that helps.

answered Dec 09 '22 by alecxe