 

How can I use different pipelines for different spiders in a single Scrapy project

I have a Scrapy project that contains multiple spiders. Is there a way to define which pipelines to use for which spider? Not all of the pipelines I have defined are applicable to every spider.

Thanks

CodeMonkeyB asked Dec 04 '11


People also ask

How do you use multiple spiders in Scrapy?

Use the CrawlerProcess class to run multiple Scrapy spiders in the same process. Create an instance of CrawlerProcess with the project settings, then schedule each spider on it. Create a Crawler instance for a spider only if you want to give that spider its own custom settings.
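A minimal sketch of this (the spider class names and module paths are placeholders, and it assumes you run the script inside a Scrapy project so get_project_settings() can find settings.py):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.first import FirstSpider    # hypothetical spiders
from myproject.spiders.second import SecondSpider  # defined in your project

process = CrawlerProcess(get_project_settings())
process.crawl(FirstSpider)    # schedule each spider on the same process
process.crawl(SecondSpider)
process.start()               # blocks until all crawls are finished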

What are Scrapy pipelines?

Each item pipeline component (sometimes referred to simply as an "Item Pipeline") is a Python class that implements a simple method, process_item(). It receives an item and performs an action on it, and also decides whether the item should continue through the pipeline or be dropped and no longer processed.
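A minimal sketch of such a component (the 'price' field and the drop-on-missing-price behaviour are illustrative assumptions, not part of the question):

from scrapy.exceptions import DropItem

class PricePipeline(object):

    def process_item(self, item, spider):
        # pass the item on to the next pipeline component...
        if item.get('price'):
            return item
        # ...or drop it so no later component sees it
        raise DropItem('Missing price in %s' % item)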

How do you run a Scrapy spider from a Python script?

The key to running Scrapy from a Python script is the CrawlerProcess class, which lives in the scrapy.crawler module. It provides the engine to run Scrapy within a Python script; internally, the CrawlerProcess class uses Python's Twisted framework.
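A minimal sketch of such a standalone script (the spider, site, and selectors are illustrative; with the spider defined inline you don't even need a full Scrapy project):

import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    # hypothetical inline spider, just for the script example
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()   # the script blocks here until the crawl finishes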

What is a spider in Scrapy?

Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
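As an illustration (a sketch against recent Scrapy versions; the site and CSS selectors are assumptions), a spider that both follows pagination links and extracts items might look like this:

import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # extract structured data (items) from the current page
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('p.price_color::text').get(),
            }
        # follow the "next" link to continue the crawl
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)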


2 Answers

Just remove all pipelines from the main settings and set them inside each spider instead.

This defines the pipelines to use on a per-spider basis:

class testSpider(InitSpider):
    name = 'test'

    custom_settings = {
        'ITEM_PIPELINES': {
            'app.MyPipeline': 400
        }
    }
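For instance (a sketch; the class name and the 'app.OtherPipeline' dotted path are placeholders), a second spider in the same project can point at a different pipeline:

class otherSpider(InitSpider):
    name = 'other'

    # hypothetical pipeline path; only this spider will run it
    custom_settings = {
        'ITEM_PIPELINES': {
            'app.OtherPipeline': 400
        }
    }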
Mirage answered Sep 26 '22


Building on the solution from Pablo Hoffman, you can use the following decorator on the process_item method of a Pipeline object so that it checks the spider's pipeline attribute to decide whether it should run. For example:

import functools

from scrapy import log


def check_spider_pipeline(process_item_method):

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):

        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if this pipeline class is in the spider's pipeline set, use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=log.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return item

    return wrapper

For this decorator to work correctly, the spider must have a pipeline attribute containing the Pipeline classes that you want to use to process its items, for example:

class MySpider(BaseSpider):

    pipeline = set([
        pipelines.Save,
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item

And then in a pipelines.py file:

class Save(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do saving here
        return item


class Validate(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do validating here
        return item

All Pipeline objects must still be listed in ITEM_PIPELINES in settings, in the correct order (it would be nice if the order could be specified on the spider, too).
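For reference, a sketch of the corresponding settings entry (the dotted paths assume the pipelines live in myproject/pipelines.py; in recent Scrapy versions ITEM_PIPELINES is a dict mapping dotted paths to order numbers, with lower numbers running first):

# settings.py (hypothetical dotted paths)
ITEM_PIPELINES = {
    'myproject.pipelines.Validate': 100,   # runs first
    'myproject.pipelines.Save': 200,       # runs second
}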

mstringer answered Sep 24 '22