I have a Scrapy project which contains multiple spiders. Is there any way I can define which pipelines to use for which spider? Not all of the pipelines I have defined are applicable to every spider.
Thanks
We use the CrawlerProcess class to run multiple Scrapy spiders in a process simultaneously. We create an instance of CrawlerProcess with the project settings, and we create a Crawler instance for a spider when we want that spider to have custom settings.
Each item pipeline component (sometimes referred to simply as an “Item Pipeline”) is a Python class that implements a simple method. It receives an item and performs an action on it, also deciding whether the item should continue through the pipeline or be dropped and no longer processed.
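As a rough illustration of that idea (PricePipeline and the price field are hypothetical, not part of this question), process_item either returns the item or raises DropItem:

from scrapy.exceptions import DropItem


class PricePipeline:

    def process_item(self, item, spider):
        if item.get('price'):
            # let the item continue on to the next pipeline component
            return item
        # drop the item so no later pipeline component sees it
        raise DropItem('missing price in %r' % item)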
The key to running Scrapy from a Python script is the CrawlerProcess class, found in the scrapy.crawler module. It provides the engine to run Scrapy within a Python script; under the hood, CrawlerProcess builds on Python's Twisted framework.
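A minimal sketch of running a spider from a script this way (the myproject package and TestSpider class are placeholders, and the script is assumed to be run from inside the Scrapy project so get_project_settings can find settings.py):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.test import TestSpider  # hypothetical spider class

process = CrawlerProcess(get_project_settings())  # load the project settings
process.crawl(TestSpider)  # the spider's custom_settings are merged in here
process.start()            # blocks until all crawling is finished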
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
Just remove all pipelines from the main settings and declare them inside each spider instead. This defines the pipelines to use per spider:
class testSpider(InitSpider):
    name = 'test'

    custom_settings = {
        'ITEM_PIPELINES': {
            'app.MyPipeline': 400
        }
    }
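To sketch how this answers the original question (the spider and pipeline names below are placeholders, not taken from the answer above), each spider can enable only the pipelines that apply to it:

import scrapy


class ProductSpider(scrapy.Spider):
    name = 'products'
    custom_settings = {
        'ITEM_PIPELINES': {
            'app.pipelines.ValidatePipeline': 300,
            'app.pipelines.SaveProductPipeline': 400,
        }
    }


class NewsSpider(scrapy.Spider):
    name = 'news'
    custom_settings = {
        # only the pipeline relevant to this spider is enabled
        'ITEM_PIPELINES': {
            'app.pipelines.SaveNewsPipeline': 400,
        }
    }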
Building on the solution from Pablo Hoffman, you can use the following decorator on the process_item method of a Pipeline object so that it checks the pipeline attribute of your spider to decide whether or not it should be executed. For example:
import functools
import logging


def check_spider_pipeline(process_item_method):

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):

        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if this pipeline class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=logging.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=logging.DEBUG)
            return item

    return wrapper
For this decorator to work correctly, the spider must have a pipeline attribute with a container of the Pipeline objects that you want to use to process the item, for example:
class MySpider(BaseSpider):

    pipeline = set([
        pipelines.Save,
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item
And then in a pipelines.py file:
class Save(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do saving here
        return item


class Validate(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do validating here
        return item
All Pipeline objects should still be defined in ITEM_PIPELINES in settings (in the correct order -- it would be nice if the order could be specified on the Spider, too).
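For reference, here is a sketch of what that ITEM_PIPELINES entry might look like in settings.py (the myproject.pipelines module path is an assumption, not something stated in the answer):

# settings.py -- both pipelines stay registered globally; the decorator above
# decides per spider whether each step actually runs
ITEM_PIPELINES = {
    'myproject.pipelines.Validate': 100,
    'myproject.pipelines.Save': 200,
}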