I have a Scrapy project which contains multiple spiders. Is there any way I can define which pipelines to use for which spider? Not all of the pipelines I have defined are applicable to every spider.
Thanks
We use the CrawlerProcess class to run multiple Scrapy spiders in a process simultaneously. We create an instance of CrawlerProcess with the project settings, and we create a Crawler instance for a spider when we want that spider to have custom settings.
Each item pipeline component (sometimes referred to simply as an “Item Pipeline”) is a Python class that implements a simple method. It receives an item and performs an action on it, also deciding whether the item should continue through the pipeline or be dropped and no longer processed.
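As a rough illustration of that idea (PricePipeline and the price field are hypothetical, not part of this question), process_item either returns the item or raises DropItem:

from scrapy.exceptions import DropItem


class PricePipeline:

    def process_item(self, item, spider):
        if item.get('price'):
            # let the item continue on to the next pipeline component
            return item
        # drop the item so no later pipeline component sees it
        raise DropItem('missing price in %r' % item)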
The key to running Scrapy from a Python script is the CrawlerProcess class, found in the scrapy.crawler module. It provides the engine to run Scrapy within a Python script; under the hood, CrawlerProcess builds on Python's Twisted framework.
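A minimal sketch of running a spider from a script this way (the myproject package and TestSpider class are placeholders, and the script is assumed to be run from inside the Scrapy project so get_project_settings can find settings.py):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.test import TestSpider  # hypothetical spider class

process = CrawlerProcess(get_project_settings())  # load the project settings
process.crawl(TestSpider)  # the spider's custom_settings are merged in here
process.start()            # blocks until all crawling is finished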
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
Just remove all pipelines from the main settings and declare them inside each spider instead. This defines the pipelines to use per spider:
class testSpider(InitSpider):
    name = 'test'

    custom_settings = {
        'ITEM_PIPELINES': {
            'app.MyPipeline': 400
        }
    }
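To sketch how this answers the original question (the spider and pipeline names below are placeholders, not taken from the answer above), each spider can enable only the pipelines that apply to it:

import scrapy


class ProductSpider(scrapy.Spider):
    name = 'products'
    custom_settings = {
        'ITEM_PIPELINES': {
            'app.pipelines.ValidatePipeline': 300,
            'app.pipelines.SaveProductPipeline': 400,
        }
    }


class NewsSpider(scrapy.Spider):
    name = 'news'
    custom_settings = {
        # only the pipeline relevant to this spider is enabled
        'ITEM_PIPELINES': {
            'app.pipelines.SaveNewsPipeline': 400,
        }
    }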
Building on the solution from Pablo Hoffman, you can use the following decorator on the process_item method of a Pipeline object so that it checks the pipeline attribute of your spider to decide whether or not it should be executed. For example:
import functools
import logging


def check_spider_pipeline(process_item_method):

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):

        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if this pipeline class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=logging.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=logging.DEBUG)
            return item

    return wrapper
For this decorator to work correctly, the spider must have a pipeline attribute with a container of the Pipeline objects that you want to use to process the item, for example:
class MySpider(BaseSpider):

    pipeline = set([
        pipelines.Save,
        pipelines.Validate,
    ])

    def parse(self, response):
        # insert scrapy goodness here
        return item
And then in a pipelines.py file:
class Save(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do saving here
        return item


class Validate(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do validating here
        return item
All Pipeline objects should still be defined in ITEM_PIPELINES in settings (in the correct order -- it would be nice if the order could be specified on the Spider, too).
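For reference, here is a sketch of what that ITEM_PIPELINES entry might look like in settings.py (the myproject.pipelines module path is an assumption, not something stated in the answer):

# settings.py -- both pipelines stay registered globally; the decorator above
# decides per spider whether each step actually runs
ITEM_PIPELINES = {
    'myproject.pipelines.Validate': 100,
    'myproject.pipelines.Save': 200,
}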