I am using Scrapy to crawl different sites, and for each site I have an Item (different information is extracted).
I have a generic pipeline, since most of the information is the same, but now I am crawling some Google search responses and the pipeline for those must be different.
For example, GenericItem uses GenericPipeline, but GoogleItem uses GoogleItemPipeline. When the spider is crawling, however, it tries to use GenericPipeline instead of GoogleItemPipeline. How can I specify which pipeline the Google spider must use?
Each item pipeline component (sometimes referred to as just “Item Pipeline”) is a Python class that implements a simple method. It receives an item and performs an action on it, also deciding whether the item should continue through the pipeline or be dropped and no longer processed.
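A minimal sketch of that contract, assuming the current Scrapy method signature process_item(self, item, spider). The class name RequireTitlePipeline and the "title" field are hypothetical, chosen only for illustration; DropItem is Scrapy's own exception, replaced by a stand-in here when Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs without Scrapy installed.
    class DropItem(Exception):
        pass


class RequireTitlePipeline:
    """Hypothetical pipeline: keep only items with a non-empty 'title'."""

    def process_item(self, item, spider):
        if not item.get("title"):
            # Raising DropItem stops any further processing of this item.
            raise DropItem("missing title")
        # Returning the item hands it on to the next pipeline component.
        return item
```

Returning the item lets it continue through the pipeline; raising DropItem discards it.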
You activate an Item Pipeline component by adding its class to the ITEM_PIPELINES setting. The integer value assigned to each class determines the order in which the components run: items pass through from lower-valued to higher-valued classes, and by convention the values are kept in the 0-1000 range.
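As an illustration of the setting described above, here is a sketch of a settings.py entry. The module path myproject.pipelines and the values 300 and 800 are assumptions made for this example; the class names are the ones from the question.

```python
# settings.py (the module path "myproject.pipelines" is hypothetical)
ITEM_PIPELINES = {
    "myproject.pipelines.GenericPipeline": 300,
    "myproject.pipelines.GoogleItemPipeline": 800,
}

# Items pass through the components in ascending order of these values.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

Note that every pipeline listed here is enabled for every item, which is why each pipeline has to decide for itself which items it handles.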
Scrapy is a web scraping framework used to crawl sites and to parse and collect web data. A project's pipelines.py file holds the components (classes) that handle the scraped data; they are executed sequentially.
For now there is only one way: check the item type in the pipeline and either process the item or return it as is.
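pipelines.py:
from grabbers.items import FeedItem

class StoreFeedPost(object):
    # Signature as posted; current Scrapy calls process_item(self, item, spider).
    def process_item(self, domain, item):
        if isinstance(item, FeedItem):
            pass  # process it...
        # Items of any other type are returned as is.
        return item

items.py:
# ScrapedItem comes from an old Scrapy release; current versions use scrapy.Item.
from scrapy.item import ScrapedItem

class FeedItem(ScrapedItem):
    pass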
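Applied to the question's GenericItem and GoogleItem, the same type check could look like the sketch below. It assumes the current process_item(self, item, spider) signature. The item classes are plain dict stand-ins for scrapy.Item subclasses so that the example runs without Scrapy, and the "handled_by" key and the run_pipelines helper are hypothetical, added only to make the routing visible.

```python
class GenericItem(dict):
    """Stand-in for a scrapy.Item subclass."""


class GoogleItem(dict):
    """Stand-in for a scrapy.Item subclass."""


class GenericPipeline:
    def process_item(self, item, spider):
        if isinstance(item, GenericItem):
            item["handled_by"] = "generic"  # placeholder for real processing
        return item  # any other item type passes through unchanged


class GoogleItemPipeline:
    def process_item(self, item, spider):
        if isinstance(item, GoogleItem):
            item["handled_by"] = "google"  # placeholder for real processing
        return item


def run_pipelines(item, pipelines, spider=None):
    # Simplified imitation of an item passing through each component in order.
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


pipelines = [GenericPipeline(), GoogleItemPipeline()]
print(run_pipelines(GoogleItem(), pipelines))
print(run_pipelines(GenericItem(), pipelines))
```

With both pipelines enabled, each item is touched only by the pipeline that recognises its type, so the Google spider's items effectively skip GenericPipeline.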