What is the correct way to nest Item data?
For example, I want a product's output to look like this:
{
    'price': price,
    'title': title,
    'meta': {
        'url': url,
        'added_on': added_on
    }
}
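I have this scrapy.Item:
import scrapy
# Import path for Scrapy 2.2+; older releases shipped TakeFirst in scrapy.loader.processors.
from itemloaders.processors import TakeFirst

class ProductItem(scrapy.Item):
    # Output processors only take effect when the item is filled through an ItemLoader.
    price = scrapy.Field(output_processor=TakeFirst())
    title = scrapy.Field(output_processor=TakeFirst())
    url = scrapy.Field(output_processor=TakeFirst())
    added_on = scrapy.Field(output_processor=TakeFirst())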
Right now I just reformat the whole item in the pipeline according to a new item template:
class FormatedItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    meta = scrapy.Field()
and in the pipeline:
def process_item(self, item, spider):
    # Copy the flat fields over and group url/added_on under 'meta'.
    formated_item = FormatedItem()
    formated_item['title'] = item['title']
    formated_item['price'] = item['price']
    formated_item['meta'] = {
        'url': item['url'],
        'added_on': item['added_on']
    }
    return formated_item
Is this the correct way to approach this, or is there a more straightforward way that doesn't break the philosophy of the framework?
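For reference, the reshaping step described above can be isolated from Scrapy entirely. This is only a minimal sketch using plain dicts; `nest_fields` and the sample values are hypothetical names made up for illustration, not part of Scrapy.

```python
def nest_fields(flat, nested_key, keys):
    """Return a copy of `flat` with the given `keys` moved under `nested_key`."""
    # Fields that stay at the top level.
    result = {k: v for k, v in flat.items() if k not in keys}
    # Fields that move into the nested mapping (missing ones are skipped).
    result[nested_key] = {k: flat[k] for k in keys if k in flat}
    return result


# Hypothetical flat product data, shaped like the item in the question.
flat = {
    "title": "Widget",
    "price": "9.99",
    "url": "http://example.com/widget",
    "added_on": "2015-01-01",
}
nested = nest_fields(flat, "meta", ("url", "added_on"))
```

Since a Scrapy item can be converted with `dict(item)`, a pipeline's `process_item` could presumably call `nest_fields(dict(item), 'meta', ('url', 'added_on'))` and return the result.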
I think it would be more straightforward to construct the dictionary in the spider. The snippet below fills it in two different ways (a dict literal and a key assignment), both achieving the same result. The only possible dealbreaker is that processors apply to the item['meta'] field as a whole, not to the item['meta']['added_on'] and item['meta']['url'] fields.
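def parse(self, response):
    item = MyItem()  # any Item class that declares a 'meta' field
    # Way 1: assign the whole dict at once.
    item['meta'] = {'added_on': response.css("a::text").extract()[0]}
    # Way 2: add a key to the existing dict.
    item['meta']['url'] = response.xpath("//a/@href").extract()[0]
    return item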
Is there a specific reason why you want to construct it that way instead of unpacking the meta field?
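Because of the processor caveat above, the nested values have to be cleaned by hand before they go into the dictionary. Here is a minimal sketch of that, assuming the extracted values arrive as lists; `take_first` is a hand-written stand-in that mimics what a TakeFirst processor does, and `build_meta` is a hypothetical helper, neither is Scrapy API.

```python
def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    # Nothing usable was extracted.
    return None


def build_meta(url_values, added_on_values):
    """Build the nested 'meta' dict from two lists of extracted values."""
    return {
        "url": take_first(url_values),
        "added_on": take_first(added_on_values),
    }


# Hypothetical extraction results: an empty string, then a real link; no date found.
meta = build_meta(["", "http://example.com/a"], [])
```

Unlike indexing with `[0]`, this does not raise when a selector matches nothing; the missing value simply comes back as `None`.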
UPDATE from comments: It looks like nested loaders are the updated approach. Another comment suggests this approach will cause errors during serialization.
The best way to approach this is to create a main and a meta item class/loader.
from scrapy.item import Item, Field
# Current import paths (Scrapy 2.2+); very old releases used scrapy.contrib.loader instead.
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst

class MetaItem(Item):
    url = Field()
    added_on = Field()

class MainItem(Item):
    price = Field()
    title = Field()
    # The nested item is stored as the value of this field.
    meta = Field(serializer=MetaItem)

class MainItemLoader(ItemLoader):
    default_item_class = MainItem
    default_output_processor = TakeFirst()

class MetaItemLoader(ItemLoader):
    default_item_class = MetaItem
    default_output_processor = TakeFirst()
Sample usage:
from scrapy import Spider
from qwerty.items import MainItemLoader, MetaItemLoader

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com"]

    def parse(self, response):
        # Passing the response lets the loader build its own selector.
        mainloader = MainItemLoader(response=response)
        mainloader.add_value('title', 'test')
        mainloader.add_value('price', 'price')
        # The loaded MetaItem becomes the value of the 'meta' field.
        mainloader.add_value('meta', self.get_meta(response))
        return mainloader.load_item()

    def get_meta(self, response):
        metaloader = MetaItemLoader(response=response)
        metaloader.add_value('url', response.url)
        metaloader.add_value('added_on', 'now')
        return metaloader.load_item()
After that, you can easily expand your items in the future by creating more "sub-items."
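Regarding the serialization concern raised in the comments: one way to sidestep it is to flatten nested items into plain dicts and lists before exporting. The following is only a sketch under the assumption that items behave like mappings (which, as far as I know, Scrapy items do); `to_plain` is a hypothetical helper, and the sample data uses plain dicts so that it runs without Scrapy.

```python
import json
from collections.abc import Mapping


def to_plain(value):
    """Recursively convert mappings and sequences into plain dicts and lists."""
    if isinstance(value, Mapping):
        return {key: to_plain(val) for key, val in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(val) for val in value]
    # Scalars (str, int, None, ...) are returned unchanged.
    return value


# Hypothetical nested data shaped like MainItem with a MetaItem inside.
nested = {
    "title": "test",
    "price": "price",
    "meta": {"url": "http://example.com", "added_on": "now"},
    "tags": ("a", "b"),
}
plain = to_plain(nested)
encoded = json.dumps(plain, sort_keys=True)
```

With real items, calling `to_plain(item)` in a pipeline just before export should give the same kind of result, since only the mapping interface is used.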