I'm pretty new to Scrapy. I know that items are used to hold scraped data, but I can't understand the difference between items and item loaders. I tried to read some example code, and it used item loaders to store data instead of items, and I can't understand why. The Scrapy documentation wasn't clear enough for me. Can anyone give a simple explanation (ideally with an example) of when item loaders are used and what additional facilities they provide over items?
Item Loaders provide a convenient mechanism for populating scraped items. Even though items can be populated directly, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.
An Item in Scrapy is a logical grouping of extracted data points from a website that represents a real-world thing. You do not have to make use of Scrapy Items right away, as we saw in earlier Scrapy tutorials. You can simply yield page elements as they are extracted and do with the data as you wish.
class scrapy.Field([arg]): The Field class is just an alias to the built-in dict class and doesn't provide any extra functionality or attributes. In other words, Field objects are plain old Python dicts. A separate class is used to support the item declaration syntax based on class attributes.
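To make the "plain dict" point concrete, here is a minimal sketch. It deliberately uses a stand-in `Field` class instead of importing Scrapy, so it runs anywhere; the stand-in is an assumption that simply mirrors the description above, and `serializer` is only an example metadata key.

```python
# Stand-in for scrapy.Field, assuming it behaves as described above:
# a dict subclass that adds no behavior of its own.
class Field(dict):
    """Container for field metadata."""


# Keyword arguments become ordinary dict entries (the field's metadata).
age = Field(serializer=int)

print(isinstance(age, dict))      # the field is a dict
print(age["serializer"] is int)   # metadata is read back with normal dict access
print(Field() == {})              # a field without metadata is just an empty dict
```

Nothing in the field itself cleans or validates data; it only carries metadata that other components may choose to read.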
Scrapy is a web scraping library used to scrape, parse and collect web data. Scraped data is then handled in the pipelines.py file, which defines a series of components (item pipeline classes) that are executed sequentially.
I really like the official explanation in the docs:
Item Loaders provide a convenient mechanism for populating scraped Items. Even though Items can be populated using their own dictionary-like API, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.
In other words, Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container.
That last paragraph should answer your question.
Item Loaders are great because they give you a lot of processing shortcuts and let you reuse cleanup code, which keeps everything tidy and understandable.
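The container-versus-mechanism split can be sketched in a few lines of plain Python. This is a toy model, not Scrapy's `ItemLoader`: the names `MiniLoader`, `PersonLoader`, `output_processors` and `join_clean` are all made up for illustration, and only the idea (collect raw values first, clean them once per field when the item is built) is taken from the explanation above.

```python
from collections import defaultdict


class MiniLoader:
    """Toy loader (NOT Scrapy's ItemLoader): collect raw values per field,
    then run each field's output processor once when the item is built."""

    # Hypothetical registry: field name -> output processor (a plain callable).
    output_processors = {}

    def __init__(self):
        self._values = defaultdict(list)

    def add_value(self, field, value):
        # Raw values are only collected here; nothing is cleaned yet.
        values = value if isinstance(value, list) else [value]
        self._values[field].extend(values)

    def load_item(self):
        # Apply the per-field processor; fall back to the raw list of values.
        item = {}
        for field, values in self._values.items():
            process = self.output_processors.get(field, lambda v: v)
            item[field] = process(values)
        return item


def join_clean(values):
    """Strip each string, drop the empty ones, join the rest with spaces."""
    return " ".join(v.strip() for v in values if v.strip())


class PersonLoader(MiniLoader):
    # Declared once, reused by every callback that fills this kind of item.
    output_processors = {
        "full_name": join_clean,
        "age": lambda values: int(values[0]),
    }


loader = PersonLoader()
loader.add_value("full_name", ["John\n", "\n\t ", " Snow"])
loader.add_value("age", "30")
print(loader.load_item())  # {'full_name': 'John Snow', 'age': 30}
```

The point of the design is that the cleanup rules live in one place (the loader class), while the code that extracts values only says which field each value belongs to.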
Comparison example. Let's say we want to scrape this item:
from scrapy import Item, Field

class MyItem(Item):
    full_name = Field()
    bio = Field()
    age = Field()
    weight = Field()
    height = Field()
The Item-only approach would look something like this:
def parse(self, response):
    item = MyItem()
    full_name = response.xpath("//div[contains(@class,'name')]/text()").extract()
    # e.g. returns an ugly ['John\n', '\n\t ', ' Snow']
    item['full_name'] = ' '.join(i.strip() for i in full_name if i.strip())
    bio = response.xpath("//div[contains(@class,'bio')]/text()").extract()
    item['bio'] = ' '.join(i.strip() for i in bio if i.strip())
    # extract_first(0) falls back to 0 when the element is missing
    age = response.xpath("//div[@class='age']/text()").extract_first(0)
    item['age'] = int(age)
    weight = response.xpath("//div[@class='weight']/text()").extract_first(0)
    item['weight'] = int(weight)
    height = response.xpath("//div[@class='height']/text()").extract_first(0)
    item['height'] = int(height)
    return item
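versus the Item Loader approach:
# define once in items.py
from scrapy.loader import ItemLoader
# newer Scrapy releases deprecate this module in favor of itemloaders.processors
from scrapy.loader.processors import Compose, MapCompose, Join, TakeFirst

# strip every string, drop the ones that end up empty, join the rest with spaces
clean_text = Compose(MapCompose(lambda v: v.strip() or None), Join())
# take the first extracted value and convert it to int
to_int = Compose(TakeFirst(), int)

class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    full_name_out = clean_text
    bio_out = clean_text
    age_out = to_int
    weight_out = to_int
    height_out = to_int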
# reuse the loader in as many callbacks and spiders as you want
def parse(self, response):
    loader = MyItemLoader(response=response)
    loader.add_xpath('full_name', "//div[contains(@class,'name')]/text()")
    loader.add_xpath('bio', "//div[contains(@class,'bio')]/text()")
    loader.add_xpath('age', "//div[@class='age']/text()")
    loader.add_xpath('weight', "//div[@class='weight']/text()")
    loader.add_xpath('height', "//div[@class='height']/text()")
    return loader.load_item()
As you can see, the Item Loader version is much cleaner and easier to scale. Say you have 20 more fields, many of which share the same processing logic: writing all of that by hand without Item Loaders would be painful and error-prone. Item Loaders are awesome and you should use them!
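If the processor chaining in `clean_text` and `to_int` looks opaque, the following sketch may help. It re-creates the four processors as small plain-Python functions so it runs without Scrapy. These stand-ins (`map_compose`, `compose`, `join`, `take_first`) are simplified approximations written for this answer, not the library's implementations: for instance, they skip the flattening of iterables that the real `MapCompose` performs, and they do not stop early on `None` the way the real `Compose` does by default. The strip step here returns `None` for whitespace-only strings so that those fragments are dropped before joining.

```python
# Simplified stand-ins for Compose, MapCompose, Join and TakeFirst.
# They only approximate the documented behavior of the real processors.

def map_compose(*funcs):
    """Apply each function to every value in turn; drop values that become None."""
    def process(values):
        for func in funcs:
            values = [out for out in (func(v) for v in values) if out is not None]
        return values
    return process


def compose(*funcs):
    """Pass the whole value through each function, one after another."""
    def process(value):
        for func in funcs:
            value = func(value)
        return value
    return process


def join(separator=" "):
    """Join a list of strings into a single string."""
    return lambda values: separator.join(values)


def take_first():
    """Return the first value that is neither None nor an empty string."""
    def process(values):
        for value in values:
            if value is not None and value != "":
                return value
    return process


clean_text = compose(map_compose(lambda v: v.strip() or None), join())
to_int = compose(take_first(), int)

print(clean_text(["John\n", "\n\t ", " Snow"]))  # John Snow
print(to_int(["42", "17"]))                      # 42
```

Read this way, `clean_text` is "clean every fragment, then merge them", and `to_int` is "pick the first hit, then convert it", which is exactly the logic the Item-only version repeats by hand for each field.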