 

Items vs item loaders in scrapy

I'm pretty new to Scrapy. I know that Items are used to hold scraped data, but I can't understand the difference between Items and Item Loaders. I tried to read some example code; it used Item Loaders instead of Items and I can't understand why. The Scrapy documentation wasn't clear enough for me. Can anyone give a simple explanation (ideally with an example) of when Item Loaders are used and what additional facilities they provide over Items?

Airbear asked Aug 24 '16

People also ask

What is Item loader in Scrapy?

Item Loaders provide a convenient mechanism for populating scraped items. Even though items can be populated directly, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.

What are items in Scrapy?

An Item in Scrapy is a logical grouping of extracted data points from a website that represents a real-world thing. You do not have to make use of Scrapy Items right away, as we saw in earlier Scrapy tutorials. You can simply yield page elements as they are extracted and do with the data as you wish.

What is field in Scrapy?

class scrapy.Field([arg]) — The Field class is just an alias to the built-in dict class and doesn't provide any extra functionality or attributes. In other words, Field objects are plain-old Python dicts. A separate class is used to support the item declaration syntax based on class attributes.
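Since Field is just a thin dict subclass, a minimal stand-in (a sketch for illustration, not Scrapy's actual source) shows why per-field metadata works the way it does:

```python
class Field(dict):
    """Sketch of Scrapy's Field: a plain dict used to hold per-field metadata."""
    pass

# metadata keys are free-form; 'serializer' here is just an illustrative key
name = Field(serializer=str)
print(name['serializer'] is str)  # → True
```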

What are Scrapy pipelines?

Scrapy is a web scraping library used to scrape, parse and collect web data. Scraped items are handled in a pipelines.py file by various components (classes) which are executed sequentially.
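As a sketch of that idea (plain Python, no Scrapy imports; the class and field names here are made up for illustration), a pipeline is just a class with a process_item method that receives each scraped item in turn:

```python
class NormalizeNamePipeline:
    """Illustrative pipeline: normalizes a 'full_name' field on each item."""

    def process_item(self, item, spider):
        # item behaves like a dict; return it to pass it on to the next pipeline
        item['full_name'] = item['full_name'].strip().lower()
        return item

# usage sketch: Scrapy would call this for every item yielded by a spider
pipeline = NormalizeNamePipeline()
print(pipeline.process_item({'full_name': '  John Snow '}, spider=None))
# → {'full_name': 'john snow'}
```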


1 Answer

I really like the official explanation in the docs:

Item Loaders provide a convenient mechanism for populating scraped Items. Even though Items can be populated using their own dictionary-like API, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.

In other words, Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container.

That last sentence should answer your question.
Item Loaders are great because they give you processing shortcuts and let you reuse a lot of code, keeping everything tidy, clean and understandable.

For comparison, let's say we want to scrape this item:

class MyItem(Item):
    full_name = Field()
    bio = Field()
    age = Field()
    weight = Field()
    height = Field()

An Item-only approach would look something like this:

def parse(self, response):
    item = MyItem()
    full_name = response.xpath("//div[contains(@class,'name')]/text()").extract()
    # i.e. returns ugly ['John\n', '\n\t  ', '  Snow']
    item['full_name'] = ' '.join(i.strip() for i in full_name if i.strip())
    bio = response.xpath("//div[contains(@class,'bio')]/text()").extract()
    item['bio'] = ' '.join(i.strip() for i in bio if i.strip())
    age = response.xpath("//div[@class='age']/text()").extract_first(0)
    item['age'] = int(age)
    weight = response.xpath("//div[@class='weight']/text()").extract_first(0)
    item['weight'] = int(weight)
    height = response.xpath("//div[@class='height']/text()").extract_first(0)
    item['height'] = int(height)
    return item

vs Item Loaders approach:

# define once in items.py
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Compose, MapCompose, Join, TakeFirst

clean_text = Compose(MapCompose(lambda v: v.strip()), Join())
to_int = Compose(TakeFirst(), int)

class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    full_name_out = clean_text
    bio_out = clean_text
    age_out = to_int
    weight_out = to_int
    height_out = to_int

# parse as many different places and times as you want  
def parse(self, response):
    loader = MyItemLoader(selector=response)
    loader.add_xpath('full_name', "//div[contains(@class,'name')]/text()")
    loader.add_xpath('bio', "//div[contains(@class,'bio')]/text()")
    loader.add_xpath('age', "//div[@class='age']/text()")
    loader.add_xpath('weight', "//div[@class='weight']/text()")
    loader.add_xpath('height', "//div[@class='height']/text()")
    return loader.load_item()

As you can see, the Item Loader approach is much cleaner and easier to scale. Say you have 20 more fields, many of which share the same processing logic; it would be painful to handle them without Item Loaders. Item Loaders are awesome and you should use them!
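To see what those output processors actually do without running a spider, here is a pure-Python mimic (an illustration, not the scrapy.loader.processors implementation; this version also drops empty strings, matching the Item-only snippet above) of the clean_text and to_int chains:

```python
def clean_text(values):
    # mimics Compose(MapCompose(lambda v: v.strip()), Join()):
    # strip every extracted string, then join the pieces with spaces
    return ' '.join(v.strip() for v in values if v.strip())

def to_int(values):
    # mimics Compose(TakeFirst(), int): take the first value and cast it
    return int(values[0])

print(clean_text(['John\n', '\n\t  ', '  Snow']))  # → John Snow
print(to_int(['27']))                              # → 27
```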

Granitosaurus answered Oct 18 '22