Items vs item loaders in scrapy

Tags: python, web-scraping, scrapy, scrapy-spider

I'm pretty new to Scrapy. I know that Items are used to hold scraped data, but I can't understand the difference between Items and Item Loaders. I tried reading some example code, and it used Item Loaders instead of populating Items directly, and I can't understand why. The Scrapy documentation wasn't clear enough for me. Can anyone give a simple explanation (ideally with an example) of when Item Loaders are used and what additional facilities they provide over Items?

772

asked Aug 24 '16 15:08

Airbear

1 Answer

I really like the official explanation in the docs:

Item Loaders provide a convenient mechanism for populating scraped Items. Even though Items can be populated using their own dictionary-like API, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.

In other words, Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container.

The last paragraph should answer your question.
Item Loaders are great because they let you attach reusable processing steps (stripping, joining, type conversion) to fields once, instead of repeating that cleanup in every callback. That keeps the spider code tidy, clean and understandable.

A comparison example. Let's say we want to scrape this item:

from scrapy import Item, Field

class MyItem(Item):
    full_name = Field()
    bio = Field()
    age = Field()
    weight = Field()
    height = Field()

An Item-only approach would look something like this:

def parse(self, response):
    item = MyItem()
    full_name = response.xpath("//div[contains(@class,'name')]/text()").extract()
    # i.e. returns ugly ['John\n', '\n\t  ', '  Snow']
    item['full_name'] = ' '.join(i.strip() for i in full_name if i.strip())
    bio = response.xpath("//div[contains(@class,'bio')]/text()").extract()
    item['bio'] = ' '.join(i.strip() for i in bio if i.strip())
    # extract_first(0) falls back to 0 when the node is missing
    age = response.xpath("//div[@class='age']/text()").extract_first(0)
    item['age'] = int(age)
    weight = response.xpath("//div[@class='weight']/text()").extract_first(0)
    item['weight'] = int(weight)
    height = response.xpath("//div[@class='height']/text()").extract_first(0)
    item['height'] = int(height)
    return item

versus the Item Loader approach:

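# define once in items.py
# (newer Scrapy releases provide these processors in itemloaders.processors;
# scrapy.loader.processors is the older location)
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Compose, MapCompose, Join, TakeFirst

# strip each string, drop the ones that end up empty, then join with spaces
clean_text = Compose(MapCompose(lambda v: v.strip() or None), Join())
# take the first non-empty value and convert it to int
to_int = Compose(TakeFirst(), int)

class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    full_name_out = clean_text
    bio_out = clean_text
    age_out = to_int
    weight_out = to_int
    height_out = to_int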

# parse as many different places and times as you want  
def parse(self, response):
    loader = MyItemLoader(selector=response)
    loader.add_xpath('full_name', "//div[contains(@class,'name')]/text()")
    loader.add_xpath('bio', "//div[contains(@class,'bio')]/text()")
    loader.add_xpath('age', "//div[@class='age']/text()")
    loader.add_xpath('weight', "//div[@class='weight']/text()")
    loader.add_xpath('height', "//div[@class='height']/text()")
    return loader.load_item()
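
If the processors feel like magic, here is a rough pure-Python sketch of what clean_text and to_int do to the extracted values. The helper functions (take_first, map_strip, join) are simplified stand-ins I made up for illustration, not Scrapy's actual implementation, and the sample values are invented:

```python
# Simplified stand-ins for illustration only; not Scrapy's real processors.

def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def map_strip(values):
    """Strip every string and drop the ones that end up empty."""
    return [v.strip() for v in values if v.strip()]


def join(values, separator=" "):
    """Join the cleaned strings into a single string."""
    return separator.join(values)


# What an XPath text() query typically hands back: a list of raw strings.
raw_name = ["John\n", "\n\t  ", "  Snow"]
raw_age = ["42"]

full_name = join(map_strip(raw_name))  # roughly what clean_text does
age = int(take_first(raw_age))         # roughly what to_int does

print(repr(full_name), age)
```

The point is that the loader collects a list of raw values per field and the output processor turns that list into the final value when load_item() is called.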

As you can see, the Item Loader version is much cleaner and easier to scale. Say you have 20 more fields, many of which share the same processing logic: without Item Loaders you would be repeating the same cleanup code for every one of them. Item Loaders are awesome and you should use them!

101

answered Oct 18 '22 11:10

Granitosaurus
