Using Scrapy ItemLoader in a loop

Question

I want to use Scrapy on the Dmoz website used in the Scrapy tutorial, but instead of just reading the books in the books URL (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) by using the Item/Field pairs, I want to create an ItemLoader that will read in the desired values (title, link, description).

This is my items.py file:

from scrapy.item import Item, Field
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Identity


class DmozItem(Item):
    title = Field(
        output_processor=Identity()
        )
    link = Field(
        output_processor=Identity()
        )
    desc = Field(
        output_processor=Identity()
        )


class MainItemLoader(ItemLoader):
    default_item_class = DmozItem
    default_output_processor = Identity()

And my spider file:

import scrapy
from scrapy.spiders import Spider
from scrapy.loader import ItemLoader
from tutorial.items import MainItemLoader, DmozItem 
from scrapy.selector import Selector


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="site-item "]/div[@class="title-and-desc"]'):
            l = MainItemLoader(response=response)
            l.add_xpath('title', '/a/div[@class="site-title"]/text()')
            l.add_xpath('link', '/a/@href')
            l.add_xpath('desc', '/div[@class="site-descr "]/text()')
            yield l.load_item()

I have tried a number of different alternatives. I suspect the main issue is in the "response=response" part of the ItemLoader constructor call, but I can't make heads or tails of the Scrapy documentation regarding this. Should I be looking at the selector="blah" syntax instead?

If I run this, I get a list of 22 empty brackets (the correct number of books). If I change the first slash in each add_xpath line to a double slash, I get 22 identical lists containing ALL the data (unsurprisingly).

How can I write this so the ItemLoader will make a new list containing the desired fields for each different book?

Thank you!

alecxe · Accepted Answer

You need to have your ItemLoader work on the specific selector from the loop, not on the whole response:

l = MainItemLoader(selector=sel)
l.add_xpath('title', './a/div[@class="site-title"]/text()')
l.add_xpath('link', './a/@href')
l.add_xpath('desc', './div[@class="site-descr "]/text()')
yield l.load_item()

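Also note the leading dots in the XPath expressions. A path starting with ./ is evaluated relative to the current selector, whereas a leading / starts at the document root and a leading // searches the entire document, which is why the original attempts returned either nothing or everything.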

Tags:

python

web-scraping

scrapy

Paulo Black

1 Answer

alecxe
