
Using Scrapy Itemloader in a loop

I want to use Scrapy on the Dmoz website used in the Scrapy tutorial, but instead of just reading the books at the books URL (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) with plain Item/Field pairs, I want to create an ItemLoader that reads in the desired values (title, link, description).

This is my items.py file:

from scrapy.item import Item, Field
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Identity


class DmozItem(Item):
    title = Field(
        output_processor=Identity()
        )
    link = Field(
        output_processor=Identity()
        )
    desc = Field(
        output_processor=Identity()
        )


class MainItemLoader(ItemLoader):
    default_item_class = DmozItem
    default_output_processor = Identity()

And my spider file:

import scrapy
from scrapy.spiders import Spider
from scrapy.loader import ItemLoader
from tutorial.items import MainItemLoader, DmozItem 
from scrapy.selector import Selector


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="site-item "]/div[@class="title-and-desc"]'):
            l = MainItemLoader(response=response)
            l.add_xpath('title', '/a/div[@class="site-title"]/text()')
            l.add_xpath('link', '/a/@href')
            l.add_xpath('desc', '/div[@class="site-descr "]/text()')
            yield l.load_item()

I have tried a number of alternatives. I suspect the main issue is the "response=response" argument in the ItemLoader constructor, but I can't make heads or tails of the Scrapy documentation on this. Should I be looking at the selector="blah" syntax instead?

If I run this, I get a list of 22 empty items (22 is the correct number of books). If I change the leading slash in each add_xpath line to a double slash, I get 22 identical items, each containing ALL the data (unsurprisingly).

How can I write this so that the ItemLoader builds a new item containing the desired fields for each individual book?

Thank you!

Paulo Black asked Jun 06 '16 13:06


1 Answer

You need to let your ItemLoader work inside a specific selector, not the whole response:

l = MainItemLoader(selector=sel)
l.add_xpath('title', './a/div[@class="site-title"]/text()')
l.add_xpath('link', './a/@href')
l.add_xpath('desc', './div[@class="site-descr "]/text()')
yield l.load_item()

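Also note the dots at the beginning of the XPath expressions: they make each expression relative to the current selector (sel) rather than to the document root.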

alecxe answered Sep 29 '22 19:09