I want to use Scrapy on the Dmoz website that the Scrapy tutorial uses. Instead of just reading the books at the books URL (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) with plain Item/Field pairs, I want to create an ItemLoader that reads in the desired values (title, link, description).
This is my items.py file:
from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Identity
class DmozItem(Item):
    title = Field(
        output_processor=Identity()
        )
    link = Field(
        output_processor=Identity()
        )
    desc = Field(
        output_processor=Identity()
        )
class MainItemLoader(ItemLoader):
    default_item_class = DmozItem
    default_output_processor = Identity()
And my spider file:
import scrapy
from scrapy.spiders import Spider
from scrapy.loader import ItemLoader
from tutorial.items import MainItemLoader, DmozItem 
from scrapy.selector import Selector
class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ]
    def parse(self, response):
        for sel in response.xpath('//div[@class="site-item "]/div[@class="title-and-desc"]'):
            l = MainItemLoader(response=response)
            l.add_xpath('title', '/a/div[@class="site-title"]/text()')
            l.add_xpath('link', '/a/@href')
            l.add_xpath('desc', '/div[@class="site-descr "]/text()')
            yield l.load_item()
I have tried a number of different alternatives. I suspect the main issue is the "response=response" argument in the ItemLoader constructor call, but I can't make heads or tails of the Scrapy documentation on this. Should I be looking at the selector="blah" syntax instead?
If I run this, I get a list of 22 empty brackets (the correct number of books). If I change the first slash in each add_xpath line to be a double slash, I get 22 identical lists containing ALL the data (unsurprisingly).
How can I write this so the ItemLoader builds a separate item containing the desired fields for each book?
Thank you!
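You need to give your ItemLoader the specific selector for each book, not the whole response. With response=response, every add_xpath call is evaluated against the entire page, so the loader is never tied to the sel of the current loop iteration: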
l = MainItemLoader(selector=sel)
l.add_xpath('title', './a/div[@class="site-title"]/text()')
l.add_xpath('link', './a/@href')
l.add_xpath('desc', './div[@class="site-descr "]/text()')
yield l.load_item()
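Also note the dots at the beginning of the XPath expressions: a leading ./ makes each expression relative to sel, whereas a leading / starts from the document root and a leading // matches anywhere in the document.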