I want to use Scrapy on the Dmoz website that the Scrapy tutorial uses, but instead of just reading the books at the books URL (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) with Item/Field pairs, I want to create an ItemLoader that reads in the desired values (title, link, description).
This is my items.py file:
from scrapy.item import Item, Field
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Identity

class DmozItem(Item):
    title = Field(
        output_processor=Identity()
    )
    link = Field(
        output_processor=Identity()
    )
    desc = Field(
        output_processor=Identity()
    )

class MainItemLoader(ItemLoader):
    default_item_class = DmozItem
    default_output_processor = Identity()
And my spider file:
import scrapy
from scrapy.spiders import Spider
from scrapy.loader import ItemLoader
from tutorial.items import MainItemLoader, DmozItem
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="site-item "]/div[@class="title-and-desc"]'):
            l = MainItemLoader(response=response)
            l.add_xpath('title', '/a/div[@class="site-title"]/text()')
            l.add_xpath('link', '/a/@href')
            l.add_xpath('desc', '/div[@class="site-descr "]/text()')
            yield l.load_item()
I have tried a number of different alternatives. I suspect the main issue is in the "response=response" part of the ItemLoader declaration, but I can't make heads or tails of the Scrapy documentation on this. Should I be looking at the selector="blah" syntax instead?
If I run this, I get a list of 22 empty brackets (the correct number of books). If I change the first slash in each add_xpath line to a double slash, I get 22 identical lists containing ALL the data (unsurprisingly).
How can I write this so that the ItemLoader produces a new list containing the desired fields for each individual book?
Thank you!
You need to let your ItemLoader work inside a specific selector, not the response:
l = MainItemLoader(selector=sel)
l.add_xpath('title', './a/div[@class="site-title"]/text()')
l.add_xpath('link', './a/@href')
l.add_xpath('desc', './div[@class="site-descr "]/text()')
yield l.load_item()
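Also note the dots at the beginning of the XPath expressions: they make each path relative to the current selector (sel), whereas a path starting with a single slash is evaluated from the document root and a path starting with a double slash searches the whole document.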