I am new to Scrapy and I am trying to scrape the IKEA website, starting with the basic page that lists the locations, as given here.
My items.py file is given below:
import scrapy

class IkeaItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
And the spider is given below:
import scrapy
from ikea.items import IkeaItem

class IkeaSpider(scrapy.Spider):
    name = 'ikea'
    allowed_domains = ['http://www.ikea.com/']
    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item
When I run the spider I do not get any useful output. The JSON output file looks something like:
[[{"link": [], "name": []}
The output I am looking for is the name of each location and its link, but both fields come back empty. Where am I going wrong?
When you use text nodes in an XPath string function, use . (dot) instead of .//text(). The expression .//text() produces a collection of text elements called a node-set, and when a node-set is converted to a string only the text of its first element is used.
A link extractor is an object that extracts links from responses. The __init__ method of LxmlLinkExtractor takes settings that determine which links may be extracted. LxmlLinkExtractor.extract_links returns a list of matching Link objects from a Response object.
When you are scraping web pages, you need to extract certain parts of the HTML source using a mechanism called selectors, which are driven by either XPath or CSS expressions. Selectors are built upon the lxml library, which processes XML and HTML in Python.
There is a simple mistake inside the XPath expressions for the item fields. The loop is already going over the a tags, so you don't need to specify a in the inner XPath expressions. In other words, currently you are searching for a tags inside the a tags inside the td inside tr, which obviously results in nothing.
Replace a/text() with text() and a/@href with @href.
(tested - works for me)
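Use this. Note that the leading // searches the whole document rather than the current a element, so each item will contain every name and link on the page:
item['name'] = sel.xpath('//a/text()').extract()
item['link'] = sel.xpath('//a/@href').extract()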