Scrapy: Extract links and text

I am new to Scrapy and am trying to scrape the IKEA website, starting from the basic page that lists its locations.

My items.py file is given below:

import scrapy


class IkeaItem(scrapy.Item):

    name = scrapy.Field()
    link = scrapy.Field()

And the spider is given below:

import scrapy
from ikea.items import IkeaItem


class IkeaSpider(scrapy.Spider):
    name = 'ikea'

    allowed_domains = ['http://www.ikea.com/']

    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()

            yield item

When I run the spider, I get no useful output. The JSON output file looks something like this:

[[{"link": [], "name": []}

The output that I am looking for is the name of the location and the link. I am getting nothing. Where am I going wrong?

praxmon asked Jan 03 '15 09:01


2 Answers

There is a simple mistake in the XPath expressions for the item fields. The loop already iterates over the a tags, so you don't need to specify a again in the inner XPath expressions. In other words, you are currently searching for a tags inside the a tags inside the td inside the tr, which matches nothing.

Replace a/text() with text() and a/@href with @href.

(tested - works for me)

alecxe answered Nov 03 '22 09:11


Use this:

    item['name'] = sel.xpath('//a/text()').extract()
    item['link'] = sel.xpath('//a/@href').extract()
Ganesh answered Nov 03 '22 10:11