Scrapy HtmlXPathSelector

Tags: scrapy

I'm just trying out Scrapy and trying to get a basic spider working. I know this is probably just something I'm missing, but I've tried everything I can think of.

The error I get is:

line 11, in JustASpider
    sites = hxs.select('//title/text()')
NameError: name 'hxs' is not defined

My code is very basic at the moment, but I still can't seem to find where I'm going wrong. Thanks for any help!

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class JustASpider(BaseSpider):
    name = "google.com"
    start_urls = ["http://www.google.com/search?hl=en&q=search"]


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//title/text()')
        for site in sites:
            print site.extract()


SPIDER = JustASpider()
Keanan Koppenhaver asked Sep 03 '12 22:09




1 Answer

The code looks like it was written for a quite old version of Scrapy. I recommend using this code instead:

from scrapy.spiders import Spider  # scrapy.spider was renamed to scrapy.spiders in Scrapy 1.0
from scrapy.selector import Selector

class JustASpider(Spider):
    name = "googlespider"
    allowed_domains = ["google.com"]
    start_urls = ["http://www.google.com/search?hl=en&q=search"]

    def parse(self, response):
        sel = Selector(response)
        # extract() already returns a list of strings, so there is no need
        # to loop over the results just to read the text of the title element.
        sites = sel.xpath('//title/text()').extract()
        print(sites)
Hope it helps, and here is a good example to follow.
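
As for the NameError itself: the traceback says the failing line ran "in JustASpider" rather than "in parse", which suggests that line was executed as part of the class body, not inside the method. One plausible cause (an assumption, since the original file is not shown byte for byte) is inconsistent indentation, for example mixed tabs and spaces. Below is a minimal sketch that reproduces the same message without Scrapy; the class and function names are hypothetical and only there for illustration.

```python
# Minimal reproduction without Scrapy; all names here are hypothetical.
def build_misindented_class():
    """Define a class whose last line sits at class level, not inside parse()."""
    try:
        class MisindentedSpider:
            def parse(self, response):
                hxs = "selector for " + response  # hxs is local to parse()

            # This line is indented at class level, so Python runs it while
            # the class body is being defined, before parse() is ever called.
            sites = hxs
    except NameError as exc:
        return str(exc)
    return None


message = build_misindented_class()
print(message)  # prints: name 'hxs' is not defined
```

If that is what happened, re-indenting the body of parse() consistently (spaces only) should make the original spider get past this error.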
pink bunny answered Sep 27 '22 23:09