
Scrapy - extract nested 'img src' using xPathSelector

I am relatively new to Scrapy, and to Python for that matter. I am looking to extract the 'img src' from a few different links, and I am having trouble with the syntax of the HtmlXPathSelector expression. I have read through the documentation looking for the proper syntax but have yet to figure out a solution.

Here is an example of a link I am trying to extract the 'img src' from:

Page I am trying to extract the img src url from

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class GeekSpider(BaseSpider):
    name = "geekS"
    allowed_domains = ["geek.com"]
    start_urls = ["http://www.geek.com/articles/gadgets/kindle-fire-hd-8-9-on-sale-for-50-off-today-only-20121210/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        imgurl = hxs.select("//div[@class='article']//a/img/@src").extract()
        return imgurl

I think I have figured out the syntax for the hxs.select statement, but since I am new to this syntax/method, I am not sure.

Here is my items.py file; I basically followed the Scrapy tutorial for this:

from scrapy.item import Item, Field

class GeekItem(Item):
    imgsrc = Field()

To clarify: what I am looking to do is extract the img src URL of the one image on the page. I don't need to extract every image src on the page; I have already figured out how to do that (much easier).

I am just looking to narrow it down and only extract that particular url of the img src. (I will be using this across multiple pages on this site)

Any help is greatly appreciated!

EDIT - Updated code: I was getting some syntax errors with geek = geek(), so I changed it slightly to hopefully be easier to understand and to work.

Asked by Twhyler on Dec 15 '12 at 02:12


1 Answer

I believe your XPath expression should be more like this. I tested it on another page (the Amazon shipping center article), and it returned all ten of the clickable images.

geek['imgsrc'] = x.select("//div[@class='article']//a/img/@src").extract()
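
To see what that expression selects without running a crawl, here is a minimal sketch that uses only Python's standard library. The HTML snippet and the helper name linked_article_images are made up for illustration and are not taken from geek.com. ElementTree supports only a subset of XPath, so the src attribute is read with .get() instead of the /@src step that Scrapy accepts.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for the real page.
SAMPLE = """
<html><body>
  <div class="sidebar"><a href="/ad"><img src="http://example.com/ad.png"/></a></div>
  <div class="article">
    <p><a href="/full"><img src="http://example.com/kindle.jpg"/></a></p>
    <img src="http://example.com/unlinked.jpg"/>
  </div>
</body></html>
"""

def linked_article_images(markup):
    """Return the src of every <img> wrapped in an <a> inside div.article."""
    root = ET.fromstring(markup.strip())
    # Same shape as the Scrapy expression, minus the trailing /@src step.
    images = root.findall(".//div[@class='article']//a/img")
    return [img.get("src") for img in images]

print(linked_article_images(SAMPLE))
```

Only the image that sits inside a link within the article div is returned; the sidebar image and the unlinked article image are skipped.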

To fix your other issue, you need to import GeekItem into your GeekSpider code.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from geekspider.items import GeekItem # I'm guessing the name of your project here

class GeekSpider(BaseSpider):
    name = "geekS"
    allowed_domains = ["geek.com"]
    start_urls = ["http://www.geek.com/articles/gadgets/kindle-fire-hd-8-9-on-sale-for-50-off-today-only-20121210/"]

    def parse(self, response):
        item = GeekItem()
        hxs = HtmlXPathSelector(response)
        # Collect the src of every linked image inside the article div.
        item['imgsrc'] = hxs.select("//div[@class='article']//a/img/@src").extract()
        return item
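
Since extract() returns a list, narrowing the result to one particular URL takes an extra step. The following is only a sketch under assumptions: the helper name first_image_url is hypothetical, it assumes the image you want is the first match, and it is written for Python 3 using the standard library rather than any Scrapy API. It also resolves a relative src against the page URL.

```python
from urllib.parse import urljoin

def first_image_url(extracted, page_url):
    """Return the first extracted src as an absolute URL, or None if the list is empty."""
    if not extracted:
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    return urljoin(page_url, extracted[0].strip())
```

Inside parse, this could be used as item['imgsrc'] = first_image_url(urls, response.url), where urls is the list returned by extract().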

Answered by Talvalin on Oct 22 '22 at 01:10