I am relatively new to Scrapy, and to Python for that matter. I am trying to extract the img src from a few different links, and I am having trouble with the syntax of the HtmlXPathSelector expression. I have read a lot of documentation looking for the proper syntax but have yet to figure out a solution.
Here is an example of a link I am trying to extract the 'img src' from:
Page I am trying to extract the img src url from
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class GeekSpider(BaseSpider):
    name = "geekS"
    allowed_domains = ["geek.com"]
    start_urls = ["http://www.geek.com/articles/gadgets/kindle-fire-hd-8-9-on-sale-for-50-off-today-only-20121210/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        imgurl = hxs.select("//div[@class='article']//a/img/@src").extract()
        return imgurl
I think I have figured out the syntax for the hxs.select statement, but since I am new to this syntax/method, I am not sure.
Here is my items.py file; I basically followed the Scrapy tutorial for this:
from scrapy.item import Item, Field

class GeekItem(Item):
    imgsrc = Field()
To clarify: what I want is to extract the img src URL that is on the page. I don't need to extract every image's src, which I have already figured out (much easier).
I am just looking to narrow it down and extract only that particular img src URL. (I will be using this across multiple pages on this site.)
Any help is greatly appreciated!
EDIT - Updated code: I was getting some syntax errors with geek = geek(), so I changed it slightly to hopefully be easier to understand and function.
I believe your xpath expression should be more like this. I tested it on another page (the Amazon shipping center article) and it returned all ten of the clickable images.
geek['imgsrc'] = x.select("//div[@class='article']//a/img/@src").extract()
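As a rough way to see what that expression narrows down to, here is a minimal sketch that runs the same path against a made-up, well-formed snippet using Python's standard-library ElementTree instead of Scrapy. The markup and the example.com URLs are hypothetical, not the real geek.com page, and ElementTree only supports a subset of XPath (it cannot end a path in /@src), so the sketch matches the img elements and reads src from each one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for an article page.
# It is NOT the real geek.com HTML; it only mimics the assumed structure.
PAGE = """
<html>
  <body>
    <div class="sidebar">
      <img src="http://example.com/ad.png" />
    </div>
    <div class="article">
      <p><a href="http://example.com/full-1.jpg"><img src="http://example.com/thumb-1.jpg" /></a></p>
      <p><a href="http://example.com/full-2.jpg"><img src="http://example.com/thumb-2.jpg" /></a></p>
      <img src="http://example.com/unlinked.jpg" />
    </div>
  </body>
</html>
"""


def article_image_urls(markup):
    """Return the src of every img that sits directly inside a link in the article div."""
    root = ET.fromstring(markup)
    # Same shape as //div[@class='article']//a/img/@src, minus the trailing
    # /@src step, which ElementTree's limited XPath does not support.
    images = root.findall(".//div[@class='article']//a/img")
    return [img.get("src") for img in images]


urls = article_image_urls(PAGE)
# If only one URL per page is wanted, take the first match (None if nothing matched).
first_url = urls[0] if urls else None

print(urls)
print(first_url)
```

The sidebar image and the unlinked image are skipped because neither is an img directly inside an a within the article div. Since extract() in the spider also gives back a list, the same indexing pattern should work there if you only want one URL per page.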
To fix your other issue, you need to import GeekItem into your GeekSpider code.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from geekspider.items import GeekItem  # I'm guessing the name of your project here

class GeekSpider(BaseSpider):
    name = "geekS"
    allowed_domains = ["geek.com"]
    start_urls = ["http://www.geek.com/articles/gadgets/kindle-fire-hd-8-9-on-sale-for-50-off-today-only-20121210/"]

    def parse(self, response):
        item = GeekItem()
        hxs = HtmlXPathSelector(response)
        item['imgsrc'] = hxs.select("//div[@class='article']//a/img/@src").extract()
        return item