I am using Scrapy to scrape the images related to a product on amazon.com. How would I parse the image data?
I typically use XPath, but I was not able to locate an XPath for the images (besides the thumbnails). For example, this is how I parse the title:
title = response.xpath('//h1[@id="title"]/span/text()').extract()
The link to the item is: https://www.amazon.com/dp/B01N068GIX?psc=1
Scrapy provides reusable item pipelines for downloading files attached to a particular item (for example, when you scrape products and also want to download their images locally).
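If you want Scrapy to download the extracted URLs for you, a minimal sketch of enabling the built-in Images Pipeline in the project's settings.py might look like this (the storage directory name is just an example; the pipeline also requires Pillow to be installed):

# settings.py -- enable the built-in Images Pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'images'  # local directory where downloaded images are stored

By default the pipeline reads the 'image_urls' field from each yielded item, which is the field name used in the spider further below.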
When working with Scrapy, you first create a Scrapy project and then add a spider that fetches the data: move to the project's spiders folder and create a Python file there, for example gfgfetch.py. A bare-bones sketch of such a file follows.
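As a rough sketch (the file and spider names simply follow the gfgfetch.py example above, and the URL and title XPath come from the question; the full image-extraction spider is shown further below):

# spiders/gfgfetch.py -- minimal skeleton, assuming the project was created with `scrapy startproject`
import scrapy


class GfgFetchSpider(scrapy.Spider):
    name = 'gfgfetch'  # run with `scrapy crawl gfgfetch`
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/dp/B01N068GIX?psc=1']

    def parse(self, response):
        # extract whatever fields you need, e.g. the title XPath from the question
        yield {'title': response.xpath('//h1[@id="title"]/span/text()').extract_first()}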
It seems the images can be extracted from JavaScript that's present in the page source. I used the js2xml library to convert the JavaScript source code to XML (you can learn more about it in Scrapinghub's blog post). The XML can then be used to create a Selector, with which you can extract data as usual. Take a look at this example spider:
# -*- coding: utf-8 -*-
import js2xml
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/dp/B01N068GIX?psc=1']

    def parse(self, response):
        item = dict()
        # grab the <script> block that registers the "ImageBlockATF" data
        js = response.xpath("//script[contains(text(), 'register(\"ImageBlockATF\"')]/text()").extract_first()
        # convert the JavaScript source into an XML tree and wrap it in a Selector
        xml = js2xml.parse(js)
        selector = scrapy.Selector(root=xml)
        # the hiRes image URLs live under the colorImages property
        item['image_urls'] = selector.xpath('//property[@name="colorImages"]//property[@name="hiRes"]/string/text()').extract()
        yield item
If you'd like to test it out, run it like this:
scrapy runspider example.py -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36"
as Amazon seems to block requests carrying Scrapy's default user agent string.
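Alternatively, instead of passing -s on every run, the same user agent can be set once in the project's settings.py (a sketch, reusing the string from the command above):

# settings.py -- set the user agent once instead of passing -s USER_AGENT each time
USER_AGENT = (
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36'
)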