Scrape image data with scrapy

I am using Scrapy to scrape the images related to a product on amazon.com. How would I parse the image data?

I typically use XPath, but I could not find an XPath for the images other than the thumbnails. For example, this is how I parse the title:

title = response.xpath('//h1[@id="title"]/span/text()').extract()

The link to the item is: https://www.amazon.com/dp/B01N068GIX?psc=1

PiccolMan asked Oct 01 '17 22:10



1 Answer

It looks like the image URLs can be extracted from JavaScript embedded in the page source. I used the js2xml library to convert the JavaScript source code to XML (you can learn more about it in Scrapinghub's blog post). The XML can then be used to create a Selector, from which you can extract data with XPath as usual. Take a look at this example spider:

# -*- coding: utf-8 -*-
import js2xml
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/dp/B01N068GIX?psc=1']

    def parse(self, response):
        # The image data sits in an inline script that registers "ImageBlockATF".
        js = response.xpath("//script[contains(text(), 'register(\"ImageBlockATF\"')]/text()").extract_first()
        if js is None:
            # Script not found, e.g. the request was blocked or the markup changed.
            return
        # Convert the JavaScript source into an XML tree that XPath can query.
        xml = js2xml.parse(js)
        selector = scrapy.Selector(root=xml)
        item = dict()
        item['image_urls'] = selector.xpath('//property[@name="colorImages"]//property[@name="hiRes"]/string/text()').extract()
        yield item
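
For readers who want to see what that XPath is picking out without installing js2xml, here is a rough standard-library sketch. It swaps the js2xml approach for a plain regular expression, which is cruder and not what the spider does. SAMPLE_JS is a made-up stand-in shaped after the colorImages/hiRes properties the XPath refers to (the real Amazon markup may differ), and extract_hires_urls is a hypothetical helper name.

```python
import json
import re

# Made-up stand-in for the inline script; the real Amazon markup may differ.
SAMPLE_JS = '''
P.when('A').register("ImageBlockATF", function(A){
    var data = {
        'colorImages': { 'initial': [
            {"hiRes":"https://example.com/images/a_hi.jpg","thumb":"https://example.com/images/a_t.jpg"},
            {"hiRes":null,"thumb":"https://example.com/images/b_t.jpg"},
            {"hiRes":"https://example.com/images/c_hi.jpg","thumb":"https://example.com/images/c_t.jpg"}
        ]}
    };
    return data;
});
'''

# Capture the quoted value after "hiRes"; null entries have no quoted value
# and therefore do not match.
HIRES_RE = re.compile(r'"hiRes"\s*:\s*("(?:[^"\\]|\\.)*")')


def extract_hires_urls(js_source):
    """Return every quoted hiRes value found in js_source, in order."""
    # json.loads decodes escapes such as \/ inside the captured string literal.
    return [json.loads(literal) for literal in HIRES_RE.findall(js_source)]


if __name__ == "__main__":
    for url in extract_hires_urls(SAMPLE_JS):
        print(url)
```

A regex like this breaks as soon as the script changes its quoting or nesting, which is the main reason to prefer parsing the JavaScript properly with js2xml.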

To test it, run it like this:

scrapy runspider example.py -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36"

The USER_AGENT override is there because Amazon seems to block Scrapy based on its user agent string.

Tomáš Linhart answered Sep 22 '22 07:09