I am using Scrapy to scrape the images related to a product on amazon.com. How would I parse the image data?
I typically use XPath, but I was not able to locate an XPath for the images (besides the thumbnails). For example, this is how I parse the title:
title = response.xpath('//h1[@id="title"]/span/text()').extract()
The link to the item is: https://www.amazon.com/dp/B01N068GIX?psc=1
Scrapy provides reusable item pipelines for downloading files attached to a particular item (for example, when you scrape products and also want to download their images locally).
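If you want Scrapy to download the extracted URLs for you, a minimal sketch of enabling the built-in Images Pipeline in the project's settings.py might look like this (the storage directory name is just an example; the pipeline also requires Pillow to be installed):

# settings.py -- enable the built-in Images Pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'images'  # local directory where downloaded images are stored

By default the pipeline reads the 'image_urls' field from each yielded item, which is the field name used in the spider further below.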
When working with Scrapy, you first create a Scrapy project and then add a spider that fetches the data: move to the project's spiders folder and create a Python file there, for example gfgfetch.py. A bare-bones sketch of such a file follows.
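As a rough sketch (the file and spider names simply follow the gfgfetch.py example above, and the URL and title XPath come from the question; the full image-extraction spider is shown further below):

# spiders/gfgfetch.py -- minimal skeleton, assuming the project was created with `scrapy startproject`
import scrapy


class GfgFetchSpider(scrapy.Spider):
    name = 'gfgfetch'  # run with `scrapy crawl gfgfetch`
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/dp/B01N068GIX?psc=1']

    def parse(self, response):
        # extract whatever fields you need, e.g. the title XPath from the question
        yield {'title': response.xpath('//h1[@id="title"]/span/text()').extract_first()}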
It seems the images can be extracted from JavaScript that's present in the page source. I used the js2xml library to convert the JavaScript source code to XML (you can learn more about it in Scrapinghub's blog post). The XML can then be used to create a Selector, with which you can extract data as usual. Take a look at this example spider:
# -*- coding: utf-8 -*-
import js2xml
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/dp/B01N068GIX?psc=1']

    def parse(self, response):
        item = dict()
        # grab the <script> block that registers the "ImageBlockATF" data
        js = response.xpath("//script[contains(text(), 'register(\"ImageBlockATF\"')]/text()").extract_first()
        # convert the JavaScript source into an XML tree and wrap it in a Selector
        xml = js2xml.parse(js)
        selector = scrapy.Selector(root=xml)
        # the hiRes image URLs live under the colorImages property
        item['image_urls'] = selector.xpath('//property[@name="colorImages"]//property[@name="hiRes"]/string/text()').extract()
        yield item
If you'd like to test it out, run it like this:
scrapy runspider example.py -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36"
as Amazon seems to block requests carrying Scrapy's default user agent string.
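Alternatively, instead of passing -s on every run, the same user agent can be set once in the project's settings.py (a sketch, reusing the string from the command above):

# settings.py -- set the user agent once instead of passing -s USER_AGENT each time
USER_AGENT = (
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36'
)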