
Scrape image from amazon with python 3 and beautifulsoup

I need to scrape the main image from an Amazon product page. I stored the ASINs in a list and I build each product page URL in a for loop. I'm trying to scrape the images, but I can't. I tried with this code:

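import re
import sys
import warnings
from urllib import request

import requests
from bs4 import BeautifulSoup as bsoup
from requests_html import HTMLSession

# declare a session object
session = HTMLSession()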

#ignore warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")

urls = ['https://www.amazon.it/gp/bestsellers/apparel/', 'https://www.amazon.it/gp/bestsellers/electronics/', 'https://www.amazon.it/gp/bestsellers/books/']
asins = []
for url in urls:
    content = requests.get(url).content
    decoded_content = content.decode()
    # extend the list so ASINs from earlier URLs are not overwritten
    asins.extend(re.findall(r'/[^/]+/dp/([^\"?]+)', decoded_content))

# The ASIN is the part between dp/ and the next /

for asin in asins:
    site = 'https://www.amazon.it/'
    start = 'dp/'
    end = '/'
    url = site + start + asin + end
    resp1 = requests.get(url).content

    soup = bsoup(resp1, "html.parser")
    body = soup.find("body")
    imgtag = soup.find("img", {"id":"landingImage"})
    imageurl = dict(imgtag.attrs)["src"]
    resp2 = request.urlopen(imageurl)
Asked by Andrea Ventura, Dec 10 '25 16:12


1 Answer

The problem is that the images are loaded dynamically. By inspecting the page, and with the help of the BeautifulSoup documentation, I was able to scrape all the images needed for a given product.

Fetch the page for a given link

I have a class in which I store the data, so I save the page content on the instance:

import urllib.request
from bs4 import BeautifulSoup

def take_page(self, url_page):
    req = urllib.request.Request(
        url_page,
        data=None
    )
    f = urllib.request.urlopen(req)
    page = f.read().decode('utf-8')
    self.page = page
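
Some servers respond differently, or refuse the request, when it arrives with urllib's default user agent. Whether that applies to a given Amazon page is an assumption here, not something stated above. A minimal sketch of attaching a browser-like header follows; `build_request`, `fetch_page`, and the header value are illustrative names, not part of the original code:

```python
import urllib.request

# Hypothetical header value; any browser-like string could be used instead.
DEFAULT_UA = "Mozilla/5.0 (X11; Linux x86_64)"

def build_request(url_page, user_agent=DEFAULT_UA):
    """Return a urllib Request carrying a User-Agent header (no network access yet)."""
    return urllib.request.Request(
        url_page,
        data=None,
        headers={"User-Agent": user_agent},
    )

def fetch_page(url_page):
    """Download the page and return it as text (this performs a network request)."""
    with urllib.request.urlopen(build_request(url_page)) as f:
        return f.read().decode("utf-8")
```

Inside the class, `take_page` could then store the result of `fetch_page(url_page)` in `self.page`.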

Scrape images

The following simple method returns the link to the first image, in the smallest size:

import json

def take_image(self):
    soup = BeautifulSoup(self.page, 'html.parser')
    img_div = soup.find(id="imgTagWrapperId")

    imgs_str = img_div.img.get('data-a-dynamic-image')  # a string in JSON format

    # convert to a dictionary
    imgs_dict = json.loads(imgs_str)
    # each key in the dictionary is an image link, and its value gives the size (print the whole dictionary to inspect it)
    num_element = 0
    first_link = list(imgs_dict.keys())[num_element]
    return first_link
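
To choose a size explicitly rather than by position, the values of the dictionary can be compared. This is a sketch that assumes each value is a `[width, height]` pair, which is what printing the dictionary suggests. `pick_largest` and the sample string are hypothetical and use made-up URLs:

```python
import json

def pick_largest(imgs_str):
    """Return the link whose [width, height] value has the largest area."""
    imgs_dict = json.loads(imgs_str)
    return max(imgs_dict, key=lambda link: imgs_dict[link][0] * imgs_dict[link][1])

# Made-up sample in the same shape as the data-a-dynamic-image attribute
sample = (
    '{"https://example.com/small.jpg": [200, 300],'
    ' "https://example.com/large.jpg": [600, 900]}'
)
print(pick_largest(sample))  # prints https://example.com/large.jpg
```

Swapping `max` for `min` would select the smallest image instead.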

You can adapt these methods to your needs; I think this is all you need to fix your code.

Answered by PIERPAOLO MASELLA, Dec 12 '25 04:12


