Scrape ASIN from Amazon's Search page

I am trying to scrape ASINs from Amazon. Note that this is not about the product detail page (like this: https://www.youtube.com/watch?v=qRVRIh3GZgI) but about the search results page for a keyword (in this example "trimmer"; try this: https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2). The search returns many products, and I am able to scrape all the titles.

What is not visible on the page is the ASIN (a unique Amazon product identifier). While inspecting the HTML, I saw that the href of each title link contains the ASIN. In the example below, the ASIN is B01MSHQ5IQ:

<a class="a-link-normal a-text-normal" href="/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer&amp;qid=1554462204&amp;s=gateway&amp;sr=8-3">

Ending with my question: How can I retrieve all the Product Titles AND ASIN numbers on the page? For example:

Number     Title                       ASIN
 1       Braun, Beardtrimmer          B07JH1LLYR 
 2       TNT Pro Series Waist         B00R84J2PK
 ...     ...                          ...

So far, I am using Scrapy (but I am also open to other Python solutions), and I am able to scrape the titles.

My code so far:

First run in the command line:

scrapy startproject tutorial

Then edit the spider file (see Example 1) and items.py (see Example 2).

Example 1

import scrapy
from tutorial.items import AmazonItem  # project created by "scrapy startproject tutorial"

class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]

    # Use working product URL below
    start_urls = [
        "https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2",
    ]
    # Run with: scrapy crawl AmazonDeals -o Asin_Titles.json

    def parse(self, response):
        items = AmazonItem()
        # collect the text of every element with the a-text-normal class
        Title = response.css('.a-text-normal').css('::text').extract()
        items['title_Products'] = Title
        yield items

As requested by @glhr, adding the items.py code:

Example 2

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class AmazonItem(scrapy.Item):
  # define the fields for your item here like:
  title_Products = scrapy.Field()

helloworld1990 asked Nov 06 '22 18:11


1 Answer

You can get the link to the product by extracting the href attribute of <a class="a-link-normal a-text-normal" href="...">:

Link = response.css('.a-text-normal').css('a::attr(href)').extract()

You can then use a regular expression to extract the ASIN from each link:

(?<=dp/)[A-Z0-9]{10}

The regular expression above will match 10 characters (either uppercase letters or numbers) preceded by dp/. See demo here: https://regex101.com/r/mLMv3k/1
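
As a quick check outside Scrapy, the pattern can be tried on the href from the question with Python's re module. This is only a minimal sketch: the helper name extract_asin is hypothetical, and it returns None for links without a dp/ segment rather than raising an error.

```python
import re

# Same pattern as above: 10 uppercase letters or digits preceded by "dp/".
ASIN_PATTERN = re.compile(r"(?<=dp/)[A-Z0-9]{10}")


def extract_asin(href):
    """Return the ASIN found in href, or None if there is no dp/<ASIN> part.

    extract_asin is a hypothetical helper name, not part of Scrapy.
    """
    match = ASIN_PATTERN.search(href)
    return match.group(0) if match else None


# href copied from the question
href = "/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer"
print(extract_asin(href))            # B01MSHQ5IQ
print(extract_asin("/s?k=trimmer"))  # None
```

Returning None lets the caller decide what to do with links that do not point at a product page.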

Here's a working implementation of the parse() method:

import re  # goes at the top of the spider file

def parse(self, response):
    Link = response.css('.a-text-normal').css('a::attr(href)').extract()
    Title = response.css('span.a-text-normal').css('::text').extract()

    # for each product, create an AmazonItem, populate the fields and yield the item
    for link, title in zip(Link, Title):
        # extract ASIN from link
        match = re.search(r"(?<=dp/)[A-Z0-9]{10}", link)
        if match is None:
            continue  # skip links without a dp/<ASIN> part instead of raising IndexError
        item = AmazonItem()
        item['title_Product'] = title
        item['link_Product'] = link
        item['ASIN_Product'] = match.group(0)
        yield item

This requires extending AmazonItem with new fields:

class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title_Product = scrapy.Field()
    link_Product = scrapy.Field()
    ASIN_Product = scrapy.Field()

Sample output:

{'ASIN_Product': 'B01MSHQ5IQ',
 'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
 'title_Product': 'Philips Norelco Multigroom Series 3000, 13 attachments, '
                  'FFP, MG3750'}
{'ASIN_Product': 'B01MSHQ5IQ',
 'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
 'title_Product': 'Philips Norelco Multi Groomer MG7750/49-23 piece, beard, '
                  'body, face, nose, and ear hair trimmer, shaver, and clipper'}
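
The numbered Number / Title / ASIN table requested in the question could be built from items shaped like the ones above. The following is a rough sketch using plain dicts in place of AmazonItem objects: format_table and the column widths are assumptions made for illustration, and the two rows reuse the titles and ASINs from the question's example table.

```python
def format_table(items):
    """Build a numbered Title/ASIN table from item-like dicts.

    format_table is a hypothetical helper; the column widths are arbitrary.
    """
    lines = ["{:<8}{:<30}{}".format("Number", "Title", "ASIN")]
    for number, item in enumerate(items, start=1):
        # truncate long titles so the ASIN column stays aligned
        title = item["title_Product"][:28]
        lines.append("{:<8}{:<30}{}".format(number, title, item["ASIN_Product"]))
    return "\n".join(lines)


# Plain dicts standing in for scraped items (values from the question's table)
items = [
    {"title_Product": "Braun, Beardtrimmer", "ASIN_Product": "B07JH1LLYR"},
    {"title_Product": "TNT Pro Series Waist", "ASIN_Product": "B00R84J2PK"},
]
print(format_table(items))
```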

Demo: https://repl.it/@glhr/55534679-AmazonSpider

To write the output to a JSON file, simply specify feed export settings in the spider:

class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]
    start_urls = ["https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2"]
    custom_settings = {
            'FEED_URI' : 'Asin_Titles.json',
            'FEED_FORMAT' : 'json'
    }
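
Note that in Scrapy 2.1 and later, FEED_URI and FEED_FORMAT are deprecated in favour of a single FEEDS setting. Assuming such a version is installed, the custom_settings above could be written roughly as follows (a sketch of the dict only, to be placed inside the spider class):

```python
# Assumes Scrapy 2.1 or newer, where FEEDS replaces FEED_URI / FEED_FORMAT.
# Each key is an output path; each value holds the options for that feed.
custom_settings = {
    "FEEDS": {
        "Asin_Titles.json": {"format": "json"},
    },
}
```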

glhr answered Nov 15 '22 10:11