I am trying to scrape ASINs from Amazon. Note that this is not about a product detail page (like in this video: https://www.youtube.com/watch?v=qRVRIh3GZgI), but about the results page you get when you search for a keyword (in this example "trimmer"; try https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2). The results page lists many products, and I am able to scrape all the titles.
What is not visible on the page is the ASIN (Amazon's unique product identifier). While inspecting the HTML, I saw that each product link (href) contains the ASIN. In the example below, the ASIN is B01MSHQ5IQ:
<a class="a-link-normal a-text-normal" href="/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer&qid=1554462204&s=gateway&sr=8-3">
Ending with my question: How can I retrieve all the product titles AND ASINs on the page? For example:
Number  Title                 ASIN
1       Braun, Beardtrimmer   B07JH1LLYR
2       TNT Pro Series Waist  B00R84J2PK
...     ...                   ...
So far I am using Scrapy (but I am also open to other Python solutions), and I am able to scrape the titles.
My code so far:
First run in the command line:
scrapy startproject tutorial
Then adjust the spider file (see Example 1) and items.py (see Example 2).
Example 1
import scrapy
from tutorial.items import AmazonItem  # item class defined in items.py (Example 2)


class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]
    # Use working product URL below
    start_urls = [
        "https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2"
    ]

    # Run with: scrapy crawl AmazonDeals -o Asin_Titles.json
    def parse(self, response):
        items = AmazonItem()
        Title = response.css('.a-text-normal').css('::text').extract()
        items['title_Products'] = Title
        yield items
As requested by @glhr, adding the items.py code:
Example 2
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title_Products = scrapy.Field()
You can get the link to the product by extracting the href attribute of <a class="a-link-normal a-text-normal" href="...">:
Link = response.css('.a-text-normal').css('a::attr(href)').extract()
You can then use a regular expression to extract the ASIN from each link:
(?<=dp/)[A-Z0-9]{10}
The regular expression above matches 10 characters (uppercase letters or digits) preceded by dp/. See demo here: https://regex101.com/r/mLMv3k/1
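To see the pattern in action outside of Scrapy, here is a minimal sketch that uses only Python's re module. The helper name extract_asin is hypothetical (it does not appear in the spider code), and the sample href is the one from the question.

```python
import re
from typing import Optional

# Ten uppercase letters or digits that directly follow "dp/".
ASIN_PATTERN = re.compile(r"(?<=dp/)[A-Z0-9]{10}")


def extract_asin(href: str) -> Optional[str]:
    """Return the ASIN found in a product link, or None if there is none."""
    match = ASIN_PATTERN.search(href)
    return match.group(0) if match else None


href = (
    "/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/"
    "ref=sr_1_3?keywords=trimmer&qid=1554462204&s=gateway&sr=8-3"
)
print(extract_asin(href))            # B01MSHQ5IQ
print(extract_asin("/s?k=trimmer"))  # None
```

Returning None instead of raising makes it easy to skip links that do not point at a product page.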
Here's a working implementation of the parse() method (it needs import re at the top of the spider module):
import re


def parse(self, response):
    Link = response.css('.a-text-normal').css('a::attr(href)').extract()
    Title = response.css('span.a-text-normal').css('::text').extract()
    # for each product, create an AmazonItem, populate the fields and yield the item
    for result in zip(Link, Title):
        # extract ASIN from link; skip links that contain no "dp/<ASIN>" part
        match = re.search(r"(?<=dp/)[A-Z0-9]{10}", result[0])
        if match is None:
            continue
        item = AmazonItem()
        item['title_Product'] = result[1]
        item['link_Product'] = result[0]
        item['ASIN_Product'] = match.group(0)
        yield item
This requires extending AmazonItem with new fields:
class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title_Product = scrapy.Field()
    link_Product = scrapy.Field()
    ASIN_Product = scrapy.Field()
Sample output:
{'ASIN_Product': 'B01MSHQ5IQ',
 'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
 'title_Product': 'Philips Norelco Multigroom Series 3000, 13 attachments, '
                  'FFP, MG3750'}
{'ASIN_Product': 'B01MSHQ5IQ',
 'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
 'title_Product': 'Philips Norelco Multi Groomer MG7750/49-23 piece, beard, '
                  'body, face, nose, and ear hair trimmer, shaver, and clipper'}
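The question asked for a numbered Number / Title / ASIN listing. As a rough sketch, items shaped like the ones above could be turned into such rows with plain Python. The function name to_table_rows is hypothetical, and the input is assumed to be a list of dicts using the title_Product and ASIN_Product field names; the sample rows are the ones from the question.

```python
from typing import Dict, List, Tuple


def to_table_rows(items: List[Dict[str, str]]) -> List[Tuple[int, str, str]]:
    """Number the items from 1 and keep only the title and the ASIN."""
    return [
        (number, item["title_Product"], item["ASIN_Product"])
        for number, item in enumerate(items, start=1)
    ]


# Illustrative input, using the example rows from the question.
items = [
    {"title_Product": "Braun, Beardtrimmer", "ASIN_Product": "B07JH1LLYR"},
    {"title_Product": "TNT Pro Series Waist", "ASIN_Product": "B00R84J2PK"},
]

# Print one tab-separated line per product.
for number, title, asin in to_table_rows(items):
    print(f"{number}\t{title}\t{asin}")
```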
Demo: https://repl.it/@glhr/55534679-AmazonSpider
To write the output to a JSON file, simply specify feed export settings in the spider:
class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]
    start_urls = ["https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2"]
    custom_settings = {
        'FEED_URI': 'Asin_Titles.json',
        'FEED_FORMAT': 'json'
    }
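Note that Scrapy 2.1 and later deprecate FEED_URI and FEED_FORMAT in favor of a single FEEDS setting. Assuming a recent Scrapy version, the equivalent configuration might look like the sketch below. It is written as a plain dict so it runs on its own; in the spider it would be assigned to custom_settings in the class body.

```python
# Feed export settings for Scrapy 2.1+ (assumption: a recent Scrapy version).
# FEEDS maps each output path to the export options for that file.
custom_settings = {
    "FEEDS": {
        "Asin_Titles.json": {"format": "json"},
    },
}
```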