
Get All Reviews From Amazon? Python 3

I am trying to read all the reviews of a product with Python. I have a script, but it does not work.

parser = html.fromstring(page_response)
XPATH_AGGREGATE = '//span[@id="acrCustomerReviewText"]'
XPATH_REVIEW_SECTION_1 = '//div[@data-hook="reviews-content"]'
XPATH_REVIEW_SECTION_2 = '//div[@data-hook="review"]'

XPATH_AGGREGATE_RATING = '//table[@id="histogramTable"]//tr'
XPATH_PRODUCT_NAME = '//h1//span[@id="productTitle"]//text()'
XPATH_PRODUCT_PRICE  = '//span[@id="priceblock_ourprice"]/text()'

raw_product_price = parser.xpath(XPATH_PRODUCT_PRICE)
product_price = ''.join(raw_product_price).replace(',','')

raw_product_name = parser.xpath(XPATH_PRODUCT_NAME)
product_name = ''.join(raw_product_name).strip()
total_ratings  = parser.xpath(XPATH_AGGREGATE_RATING)
reviews = parser.xpath(XPATH_REVIEW_SECTION_1)
if not reviews:
    reviews = parser.xpath(XPATH_REVIEW_SECTION_2)

The page is 'https://www.amazon.com/productreviews/' + asin + '/', where asin is a product ID (e.g., B0718Y23CQ). The reviews list comes back empty. Thanks for any help!


1 Answer

To be honest, I can't find some of the XPath expressions you use anywhere in the page. I have reworked your code to try to help:

from lxml import html
import requests
import json

asin = 'B0718Y23CQ'
# Download the first page of reviews for this product.
page_response = requests.get('https://www.amazon.com/product-reviews/' + asin)
parser = html.fromstring(page_response.content)

# Each review sits in its own <div class="a-section review">.
reviews_html = parser.xpath('//div[@class="a-section review"]')
reviews_arr = []
for review in reviews_html:
    # Relative XPaths (leading ".") search only inside the current review.
    review_dic = {}
    review_dic['title'] = review.xpath('.//a[@data-hook="review-title"]/text()')
    review_dic['rating'] = review.xpath('.//a[@class="a-link-normal"]/@title')
    review_dic['author'] = review.xpath('.//a[@data-hook="review-author"]/text()')
    review_dic['date'] = review.xpath('.//span[@data-hook="review-date"]/text()')
    review_dic['purchase'] = review.xpath('.//span[@data-hook="avp-badge"]/text()')
    review_dic['review_text'] = review.xpath('.//span[@data-hook="review-body"]/text()')
    review_dic['helpful_votes'] = review.xpath('.//span[@data-hook="helpful-vote-statement"]/text()')
    reviews_arr.append(review_dic)
print(json.dumps(reviews_arr, indent=4))

Each review in the output has this structure:

{
    "title": [
        "I find it very useful, I use for anything I need"
    ],
    "rating": [
        "5.0 out of 5 stars"
    ],
    "author": [
        "Nicoletta Delon"
    ],
    "date": [
        "on January 2, 2018"
    ],
    "purchase": [
        "Verified Purchase"
    ],
    "review_text": [
        "I like this a lot. I use it a lot. It's a medium to small size but it holds a lot."
    ],
    "helpful_votes": [
        "\n      One person found this helpful.\n    "
    ]
}

Now you need to clean the results: take the values out of their lists and handle fields that may be empty, and I think you'll have what you need. To get all the reviews, iterate over the pages by adding ?pageNumber=1 to the link and incrementing the number. If you are going to make many requests, you can use proxies to avoid having your IP blocked.
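The cleanup and the page iteration could look roughly like the sketch below. The helper names (join_text, clean_review, review_page_url) are hypothetical and not part of the code above. The sketch assumes every field is a list of strings as returned by xpath(), and that the review pages accept the pageNumber query parameter as described. It does no network access: it only cleans one sample record and builds the page URLs.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.amazon.com/product-reviews/"


def join_text(values):
    """Join the non-empty, stripped strings of an XPath result; None if empty."""
    parts = [value.strip() for value in values if value.strip()]
    return " ".join(parts) if parts else None


def clean_review(raw):
    """Flatten one scraped record so each field is a string or None."""
    return {field: join_text(values) for field, values in raw.items()}


def review_page_url(asin, page_number):
    """Build the URL of one review page (hypothetical helper)."""
    return BASE_URL + asin + "?" + urlencode({"pageNumber": page_number})


# Sample record in the same shape as the scraper output; "purchase" is
# left empty on purpose to show how a missing field is handled.
raw = {
    "title": ["I find it very useful, I use for anything I need"],
    "rating": ["5.0 out of 5 stars"],
    "purchase": [],
    "helpful_votes": ["\n      One person found this helpful.\n    "],
}
cleaned = clean_review(raw)

# URLs for the first three review pages of the example product.
urls = [review_page_url("B0718Y23CQ", n) for n in range(1, 4)]
```

In a real run you would request each URL in turn, parse it as in the code above, and stop once a page yields no reviews. That stopping rule is a suggestion, not something the site guarantees.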

Alex answered Feb 17 '26 11:02
