I am trying to read all the reviews of a product using Python. I have a script, but it does not work.
parser = html.fromstring(page_response)
XPATH_AGGREGATE = '//span[@id="acrCustomerReviewText"]'
XPATH_REVIEW_SECTION_1 = '//div[@data-hook="reviews-content"]'
XPATH_REVIEW_SECTION_2 = '//div[@data-hook="review"]'
XPATH_AGGREGATE_RATING = '//table[@id="histogramTable"]//tr'
XPATH_PRODUCT_NAME = '//h1//span[@id="productTitle"]//text()'
XPATH_PRODUCT_PRICE = '//span[@id="priceblock_ourprice"]/text()'
raw_product_price = parser.xpath(XPATH_PRODUCT_PRICE)
product_price = ''.join(raw_product_price).replace(',','')
raw_product_name = parser.xpath(XPATH_PRODUCT_NAME)
product_name = ''.join(raw_product_name).strip()
total_ratings = parser.xpath(XPATH_AGGREGATE_RATING)
reviews = parser.xpath(XPATH_REVIEW_SECTION_1)
if not reviews:
    reviews = parser.xpath(XPATH_REVIEW_SECTION_2)
The page is 'https://www.amazon.com/productreviews/' + asin + '/', where asin is the product ID (e.g., B0718Y23CQ). The reviews list comes back empty. Thanks for any help!
To be honest, I could not find some of the XPath expressions you use anywhere on the page, so I rewrote your code to try to help:
from lxml import html
import requests
import json
asin = 'B0718Y23CQ'
page_response = requests.get('https://www.amazon.com/product-reviews/'+ asin)
parser = html.fromstring(page_response.content)
reviews_html = parser.xpath('//div[@class="a-section review"]')
reviews_arr = []
for review in reviews_html:
    review_dic = {}
    review_dic['title'] = review.xpath('.//a[@data-hook="review-title"]/text()')
    review_dic['rating'] = review.xpath('.//a[@class="a-link-normal"]/@title')
    review_dic['author'] = review.xpath('.//a[@data-hook="review-author"]/text()')
    review_dic['date'] = review.xpath('.//span[@data-hook="review-date"]/text()')
    review_dic['purchase'] = review.xpath('.//span[@data-hook="avp-badge"]/text()')
    review_dic['review_text'] = review.xpath('.//span[@data-hook="review-body"]/text()')
    review_dic['helpful_votes'] = review.xpath('.//span[@data-hook="helpful-vote-statement"]/text()')
    reviews_arr.append(review_dic)
print(json.dumps(reviews_arr, indent = 4))
Each review in the output has this structure:
{
    "title": [
        "I find it very useful, I use for anything I need"
    ],
    "rating": [
        "5.0 out of 5 stars"
    ],
    "author": [
        "Nicoletta Delon"
    ],
    "date": [
        "on January 2, 2018"
    ],
    "purchase": [
        "Verified Purchase"
    ],
    "review_text": [
        "I like this a lot. I use it a lot. It's a medium to small size but it holds a lot."
    ],
    "helpful_votes": [
        "\n One person found this helpful.\n "
    ]
}
Now you need to clean the results: take the values out of their lists and handle fields that may be empty. After that, I think you'll have what you need.
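The cleanup step could look something like the sketch below. The helper names join_text and clean_review are made up for illustration, and turning the rating into a float is an extra assumption based on the "5.0 out of 5 stars" format shown in the output above.

```python
import re


def join_text(values):
    """Join the text fragments returned by xpath() into one stripped string.

    Returns None when the list is empty or holds only whitespace.
    """
    text = ' '.join(v.strip() for v in values if v.strip())
    return text or None


def clean_review(raw):
    """Flatten one review dict (field -> list of strings) into field -> value."""
    cleaned = {key: join_text(values) for key, values in raw.items()}
    # Assumption: ratings look like "5.0 out of 5 stars"; keep only the number.
    match = re.match(r'(\d+(?:\.\d+)?)', cleaned.get('rating') or '')
    cleaned['rating'] = float(match.group(1)) if match else None
    return cleaned


# Sample shaped like the output above, with one empty field.
sample = {
    'title': ['I find it very useful, I use for anything I need'],
    'rating': ['5.0 out of 5 stars'],
    'purchase': [],
    'helpful_votes': ['\n One person found this helpful.\n '],
}
print(clean_review(sample))
```

With this in place, the loop could append clean_review(review_dic) instead of review_dic, so that missing fields come out as None rather than empty lists.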
To get all the reviews, you have to iterate over the pages: add ?pageNumber=1 to the link and increment the number for each page. If you are going to make many requests, you can use proxies to avoid having your IP blocked.
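A minimal sketch of that page loop, kept independent of the network so it can be tried offline. The names review_page_url and collect_reviews, the max_pages cap, the delay between requests, and the rule of stopping at the first page with no reviews are all assumptions for illustration, not something the site documents.

```python
import time

BASE_URL = 'https://www.amazon.com/product-reviews/'


def review_page_url(asin, page):
    """Build the URL of one page of reviews for the given product ID."""
    return '{}{}?pageNumber={}'.format(BASE_URL, asin, page)


def collect_reviews(asin, fetch_page, max_pages=50, delay=1.0):
    """Collect reviews page by page.

    fetch_page(url) is a callable you supply; it must return the list of
    reviews found at that URL (an empty list when there are none).
    """
    all_reviews = []
    for page in range(1, max_pages + 1):
        page_reviews = fetch_page(review_page_url(asin, page))
        if not page_reviews:
            # Assumption: an empty page means we are past the last page.
            break
        all_reviews.extend(page_reviews)
        time.sleep(delay)  # pause between requests to be polite
    return all_reviews


# Offline demonstration with a fake fetcher standing in for real requests.
fake_pages = {1: ['review a', 'review b'], 2: ['review c']}


def fake_fetch(url):
    page = int(url.rsplit('=', 1)[1])
    return fake_pages.get(page, [])


print(collect_reviews('B0718Y23CQ', fake_fetch, delay=0))
```

In real use, fetch_page would wrap the requests.get call and the XPath loop from the code above. If you go through proxies, requests.get accepts a proxies argument for that.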