I have a simple project: scraping reviews from a tourist site and storing them in an Excel file. Reviews can be in Spanish, Japanese, or any other language, and they sometimes contain special symbols like "❤❤".
I need to store all the data (special symbols can be excluded if they can't be written).
I am able to scrape the data I want and print it to the console as-is (including the Japanese text), but the problem is storing it in the CSV file; it fails with the error message shown below.
I tried opening the file with UTF-8 encoding (as mentioned in the comment below), but then the data is saved as weird symbols that make no sense, and I couldn't find an answer to the problem. Any suggestions?
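(For reference, that attempt was presumably just passing the encoding to open(), something like:)
f = open(file, "w", encoding="utf-8")  # the write succeeds, but Excel then shows garbled characters when the CSV is opened directly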
I am using Python 3.5.3.
My Python code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import re
file = "TajMahalSpanish.csv"
f = open(file, "w")
headers = "rating, title, review\n"
f.write(headers)
pages = 119
pageNumber = 2
option = webdriver.ChromeOptions()
option.add_argument("--incognito")
browser = webdriver.Chrome(executable_path='C:\Program Files\JetBrains\PyCharm Community Edition 2017.1.5\chrome webdriver\chromedriver', chrome_options=option)
browser.get("https://www.tripadvisor.in/Attraction_Review-g297683-d317329-Reviews-Taj_Mahal-Agra_Agra_District_Uttar_Pradesh.html")
time.sleep(10)
browser.find_element_by_xpath('//*[@id="taplc_location_review_filter_controls_0_form"]/div[4]/ul/li[5]/a').click()
time.sleep(5)
browser.find_element_by_xpath('//*[@id="BODY_BLOCK_JQUERY_REFLOW"]/span/div[1]/div/form/ul/li[2]/label').click()
time.sleep(5)
while pages:
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.find_all("div", {"class": "innerBubble"})
    # if a "show more" link is present, expand the truncated reviews and re-parse the page
    showMore = soup.find("span", {"onclick": "widgetEvCall('handlers.clickExpand',event,this);"})
    if showMore:
        browser.find_element_by_xpath("//span[@onclick=\"widgetEvCall('handlers.clickExpand',event,this);\"]").click()
        time.sleep(3)
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        containers = soup.find_all("div", {"class": "innerBubble"})
        showMore = False
    # write one CSV row per review: rating bubble, title, review text
    for container in containers:
        bubble = container.div.div.span["class"][1]
        title = container.div.find("div", {"class": "quote"}).a.span.text
        review = container.find("p", {"class": "partial_entry"}).text
        f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
        print(bubble)
        print(title)
        print(review)
    # move on to the next page of reviews
    browser.find_element_by_xpath("//div[@class='ppr_rup ppr_priv_location_reviews_list']//div[@class='pageNumbers']/span[@data-page-number='" + str(pageNumber) + "']").click()
    time.sleep(5)
    pages -= 1
    pageNumber += 1
f.close()
I am getting the following error:
Traceback (most recent call last):
File "C:/Users/Akshit/Documents/pycharmProjects/spanish.py", line 45, in <module>
f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
File "C:\Users\Akshit\AppData\Local\Programs\Python\Python35\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 10-18: character maps to <undefined>
Process finished with exit code 1
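(The cp1252.py in the traceback is the giveaway: on Windows, open() without an explicit encoding falls back to the locale code page, which can be confirmed with something like:)
import locale
print(locale.getpreferredencoding())  # typically 'cp1252' on Windows, which cannot encode Japanese text or "❤"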
UPDATE
I am trying a workaround for this problem. In the end I need to translate the Japanese reviews to English for the research anyway, so maybe I can use one of the Google APIs to translate the string in the code itself before writing it to the CSV file.
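(A rough sketch of that idea, assuming the unofficial googletrans package, which is not part of the original code:)
from googletrans import Translator

translator = Translator()
review_en = translator.translate(review, dest="en").text  # translate the scraped review before writing it to the CSV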
UPDATE
Found the solution in "Is it possible to force Excel recognize UTF-8 CSV files automatically?", as suggested by @MaartenFabré in the comments.
Basically, from what I understood, the problem is that Excel does not recognise a UTF-8 encoded CSV file on its own, so when I open the CSV file (made via Python) directly in Excel, all the data looks corrupted.
The solution is to save the CSV with a UTF-8 byte-order mark so that Excel detects the encoding automatically:
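(In this script that would be a one-line change when opening the output file, roughly:)
f = open(file, "w", encoding="utf-8-sig")  # "utf-8-sig" prepends a BOM, so Excel opens the CSV with the correct encoding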
Again, thanks to @MaartenFabré for the help!