
How to store non-English strings in an Excel file with Python 3?

I have a simple project that scrapes reviews from a tourist site and stores them in an Excel file. Reviews can be in Spanish, Japanese, or any other language, and they sometimes contain special symbols like "❤❤".

I need to store all the data (special symbols can be excluded if they can't be written).

I am able to scrape the data I want and print it in the console as it is (for example, Japanese text), but writing it to the CSV file fails with the error message shown below.

I tried opening the file with utf-8 encoding (as mentioned in the comment below), but then the data is saved as strange symbols that make no sense, and I couldn't find an answer to the problem. Any suggestions?

I am using Python 3.5.3.

My Python code:

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import re

file = "TajMahalSpanish.csv"
f = open(file, "w")
headers = "rating, title, review\n"
f.write(headers)

pages = 119
pageNumber = 2
option = webdriver.ChromeOptions()
option.add_argument("--incognito")

browser = webdriver.Chrome(executable_path=r'C:\Program Files\JetBrains\PyCharm Community Edition 2017.1.5\chrome webdriver\chromedriver', chrome_options=option)

browser.get("https://www.tripadvisor.in/Attraction_Review-g297683-d317329-Reviews-Taj_Mahal-Agra_Agra_District_Uttar_Pradesh.html")
time.sleep(10)
browser.find_element_by_xpath('//*[@id="taplc_location_review_filter_controls_0_form"]/div[4]/ul/li[5]/a').click()
time.sleep(5)
browser.find_element_by_xpath('//*[@id="BODY_BLOCK_JQUERY_REFLOW"]/span/div[1]/div/form/ul/li[2]/label').click()
time.sleep(5)

while (pages):
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.find_all("div",{"class":"innerBubble"})

    showMore = soup.find("span", {"onclick": "widgetEvCall('handlers.clickExpand',event,this);"})
    if showMore:
        browser.find_element_by_xpath("//span[@onclick=\"widgetEvCall('handlers.clickExpand',event,this);\"]").click()
        time.sleep(3)
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        containers = soup.find_all("div", {"class": "innerBubble"})
        showMore = False

    for container in containers:
        bubble = container.div.div.span["class"][1]
        title = container.div.find("div", {"class": "quote"}).a.span.text
        review = container.find("p", {"class": "partial_entry"}).text
        f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
        print(bubble)
        print(title)
        print(review)
    browser.find_element_by_xpath("//div[@class='ppr_rup ppr_priv_location_reviews_list']//div[@class='pageNumbers']/span[@data-page-number='" + str(pageNumber) + "']").click()
    time.sleep(5)
    pages -= 1
    pageNumber += 1

f.close()

I am getting the following error:

Traceback (most recent call last):
  File "C:/Users/Akshit/Documents/pycharmProjects/spanish.py", line 45, in <module>
    f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
  File "C:\Users\Akshit\AppData\Local\Programs\Python\Python35\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 10-18: character maps to <undefined>

Process finished with exit code 1

UPDATE

I am trying a workaround for this problem. In the end I also need to translate the Japanese reviews to English for the research, so maybe I can use one of Google's APIs to translate the string in the code itself and then write it to the CSV file.

Akshit Agarwal asked Oct 29 '22 04:10


1 Answer

UPDATE

Found the solution in

Is it possible to force Excel recognize UTF-8 CSV files automatically?

as suggested by @MaartenFabré in the comments.

From what I understood, the problem is that Excel has trouble reading CSV files encoded as UTF-8, so when I open the CSV file (created by Python) directly in Excel, all the data appears corrupted.
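To make the two symptoms concrete, here is a minimal sketch. The sample string is made up for illustration, and cp1252 is assumed as the default encoding only because it appears in the traceback in the question. It shows that cp1252 cannot encode the text at all (the original error), and that correct UTF-8 bytes read back with the wrong code page turn into the "weird symbols".

```python
# Hypothetical sample review; any text outside cp1252 behaves the same way.
review = "とても美しい ❤❤"

# 1) The original error: cp1252 has no mapping for these characters.
try:
    review.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# 2) The "weird symbols": valid UTF-8 bytes decoded with a legacy code page.
utf8_bytes = review.encode("utf-8")
misread = utf8_bytes.decode("cp1252", errors="replace")

# 3) The same bytes decoded as UTF-8 come back intact.
round_trip = utf8_bytes.decode("utf-8")

print(cp1252_ok)
print(misread == review)
print(round_trip == review)
```

So the file written with utf-8 encoding is not corrupted; it only looks that way when the program opening it guesses the wrong encoding.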

The solution:

  1. In Python, save the data to a text file (written with UTF-8 encoding) instead of a CSV file.
  2. Open Excel.
  3. Go to import external data and import from the text file.
  4. Select "Delimited" as the file type and "65001: Unicode (UTF-8)" as the file origin.
  5. Select "," as the delimiter (or whichever one you used) and import.
  6. The data is shown correctly in Excel, in proper rows and columns, for every language: Japanese, Spanish, French, etc.
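An alternative to the manual import, discussed in the linked question, is to write the CSV with a UTF-8 byte order mark so that Excel can detect the encoding when the file is opened directly. This is not what I did above, only a sketch: write_reviews and the file name are hypothetical, and whether Excel honours the BOM may depend on the Excel version. It also uses the csv module, which quotes commas and newlines, so the replace(",", "|") workaround would not be needed.

```python
import csv


def write_reviews(path, rows):
    """Write (rating, title, review) rows as UTF-8 CSV with a leading BOM."""
    # "utf-8-sig" prepends the byte order mark; newline="" lets the csv
    # module control line endings, as the csv documentation recommends.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["rating", "title", "review"])
        writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical sample data and output file name.
    sample = [
        ("bubble_50", "Impresionante", "Un lugar único, muy recomendable"),
        ("bubble_40", "素晴らしい", "とても美しい ❤❤"),
    ]
    write_reviews("reviews_sample.csv", sample)
```

In the scraping loop, each f.write(...) call would then become a single writer.writerow([bubble, title, review]) call.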

Again, thanks to @MaartenFabré for the help!

Akshit Agarwal answered Nov 15 '22 05:11