I have a similar problem to the one mentioned here but none of the suggested methods work for me.
I have a medium size utf-8
.csv file with a lot of non-ascii characters.
I am splitting the file by a particular value from one of the columns, and then I'd like to save each of the obtained dataframes as an .xlsx file with the characters preserved.
This doesn't work, as I am getting an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 7: ordinal not in range(128)
Here is what I tried:
xlsxwriter
engine explicitly. This doesn't seem to change anything.Defining a function (below) to change encoding and throw away bad characters. This also doesn't change anything.
def changeencode(data):
cols = data.columns
for col in cols:
if data[col].dtype == 'O':
data[col] = data[col].str.decode('utf-8').str.encode('ascii', 'ignore')
return data
Changing by hand all the offensive chars to some others. Still no effect (the quoted error was obtained after this change).
Encoding the file as utf-16
(which, I believe, is the correct encoding since I want to be able to manipulate the file from within the excel afterwards) doesn't help either.
I believe that the problem is in the file itself (because of 2 and 3) but I have no idea how to get around it. I'd appreciate any help. The beginning of the file is pasted below.
"Submitted","your-name","youremail","phone","miasto","cityCF","innemiasto","languagesCF","morelanguages","wiek","partnerCF","messageCF","acceptance-795","Submitted Login","Submitted From","2015-12-25 14:07:58 +00:00","Zózia kryś","[email protected]","4444444","Wrocław","","testujemy polskie znaki","Polski","testujemy polskie znaki","44","test","test","1","Justyna","99.111.155.132",
EDIT
Some code (one of the versions, without the splitting part):
import pandas as pd
import string
import xlsxwriter
df = pd.read_csv('path-to-file.csv')
with pd.ExcelWriter ('test.xlsx') as writer:
df.to_excel(writer, sheet_name = 'sheet1',engine='xlsxwriter')
You can write any data (lists, strings, numbers etc) to Excel, by first converting it into a Pandas DataFrame and then writing the DataFrame to Excel. To export a Pandas DataFrame as an Excel file (extension: . xlsx, . xls), use the to_excel() method.
read_excel. Read an Excel file into a pandas DataFrame. Supports xls , xlsx , xlsm , xlsb , odf , ods and odt file extensions read from a local filesystem or URL.
Speed - Pandas is much faster than Excel, which is especially noticeable when working with larger quantities of data. Automation - A lot of the tasks that can be achieved with Pandas are extremely easy to automate, reducing the amount of tedious and repetitive tasks that need to be performed daily.
Supposedly this was a bug in the version of pandas
which I was using back then.
Right now, in pandas
ver. 0.19.2, the code below saves the csv from the question without any trouble (and with correct encoding).
NB: openpyxl
module have to be installed on your system.
import pandas as pd
df = pd.read_csv('Desktop/test.csv')
df.to_excel('Desktop/test.xlsx', encoding='utf8')
Try encoding the columns with non-ascii characters as
df['col'] = df['col'].apply(lambda x: unicode(x))
and then save the file to xlsx format with encoding 'utf8'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With