
Pandas: save to excel encoding issue

I have a similar problem to the one mentioned here but none of the suggested methods work for me.

I have a medium-sized UTF-8 .csv file with a lot of non-ASCII characters. I am splitting the file by the value in one of the columns, and then I'd like to save each of the resulting dataframes as an .xlsx file with the characters preserved.

This doesn't work, as I am getting an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 7: ordinal not in range(128)
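
The message means that a byte string was decoded with the ascii codec, which only accepts bytes below 0x80. The byte 0xff never occurs in valid UTF-8, so it may be worth checking what the raw bytes of the file really are. Below is a minimal sketch using only the standard library (Python 3). The helper name `describe_encoding` is made up for illustration, and it only looks at byte order marks and strict UTF-8 validity, not at UTF-32 or legacy code pages.

```python
import codecs


def describe_encoding(raw: bytes) -> str:
    """Guess how a byte string is encoded, based on BOMs and UTF-8 validity."""
    # A byte order mark at the very start is the strongest hint.
    boms = (
        ("utf-8-sig", codecs.BOM_UTF8),
        ("utf-16", codecs.BOM_UTF16_LE),
        ("utf-16", codecs.BOM_UTF16_BE),
    )
    for name, bom in boms:
        if raw.startswith(bom):
            return name
    # No BOM: try a strict UTF-8 decode and report the first offending byte.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"not utf-8: byte 0x{raw[exc.start]:02x} at position {exc.start}"
    return "utf-8"
```

It could be called as `describe_encoding(open('path-to-file.csv', 'rb').read())`. If the result is not plain "utf-8", passing the matching `encoding=` to `pd.read_csv` is the first thing to try.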

Here is what I tried:

  1. Using xlsxwriter engine explicitly. This doesn't seem to change anything.
  2. Defining a function (below) to change encoding and throw away bad characters. This also doesn't change anything.

    def changeencode(data):
        cols = data.columns
        for col in cols:
            if data[col].dtype == 'O':
                data[col] = data[col].str.decode('utf-8').str.encode('ascii', 'ignore')
        return data

  3. Changing all the offending characters by hand to some others. Still no effect (the quoted error was obtained after this change).

  4. Encoding the file as utf-16 (which, I believe, is the correct encoding, since I want to be able to manipulate the file from within Excel afterwards) doesn't help either.

I believe that the problem is in the file itself (because attempts 2 and 3 changed nothing), but I have no idea how to get around it. I'd appreciate any help. The beginning of the file (the header row and the first data row) is pasted below.

"Submitted","your-name","youremail","phone","miasto","cityCF","innemiasto","languagesCF","morelanguages","wiek","partnerCF","messageCF","acceptance-795","Submitted Login","Submitted From",
"2015-12-25 14:07:58 +00:00","Zózia kryś","[email protected]","4444444","Wrocław","","testujemy polskie znaki","Polski","testujemy polskie znaki","44","test","test","1","Justyna","99.111.155.132",

EDIT

Some code (one of the versions, without the splitting part):

import pandas as pd
import string
import xlsxwriter

df = pd.read_csv('path-to-file.csv')

with pd.ExcelWriter('test.xlsx') as writer:
    df.to_excel(writer, sheet_name='sheet1', engine='xlsxwriter')

jjj asked Dec 27 '15 23:12




2 Answers

Apparently this was a bug in the version of pandas I was using back then. As of pandas 0.19.2, the code below saves the CSV from the question without any trouble (and with the correct encoding).
NB: the openpyxl module has to be installed on your system.

import pandas as pd
df = pd.read_csv('Desktop/test.csv')
df.to_excel('Desktop/test.xlsx', encoding='utf8')
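
For the splitting part of the question, something along these lines might work on a current Python 3 setup. This is only a sketch under assumptions: the function name `split_csv_to_xlsx` and its arguments are made up for illustration, the CSV is assumed to really be UTF-8, and openpyxl is assumed to be installed. As far as I know, newer pandas releases no longer accept the `encoding` argument of `to_excel`, so it is left out here, and the engine is selected on the `ExcelWriter` itself.

```python
import re
from pathlib import Path

import pandas as pd


def split_csv_to_xlsx(csv_path, split_column, out_dir):
    """Write one .xlsx file per distinct value found in split_column."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Decode the CSV explicitly as UTF-8; read every column as text and
    # keep empty cells as empty strings instead of NaN.
    df = pd.read_csv(csv_path, encoding="utf-8", dtype=str,
                     keep_default_na=False)

    written = []
    for value, group in df.groupby(split_column, sort=True):
        # Build a file name from the group value; \w is Unicode-aware,
        # so letters such as "ł" are kept as they are.
        safe = re.sub(r"[^\w.-]+", "_", str(value)) or "empty"
        target = out_dir / f"{safe}.xlsx"
        # The engine is chosen here, on the writer.
        with pd.ExcelWriter(target, engine="openpyxl") as writer:
            group.to_excel(writer, sheet_name="sheet1", index=False)
        written.append(target)
    return written
```

For example, `split_csv_to_xlsx('Desktop/test.csv', 'miasto', 'Desktop/out')` would produce one workbook per value in the `miasto` column of the sample file.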

jjj answered Sep 20 '22 22:09


Try converting the columns that contain non-ASCII characters to unicode (note that the unicode built-in exists only in Python 2):

df['col'] = df['col'].apply(lambda x: unicode(x))

and then save the file in xlsx format with encoding 'utf8'.

Siva Arasu answered Sep 20 '22 22:09