Pandas dataframe and character encoding when reading excel file

Tags: python, pandas, character-encoding, excel

I am reading an Excel file that contains several numerical and categorical columns. The column name_string contains characters in a foreign language. When I inspect the name_string column, I get the values I expect, but the foreign characters (which display correctly in the Excel spreadsheet) appear with the wrong encoding. Here is what I have:

import pandas as pd
df = pd.read_excel('MC_simulation.xlsx', 'DataSet', encoding='utf-8')
name_string = df.name_string.unique()
name_string.sort()
name_string

Producing the following:

array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
       u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
       u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
       u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)

In the last line, the correctly encoded name should be Cristina Fernández de Kirchner. Can anybody help me with this issue?


asked May 11 '14 16:05

Luis Miguel

1 Answer

Actually, the data is being parsed correctly into unicode objects, not strs. The u prefix indicates that the objects are unicode. When a list, tuple, or NumPy array is printed, Python shows the repr of each item in the sequence. So instead of seeing the printed version of the unicode, you see its repr:

In [160]: repr(u'Cristina Fern\xe1ndez de Kirchner')
Out[160]: "u'Cristina Fern\\xe1ndez de Kirchner'"

In [156]: print(u'Cristina Fern\xe1ndez de Kirchner')
Cristina Fernández de Kirchner

The purpose of the repr is to provide an unambiguous string representation of each object. The printed version of a unicode can be ambiguous because of invisible or unprintable characters.
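To check that the escaped form and the readable form really are the same text, here is a minimal sketch. It assumes Python 3, where every str is unicode and the built-in ascii() yields an escaped form comparable to the Python 2 repr shown above (without the u prefix). The variable names are only illustrative.

```python
import unicodedata

# The same name as in the question, written with the \xe1 escape.
name = 'Cristina Fern\xe1ndez de Kirchner'

# ascii() escapes every non-ASCII character, much like the Python 2 repr.
escaped = ascii(name)

# str() (which print() uses) leaves the character as it is.
readable = str(name)

# The character at index 13 is a single code point, U+00E1.
accent_name = unicodedata.name(name[13])

print(escaped)      # escaped form, with quotes and \xe1
print(readable)     # human-readable form
print(accent_name)  # official Unicode name of the character
```

The escape \xe1 is just another way of writing the one code point U+00E1, so nothing was lost or mis-decoded when the file was read.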

If you print the DataFrame or Series, however, you'll get the printed version of the unicodes (here np is numpy and pd is pandas):

In [157]: df = pd.DataFrame({'foo':np.array([u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
   .....:        u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
   .....:        u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
   .....:        u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)})

In [158]: df
Out[158]: 
                               foo
0                      4th of July
1                              911
2                             Abab
3                            Abass
4                            Abcar
5                            Abced
6                            Ceded
7                            Cedes
8                           Cedfus
9                           Ceding
10                          Cedtim
11                          Cedtol
12                          Cedxer
13              Chevrolet Corvette
14                    Chuck Norris
15  Cristina Fernández de Kirchner

[16 rows x 1 columns]
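If you only want to look at the unique values themselves, you can print the elements one at a time instead of printing the container. The sketch below assumes Python 3 and uses a short plain list as a stand-in for the array returned by unique(), so it runs without pandas. The names values and listing are illustrative only.

```python
# Stand-in for the array of unique names from the question (shortened).
values = ['Cristina Fern\xe1ndez de Kirchner', 'Chuck Norris', 'Abab']

# Joining the items uses each string's own text rather than its repr,
# so no escape sequences appear in the result.
listing = '\n'.join(sorted(values))

print(listing)
```

The same idea applies to the real array: looping over name_string and printing each element shows the readable text, because print works on the individual item rather than on the sequence.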


answered Oct 17 '22 03:10

unutbu
