Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas Unicode Import Export error with to_excel() read_excel()

Morning.

I have condensed a much larger situation down to the following:

I have one file with a dataframe with some values in it.

df = pd.DataFrame(
{'joe': [['dog'], ['cat'], ['fish'], ['rabbit']], 'ben': [['dog'], ['fish'], ['fish'], ['bear']]})

df:
      ben       joe
0   [dog]     [dog]
1  [fish]     [cat]
2  [fish]    [fish]
3  [bear]  [rabbit]

The type of data contained within this dataframe is as follows:

type(df.iloc[2,1]),df.iloc[2,1]
>>> (list, ['fish'])

When I save the dataframe to excel using pd.to_excel():

writer1 = pd.ExcelWriter('Input Output Test.xlsx')
df.to_excel(writer1,'Sheet1')
writer1.save()

I immediately read this back in to the same file as follows:

dfi = pd.read_excel(open('Input Output Test.xlsx'), sheetname='Sheet1')

I again test the type of the data:

type(dfi.iloc[2,1]),dfi.iloc[2,1]
>>> (unicode, u"['fish']")

The data is now in unicode format. This is problematic, as, when I compare the two dataframes as follows, all results are false, due to the non matching string formats:

np.where(df['joe'] == dfi['joe'],True,False)
dfi:
        ben         joe   test
0   ['dog']     ['dog']  False
1  ['fish']     ['cat']  False
2  ['fish']    ['fish']  False
3  ['bear']  ['rabbit']  False

What is going on during the read and write process that is causing this change, and how I change it to maintain str format post save?

E: Unfortunately, the nature of my problem dictates that the dataframe must be saved, and manipulated within a different file.

Edit, in response to EdChum's comment: if I instead store these strings as strings, not lists: I still get the same error:

df = pd.DataFrame({'joe': ['dog', 'cat', 'fish', 'rabbit'], 'ben': ['dog', 'fish', 'fish', 'bear']})

    ben     joe
0   dog     dog
1  fish     cat
2  fish    fish
3  bear  rabbit

writer1 = pd.ExcelWriter('Input Output Test Joe.xlsx')
df.to_excel(writer1,'Sheet1')
writer1.save()

dfi = pd.read_excel(open('Input Output Test Joe.xlsx','rb'), sheetname='Sheet1')

type(dfi.iloc[2, 1]), dfi.iloc[2, 1]
(unicode, u'fish')

And again, the comparison fails.

Edit: Evaluation of Unicode to a regular string can also be achieved via ast.literal_eval() as described here: Convert string representation of list to list in Python or as EdChum's suggestion.

Note, If you use to_csv() and read_csv() this issue is not present.

But why does to_excel() / re_excel() change the original code?

like image 785
Chuck Avatar asked Feb 03 '17 11:02

Chuck


1 Answers

But why does to_excel() / re_excel() change the original code?

I don't know. I briefly looked at the source of to_excel from_excel, but couldn't find any clues.
Setting engine='xlsxwriter' and leaving encoding to default seems to do it, i.e.:

import pandas as pd
df = pd.DataFrame(
    {'joe': [['dog'], ['cat'], ['fish'], ['rabbit']], 'ben': [['dog'], ['fish'], ['fish'], ['bear']]})

with pd.ExcelWriter ('Input Output Test.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1', engine='xlsxwriter')

dfi = pd.read_excel('Input Output Test.xlsx')

assert eval(dfi.iloc[2,1]) == df.iloc[2,1]
# True
like image 190
Pedro Lobito Avatar answered Nov 09 '22 23:11

Pedro Lobito