Morning.
I have condensed a much larger situation down to the following:
I have one file with a dataframe with some values in it.
df = pd.DataFrame(
{'joe': [['dog'], ['cat'], ['fish'], ['rabbit']], 'ben': [['dog'], ['fish'], ['fish'], ['bear']]})
df:
ben joe
0 [dog] [dog]
1 [fish] [cat]
2 [fish] [fish]
3 [bear] [rabbit]
The type of data contained within this dataframe is as follows:
type(df.iloc[2,1]),df.iloc[2,1]
>>> (list, ['fish'])
When I save the dataframe to excel using pd.to_excel()
:
writer1 = pd.ExcelWriter('Input Output Test.xlsx')
df.to_excel(writer1,'Sheet1')
writer1.save()
I immediately read this back in to the same file as follows:
dfi = pd.read_excel(open('Input Output Test.xlsx'), sheetname='Sheet1')
I again test the type of the data:
type(dfi.iloc[2,1]),dfi.iloc[2,1]
>>> (unicode, u"['fish']")
The data is now in unicode format. This is problematic, as, when I compare the two dataframes as follows, all results are false, due to the non matching string formats:
np.where(df['joe'] == dfi['joe'],True,False)
dfi:
ben joe test
0 ['dog'] ['dog'] False
1 ['fish'] ['cat'] False
2 ['fish'] ['fish'] False
3 ['bear'] ['rabbit'] False
What is going on during the read and write process that is causing this change, and how I change it to maintain str
format post save?
E: Unfortunately, the nature of my problem dictates that the dataframe must be saved, and manipulated within a different file.
Edit, in response to EdChum's comment: if I instead store these strings as strings, not lists: I still get the same error:
df = pd.DataFrame({'joe': ['dog', 'cat', 'fish', 'rabbit'], 'ben': ['dog', 'fish', 'fish', 'bear']})
ben joe
0 dog dog
1 fish cat
2 fish fish
3 bear rabbit
writer1 = pd.ExcelWriter('Input Output Test Joe.xlsx')
df.to_excel(writer1,'Sheet1')
writer1.save()
dfi = pd.read_excel(open('Input Output Test Joe.xlsx','rb'), sheetname='Sheet1')
type(dfi.iloc[2, 1]), dfi.iloc[2, 1]
(unicode, u'fish')
And again, the comparison fails.
Edit:
Evaluation of Unicode to a regular string can also be achieved via ast.literal_eval()
as described here:
Convert string representation of list to list in Python or as EdChum's suggestion.
Note, If you use to_csv()
and read_csv()
this issue is not present.
But why does to_excel() / re_excel()
change the original code?
But why does to_excel() / re_excel() change the original code?
I don't know. I briefly looked at the source of to_excel
from_excel
, but couldn't find any clues.
Setting engine='xlsxwriter'
and leaving encoding
to default seems to do it, i.e.:
import pandas as pd
df = pd.DataFrame(
{'joe': [['dog'], ['cat'], ['fish'], ['rabbit']], 'ben': [['dog'], ['fish'], ['fish'], ['bear']]})
with pd.ExcelWriter ('Input Output Test.xlsx') as writer:
df.to_excel(writer, sheet_name='Sheet1', engine='xlsxwriter')
dfi = pd.read_excel('Input Output Test.xlsx')
assert eval(dfi.iloc[2,1]) == df.iloc[2,1]
# True
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With