I have an original pandas dataframe, let's call it df. I convert the dataframe to a csv file and then convert it back to a pandas dataframe. When I call df.equals(new dataframe), it returns False. I thought one issue could be that the indexing was off, so I set the new dataframe's index to the first column of the csv file (which is the index of the original dataframe), but I'm still getting the same result.
Example code:
import pandas as pd
df = <stuff here that aggregates other dataframes into one>
file_name = 'test/aggregated_reports.csv'
df.to_csv(file_name)
df2 = pd.read_csv(file_name, index_col=0)
assert df.equals(df2)
I did some manual testing by converting df2 into a csv again and comparing the two csvs (file_name and the csv created from df2.to_csv()), and they appeared to be identical, so I'm assuming the "difference" occurs when converting the original dataframe to a csv file. But I still can't quite figure it out...
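For reference, the file comparison I did was roughly the following (the second file path is just a placeholder for the csv written from df2):
import filecmp
df2.to_csv('test/aggregated_reports_roundtrip.csv')  # placeholder path for the second csv
# shallow=False compares the actual file contents, not just os.stat() metadata
print(filecmp.cmp('test/aggregated_reports.csv',
                  'test/aggregated_reports_roundtrip.csv',
                  shallow=False))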
Any insights on what may be causing the "difference" here would be greatly appreciated!
This may just be a rounding error (I'm assuming your data is numeric). If you store floating point numbers as text, reading them back in tends to introduce a slight error. See below - try comparing the numeric data using a difference tolerance rather than .equals().
import pandas as pd
import numpy as np
df = pd.DataFrame(
columns=['a', 'b', 'c'],
index=[0, 1, 2, 3] * 3,
data=np.random.random((12, 3)))
file_name = 'mydata.csv'
df.to_csv(file_name)
df2 = pd.read_csv(file_name, index_col=0)
print(df.equals(df2)) # Returns False
print(np.all(np.abs(df - df2) < 10 ** -10))  # Returns True
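pandas also ships a testing helper that compares with a tolerance, which can stand in for the manual np.abs check above (a minimal sketch; rtol/atol keyword arguments exist in newer pandas versions if you need to tune the tolerance):
import pandas as pd
# Passes silently if df and df2 match within tolerance, raises AssertionError otherwise
pd.testing.assert_frame_equal(df, df2, check_exact=False)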
Some other options to look at:
compare = (df == df2) # Dataframe of True/False
compare.all() # By column, True if all values are equal
compare.sum() # By column, how many values are equal
# Return any rows where there was a difference
df.where(~compare).dropna(how='all')
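To confirm that any differences really are just float round-trip noise, one quick check (a sketch, assuming the columns are numeric) is the largest absolute difference per column:
# Largest absolute difference in each column; values around 1e-16 point to
# floating point round-trip error rather than a real change in the data
print((df - df2).abs().max())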