
Comparing two pandas dataframes for differences

I've got a script updating 5-10 columns' worth of data, but sometimes the starting CSV will be identical to the ending CSV, so instead of writing an identical CSV file I want it to do nothing...

How can I compare two dataframes to check if they're the same or not?

csvdata = pandas.read_csv('csvfile.csv')
csvdata_old = csvdata

# ... do stuff with csvdata dataframe

if csvdata_old != csvdata:
    csvdata.to_csv('csvfile.csv', index=False)

Any ideas?

Ryflex asked Nov 11 '13 22:11




2 Answers

You also need to be careful to create a copy of the DataFrame, otherwise csvdata_old will be updated along with csvdata (since both names point to the same object):

csvdata_old = csvdata.copy() 
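To illustrate the aliasing problem, here is a minimal sketch. The small frame and its `price` column are made up for the example and stand in for whatever is read from the CSV.

```python
import pandas as pd

# Hypothetical stand-in for the data read from csvfile.csv
csvdata = pd.DataFrame({"price": [1.0, 2.0, 3.0]})

alias = csvdata            # second name for the same object, not a snapshot
snapshot = csvdata.copy()  # independent copy of the data

# Modify the frame in place
csvdata.loc[0, "price"] = 99.0

print(alias is csvdata)          # True: the alias sees every change
print(alias.loc[0, "price"])     # 99.0
print(snapshot.loc[0, "price"])  # 1.0: the copy kept the old value
```

Only the copy can serve as the "before" state when comparing later.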

To check whether they are equal, you can use assert_frame_equal as in this answer:

from pandas.testing import assert_frame_equal  # formerly pandas.util.testing
assert_frame_equal(csvdata, csvdata_old)

You can wrap this in a function with something like:

try:
    assert_frame_equal(csvdata, csvdata_old)
    return True
except:  # apparently AssertionError doesn't catch all
    return False

There was discussion of a better way...
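As a rough sketch of such a wrapper, assuming a pandas version that provides `pandas.testing.assert_frame_equal` and `DataFrame.equals`: the function name `frames_equal` and the sample frames are hypothetical, chosen only for illustration.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_equal(left, right):
    """Return True if assert_frame_equal finds no difference, else False."""
    try:
        assert_frame_equal(left, right)
        return True
    except Exception:  # deliberately broader than AssertionError, per the note above
        return False


# Hypothetical sample data
old = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
same = old.copy()
changed = old.copy()
changed.loc[1, "b"] = "z"

print(frames_equal(old, same))     # True
print(frames_equal(old, changed))  # False

# DataFrame.equals is a built-in alternative that returns a bool directly
print(old.equals(same))            # True
print(old.equals(changed))         # False
```

In the question's script, the check might then read `if not csvdata.equals(csvdata_old):` before the `to_csv` call. Note that both approaches also compare dtypes, so frames with equal values but different column types are reported as different.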

Andy Hayden answered Sep 22 '22 21:09


Not sure if this is helpful or not, but I whipped together this quick Python function for returning just the differences between two dataframes that both have the same columns and shape.

def get_different_rows(source_df, new_df):
    """Returns just the rows from the new dataframe that differ from the source dataframe"""
    merged_df = source_df.merge(new_df, indicator=True, how='outer')
    changed_rows_df = merged_df[merged_df['_merge'] == 'right_only']
    return changed_rows_df.drop('_merge', axis=1)
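A quick usage sketch of the function above, with two made-up frames (the `id` and `value` columns are assumptions for the example). Because no `on` argument is given, `merge` joins on every column the two frames share, so a row counts as changed if any of its values differs.

```python
import pandas as pd


def get_different_rows(source_df, new_df):
    """Return the rows of new_df that have no exact match in source_df."""
    merged_df = source_df.merge(new_df, indicator=True, how="outer")
    changed_rows_df = merged_df[merged_df["_merge"] == "right_only"]
    return changed_rows_df.drop("_merge", axis=1)


# Hypothetical data: only the row with id 2 differs between the frames
source = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
new = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "B", "c"]})

diff = get_different_rows(source, new)
print(diff)  # a single row: id 2, value "B"
```

An empty result means every row of the new frame also appears in the source frame.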
Tom Chapin answered Sep 18 '22 21:09