Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparing two dataframes and getting the differences [duplicate]

I have two dataframes. Examples:

df1:
Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green

df2:
Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple  22.1 Red
2013-11-25 Orange  8.6 Orange

Each dataframe has the Date as an index. Both dataframes have the same structure.

What i want to do, is compare these two dataframes and find which rows are in df2 that aren't in df1. I want to compare the date (index) and the first column (Banana, APple, etc) to see if they exist in df2 vs df1.

I have tried the following:

  • Outputting difference in two Pandas dataframes side by side - highlighting the difference
  • Comparing two pandas dataframes for differences

For the first approach I get this error: "Exception: Can only compare identically-labeled DataFrame objects". I have tried removing the Date as index but get the same error.

On the third approach, I get the assert to return False but cannot figure out how to actually see the different rows.

Any pointers would be welcome

like image 766
Eric D. Brown Avatar asked Nov 26 '13 18:11

Eric D. Brown


People also ask

How do you compare two DataFrames and find the difference?

By using equals() function we can directly check if df1 is equal to df2. This function is used to determine if two dataframe objects in consideration are equal or not. Unlike dataframe. eq() method, the result of the operation is a scalar boolean value indicating if the dataframe objects are equal or not.

How can you compare two DataFrames are identical?

The equals() function is used to test whether two objects contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements.

How do you compare data in two data frames?

The compare method in pandas shows the differences between two DataFrames. It compares two data frames, row-wise and column-wise, and presents the differences side by side. The compare method can only compare DataFrames of the same shape, with exact dimensions and identical row and column labels.

How to compare two DataFrames in pandas?

In this article, we will discuss how to compare two DataFrames in pandas. First, let’s create two DataFrames. By using equals () function we can directly check if df1 is equal to df2. This function is used to determine if two dataframe objects in consideration are equal or not.

What is an example of a Dataframe with the same structure?

As an example, let’s look at two pandas dataframes. Both have date indexes and the same structure. How can we compare these two dataframes and find which rows are in dataframe 2 that aren’t in dataframe 1?

How to find out if the two DataFrames are identical?

Example 1: Find out if the two DataFrames are identical. We can first find out if the two DataFrames are identical by using the DataFrame.equals () function:

Does _indexed_same compare all axes of a Dataframe?

In fact, all dataframes axes are compared with _indexed_same method, and exception is raised if differences found, even in columns/indices order. If I got you right, you want not to find changes, but symmetric difference.


4 Answers

This approach, df1 != df2, works only for dataframes with identical rows and columns. In fact, all dataframes axes are compared with _indexed_same method, and exception is raised if differences found, even in columns/indices order.

If I got you right, you want not to find changes, but symmetric difference. For that, one approach might be concatenate dataframes:

>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)

group by

>>> df_gpby = df.groupby(list(df.columns))

get index of unique records

>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

filter

>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
like image 94
alko Avatar answered Oct 19 '22 11:10

alko


Updating and placing, somewhere it will be easier for others to find, ling's comment upon jur's response above.

df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)

Testing with these DataFrames:

# with import pandas as pd

df1 = pd.DataFrame({
    'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24'],
    'Fruit':['Banana','Orange','Apple','Celery'],
    'Num':[22.1,8.6,7.6,10.2],
    'Color':['Yellow','Orange','Green','Green'],
    })

df2 = pd.DataFrame({
    'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24','2013-11-25','2013-11-25'],
    'Fruit':['Banana','Orange','Apple','Celery','Apple','Orange'],
    'Num':[22.1,8.6,7.6,10.2,22.1,8.6],
    'Color':['Yellow','Orange','Green','Green','Red','Orange'],
    })

Results in this:

# for df1

         Date   Fruit   Num   Color
0  2013-11-24  Banana  22.1  Yellow
1  2013-11-24  Orange   8.6  Orange
2  2013-11-24   Apple   7.6   Green
3  2013-11-24  Celery  10.2   Green


# for df2

         Date   Fruit   Num   Color
0  2013-11-24  Banana  22.1  Yellow
1  2013-11-24  Orange   8.6  Orange
2  2013-11-24   Apple   7.6   Green
3  2013-11-24  Celery  10.2   Green
4  2013-11-25   Apple  22.1     Red
5  2013-11-25  Orange   8.6  Orange


# for df_diff

         Date   Fruit   Num   Color
4  2013-11-25   Apple  22.1     Red
5  2013-11-25  Orange   8.6  Orange
like image 43
leerssej Avatar answered Oct 19 '22 12:10

leerssej


Passing the dataframes to concat in a dictionary, results in a multi-index dataframe from which you can easily delete the duplicates, which results in a multi-index dataframe with the differences between the dataframes:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

DF1 = StringIO("""Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green
""")
DF2 = StringIO("""Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple  22.1 Red
2013-11-25 Orange  8.6 Orange""")


df1 = pd.read_table(DF1, sep='\s+')
df2 = pd.read_table(DF2, sep='\s+')
#%%
dfs_dictionary = {'DF1':df1,'DF2':df2}
df=pd.concat(dfs_dictionary)
df.drop_duplicates(keep=False)

Result:

             Date   Fruit   Num   Color
DF2 4  2013-11-25   Apple  22.1     Red
    5  2013-11-25  Orange   8.6  Orange
like image 26
jur Avatar answered Oct 19 '22 11:10

jur


# THIS WORK FOR ME

# Get all diferent values
df3 = pd.merge(df1, df2, how='outer', indicator='Exist')
df3 = df3.loc[df3['Exist'] != 'both']


# If you like to filter by a common ID
df3  = pd.merge(df1, df2, on="Fruit", how='outer', indicator='Exist')
df3  = df3.loc[df3['Exist'] != 'both']
like image 20
Ivan Moran Avatar answered Oct 19 '22 12:10

Ivan Moran