I have two dataframes. Examples:
df1:
Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
df2:
Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange
Each dataframe has the Date as an index. Both dataframes have the same structure.
What i want to do, is compare these two dataframes and find which rows are in df2 that aren't in df1. I want to compare the date (index) and the first column (Banana, APple, etc) to see if they exist in df2 vs df1.
I have tried the following:
For the first approach I get this error: "Exception: Can only compare identically-labeled DataFrame objects". I have tried removing the Date as index but get the same error.
On the third approach, I get the assert to return False but cannot figure out how to actually see the different rows.
Any pointers would be welcome
By using equals() function we can directly check if df1 is equal to df2. This function is used to determine if two dataframe objects in consideration are equal or not. Unlike dataframe. eq() method, the result of the operation is a scalar boolean value indicating if the dataframe objects are equal or not.
The equals() function is used to test whether two objects contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements.
The compare method in pandas shows the differences between two DataFrames. It compares two data frames, row-wise and column-wise, and presents the differences side by side. The compare method can only compare DataFrames of the same shape, with exact dimensions and identical row and column labels.
In this article, we will discuss how to compare two DataFrames in pandas. First, let’s create two DataFrames. By using equals () function we can directly check if df1 is equal to df2. This function is used to determine if two dataframe objects in consideration are equal or not.
As an example, let’s look at two pandas dataframes. Both have date indexes and the same structure. How can we compare these two dataframes and find which rows are in dataframe 2 that aren’t in dataframe 1?
Example 1: Find out if the two DataFrames are identical. We can first find out if the two DataFrames are identical by using the DataFrame.equals () function:
In fact, all dataframes axes are compared with _indexed_same method, and exception is raised if differences found, even in columns/indices order. If I got you right, you want not to find changes, but symmetric difference.
This approach, df1 != df2
, works only for dataframes with identical rows and columns. In fact, all dataframes axes are compared with _indexed_same
method, and exception is raised if differences found, even in columns/indices order.
If I got you right, you want not to find changes, but symmetric difference. For that, one approach might be concatenate dataframes:
>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)
group by
>>> df_gpby = df.groupby(list(df.columns))
get index of unique records
>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
filter
>>> df.reindex(idx)
Date Fruit Num Color
9 2013-11-25 Orange 8.6 Orange
8 2013-11-25 Apple 22.1 Red
Updating and placing, somewhere it will be easier for others to find, ling's comment upon jur's response above.
df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)
Testing with these DataFrames:
# with import pandas as pd
df1 = pd.DataFrame({
'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24'],
'Fruit':['Banana','Orange','Apple','Celery'],
'Num':[22.1,8.6,7.6,10.2],
'Color':['Yellow','Orange','Green','Green'],
})
df2 = pd.DataFrame({
'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24','2013-11-25','2013-11-25'],
'Fruit':['Banana','Orange','Apple','Celery','Apple','Orange'],
'Num':[22.1,8.6,7.6,10.2,22.1,8.6],
'Color':['Yellow','Orange','Green','Green','Red','Orange'],
})
Results in this:
# for df1
Date Fruit Num Color
0 2013-11-24 Banana 22.1 Yellow
1 2013-11-24 Orange 8.6 Orange
2 2013-11-24 Apple 7.6 Green
3 2013-11-24 Celery 10.2 Green
# for df2
Date Fruit Num Color
0 2013-11-24 Banana 22.1 Yellow
1 2013-11-24 Orange 8.6 Orange
2 2013-11-24 Apple 7.6 Green
3 2013-11-24 Celery 10.2 Green
4 2013-11-25 Apple 22.1 Red
5 2013-11-25 Orange 8.6 Orange
# for df_diff
Date Fruit Num Color
4 2013-11-25 Apple 22.1 Red
5 2013-11-25 Orange 8.6 Orange
Passing the dataframes to concat in a dictionary, results in a multi-index dataframe from which you can easily delete the duplicates, which results in a multi-index dataframe with the differences between the dataframes:
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
import pandas as pd
DF1 = StringIO("""Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
""")
DF2 = StringIO("""Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange""")
df1 = pd.read_table(DF1, sep='\s+')
df2 = pd.read_table(DF2, sep='\s+')
#%%
dfs_dictionary = {'DF1':df1,'DF2':df2}
df=pd.concat(dfs_dictionary)
df.drop_duplicates(keep=False)
Result:
Date Fruit Num Color
DF2 4 2013-11-25 Apple 22.1 Red
5 2013-11-25 Orange 8.6 Orange
# THIS WORK FOR ME
# Get all diferent values
df3 = pd.merge(df1, df2, how='outer', indicator='Exist')
df3 = df3.loc[df3['Exist'] != 'both']
# If you like to filter by a common ID
df3 = pd.merge(df1, df2, on="Fruit", how='outer', indicator='Exist')
df3 = df3.loc[df3['Exist'] != 'both']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With