Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compare PandaS DataFrames and return rows that are missing from the first one

I have 2 dataFrames and want to compare them and return rows from the first one (df1) that are not in the second one (df2). I found a way to compare them and return the differences, but can't figure out how to return only missing ones from df1.

import pandas as pd
from pandas import Series, DataFrame

df1 = pd.DataFrame( { 
"City" : ["Chicago", "San Franciso", "Boston"] , 
"State" : ["Illinois", "California", "Massachusett"] } )

df2 = pd.DataFrame( { 
"City" : ["Chicago",  "Mmmmiami", "Dallas" , "Omaha"] , 
"State" : ["Illinois", "Florida", "Texas", "Nebraska"] } )



df = pd.concat([df1, df2])
df = df.reset_index(drop=True)

df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
blah = df.reindex(idx)
like image 985
runski74 Avatar asked Oct 26 '15 15:10

runski74


2 Answers

Building on @EdChum's suggestion:

df = pd.merge(df1, df2, how='outer', suffixes=('','_y'), indicator=True)
rows_in_df1_not_in_df2 = df[df['_merge']=='left_only'][df1.columns]

rows_in_df1_not_in_df2

|Index |City        |State       |
|------|------------|------------|
|1     |San Franciso|California  |
|2     |Boston      |Massachusett|

EDIT: incorporate @RobertPeters suggestion

like image 69
jabellcu Avatar answered Sep 21 '22 15:09

jabellcu


IIUC then if you're using pandas version 0.17.0 then you can use merge and set indicator=True:

In [80]:
df1 = pd.DataFrame( { 
"City" : ["Chicago", "San Franciso", "Boston"] , 
"State" : ["Illinois", "California", "Massachusett"] } )
​
df2 = pd.DataFrame( { 
"City" : ["Chicago",  "Mmmmiami", "Dallas" , "Omaha"] , 
"State" : ["Illinois", "Florida", "Texas", "Nebraska"] } )
pd.merge(df1,df2, how='outer', indicator=True)

Out[80]:
           City         State      _merge
0       Chicago      Illinois        both
1  San Franciso    California   left_only
2        Boston  Massachusett   left_only
3      Mmmmiami       Florida  right_only
4        Dallas         Texas  right_only
5         Omaha      Nebraska  right_only

This adds a column to indicator whether the rows are only present in either lhs or rhs

like image 34
EdChum Avatar answered Sep 19 '22 15:09

EdChum