
Compare 2 DataFrames for semi matching rows

I have 2 dataframes with strings in the cells:

df1

ID  t1  t2  t3
0   x1  y1  z1
1   x2  y2  z2 
2   x3  y3  z3 
3   x4  y4  z4  
4   x1  y5  z5 

df2

ID  t1  t2  t3
0   x3  y3  z3
1   x4  y4  z4 
2   x1  y1  z1 
3   x2  y2  z2  
4   x1  y7  z5 

I found that I can compare the differences in rows with:

#exactly the same t1, t2, and t3
pd.merge(df1, df2, on=['t1', 't2', 't3'], how='inner')

This will find an exact match between the rows (where t1 in df1 equals t1 in df2, etc.).
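For reference, here is a minimal self-contained version of that exact-match merge, built from the two tables above. It assumes the ID column is just the default integer index.

```python
import pandas as pd

# Sample frames copied from the tables above; ID is assumed to be the default index
df1 = pd.DataFrame({
    't1': ['x1', 'x2', 'x3', 'x4', 'x1'],
    't2': ['y1', 'y2', 'y3', 'y4', 'y5'],
    't3': ['z1', 'z2', 'z3', 'z4', 'z5'],
})
df2 = pd.DataFrame({
    't1': ['x3', 'x4', 'x1', 'x2', 'x1'],
    't2': ['y3', 'y4', 'y1', 'y2', 'y7'],
    't3': ['z3', 'z4', 'z1', 'z2', 'z5'],
})

# An inner merge on all three columns keeps only rows present in both frames,
# regardless of row order
exact = pd.merge(df1, df2, on=['t1', 't2', 't3'], how='inner')
print(exact)
```

The row with ID=4 is missing from this result because its t2 values differ (y5 vs. y7).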

How can I find a semi-match between the 2 dataframes for a specific column? That is, in addition to the exact matches, I also want rows that differ only in the specified column. For example, if I specify t2, a match requires t1 in df1 = t1 in df2 and t3 in df1 = t3 in df2, while t2 in df1 may differ from t2 in df2 (row ID=4 in the 2 dataframes would match this, in addition to the exact matches).

Update 1:

It seems like a lot of answers take row order into consideration (if the rows are not exactly aligned, the method will fail).

Try the following to check your method:

import pandas as pd

d1 = {'Entity1': ['x1', 'x2', 'x3', 'x4', 'x1', 'x6', 'x1'], 'Relationship': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6', 'y9'], 'Entity2': ['z1', 'z2', 'z3', 'z4', 'z5', 'z6', 'z5']}
df1 = pd.DataFrame(data=d1)

d2 = {'Entity1': ['x3', 'x4', 'x1', 'x2', 'x6', 'x1'], 'Relationship': ['y3', 'y4', 'y1', 'y2', 'y6', 'y7'], 'Entity2': ['z3', 'z4', 'z1', 'z2', 'z7', 'z5']}
df2 = pd.DataFrame(data=d2)

Note that one of the exact matches is x2, y2, z2, and one of the semi-matches is df1 = x1, y5, z5 with df2 = x1, y7, z5.

Penguin asked Mar 01 '23 11:03


1 Answer

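You could merge the two dataframes and then filter for all rows where t1 and t3 are the same on both sides. Note that this merge pairs rows by index, so it only finds matches that sit at the same position in both dataframes: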

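# Pair rows by index; overlapping column names get the suffixes _x and _y
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
# Keep rows where t1 and t3 agree; t2 is free to differ
df3[(df3["t1_x"] == df3["t1_y"]) & (df3["t3_x"] == df3["t3_y"])]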
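Since an index-based merge depends on row order, one possible order-independent variant is to merge on every column except the one that is allowed to differ. The sketch below uses the test data from Update 1 of the question. The helper name `semi_match`, the `_df1`/`_df2` suffixes, and the `exact` flag column are made up for illustration, and it assumes both dataframes have the same column names.

```python
import pandas as pd

# Test data from Update 1 of the question
d1 = {'Entity1': ['x1', 'x2', 'x3', 'x4', 'x1', 'x6', 'x1'],
      'Relationship': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6', 'y9'],
      'Entity2': ['z1', 'z2', 'z3', 'z4', 'z5', 'z6', 'z5']}
d2 = {'Entity1': ['x3', 'x4', 'x1', 'x2', 'x6', 'x1'],
      'Relationship': ['y3', 'y4', 'y1', 'y2', 'y6', 'y7'],
      'Entity2': ['z3', 'z4', 'z1', 'z2', 'z7', 'z5']}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)


def semi_match(left, right, free_col):
    """Hypothetical helper: match rows on all columns except free_col."""
    # Every column other than the free one must agree between the frames
    key_cols = [c for c in left.columns if c != free_col]
    merged = left.merge(right, on=key_cols, how='inner',
                        suffixes=('_df1', '_df2'))
    # True for exact matches, False for rows that differ only in free_col
    merged['exact'] = merged[f'{free_col}_df1'] == merged[f'{free_col}_df2']
    return merged


result = semi_match(df1, df2, 'Relationship')
print(result)

# Only the rows that differ in the free column
print(result[~result['exact']])
```

Because this is a plain inner merge, repeated key combinations are paired in every possible way: here both x1/y5/z5 and x1/y9/z5 from df1 are matched with x1/y7/z5 from df2. The x6 rows do not match at all, since their Entity2 values differ.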
Arne Decker answered Mar 12 '23 18:03