
Pandas compare two dataframes and remove what matches in one column

I have two separate pandas dataframes (df1 and df2) which have multiple columns, but only one in common ('text').

I would like to find every row in df2 whose 'text' value does not match the 'text' value of any row in df1.

df1

A    B    text
45   2    score
33   5    miss
20   1    score

df2

C    D    text
.5   2    shot
.3   2    shot
.3   1    miss

Result df (remove row containing miss since it occurs in df1)

C    D    text
.5   2    shot
.3   2    shot

Is it possible to use the isin method in this scenario?

GNMO11 asked Dec 22 '15 14:12




2 Answers

As you asked, you can do this efficiently using isin (without resorting to expensive merges).

>>> df2[~df2.text.isin(df1.text.values)]
     C  D  text
0  0.5  2  shot
1  0.3  2  shot
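
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the two frames look exactly like the ones in the question; the variable names in_df1 and result are only illustrative.

```python
import pandas as pd

# Rebuild the two frames from the question.
df1 = pd.DataFrame({"A": [45, 33, 20], "B": [2, 5, 1],
                    "text": ["score", "miss", "score"]})
df2 = pd.DataFrame({"C": [0.5, 0.3, 0.3], "D": [2, 2, 1],
                    "text": ["shot", "shot", "miss"]})

# True for each df2 row whose 'text' also appears somewhere in df1['text'].
in_df1 = df2["text"].isin(df1["text"])

# Invert the mask to keep only the df2 rows without a match in df1.
result = df2[~in_df1]
print(result)
```

Because the mask is built from df2 itself, the surviving rows keep their original index labels and column order.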
Ami Tavory answered Sep 30 '22 18:09


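You can merge them and keep only the rows of df2 that have no match in df1. Merging from df2's side with indicator=True marks those rows as 'left_only':

merged = df2.merge(df1[['text']].drop_duplicates(), on='text', how='left', indicator=True)
merged[merged['_merge'] == 'left_only'].drop(columns='_merge')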

or you can use isin:

df2[~df2.text.isin(df1.text)]
Julien Spronck answered Sep 30 '22 20:09