
Subset pandas dataframe by overlap with another [duplicate]

Tags:

python

pandas

For the following two dataframes:

df1 = pd.DataFrame({'name': pd.Series(["A", "B", "C"]), 'value': pd.Series([1., 2., 3.])})

  name  value
0    A    1.0
1    B    2.0
2    C    3.0

df2 = pd.DataFrame({'name': pd.Series(["A", "C", "D"]), 'value': pd.Series([1., 3., 5.])})

  name  value
0    A    1.0
1    C    3.0
2    D    5.0

I would like to keep only the rows in df2 where the value in the name column overlaps with a value in the name column of df1, i.e. produce the following dataframe:

  name  value
0    A    1.0
1    C    3.0

I have tried a number of approaches, but I am new to Python and pandas and, coming from R, don't yet understand the syntax. Why does this line of code not work, and what would?

df2[df2["name"] in df1["name"]]

dentist_inedible asked Mar 10 '23 08:03


2 Answers

You can use isin:

print(df2[df2["name"].isin(df1["name"])])
  name  value
0    A    1.0
1    C    3.0
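
As for why the original line fails: Python's `in` on a Series tests the index labels, not the values, and it expects a single hashable key, so it cannot produce a row-by-row mask. A minimal sketch of the difference, rebuilding the frames from the question (the exception raised by the Series-in-Series check is an assumption about current pandas behaviour and may differ between versions, so it is caught broadly):

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["A", "B", "C"], "value": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"name": ["A", "C", "D"], "value": [1.0, 3.0, 5.0]})

# `in` on a Series looks at the index labels (0, 1, 2), not at the values.
label_hit = 0 in df1["name"]
value_hit = "A" in df1["name"]
print(label_hit, value_hit)

# Passing a whole Series to `in` does not give an element-wise result.
# Assumption: it raises, and the exception type depends on the pandas version.
try:
    df2["name"] in df1["name"]
    series_in_series_failed = False
except (TypeError, ValueError):
    series_in_series_failed = True

# isin() is the element-wise membership test: one boolean per row of df2.
mask = df2["name"].isin(df1["name"])
subset = df2[mask]
print(mask.tolist())
print(subset)
```

Indexing `df2` with the boolean mask is what actually selects the rows; `isin` only builds the mask.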

Another, possibly faster, solution uses numpy.intersect1d:

import numpy as np

# names that occur in both frames
val = np.intersect1d(df2["name"], df1["name"])
print(val)
['A' 'C']

print(df2[df2["name"].isin(val)])
  name  value
0    A    1.0
1    C    3.0
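
Whether the intersect1d route beats plain isin depends on the data, so it is worth measuring on your own frames. A rough sketch, with made-up frame sizes and key ranges chosen purely for illustration (`big1`, `big2` and the two helper functions are hypothetical names), that checks both approaches select the same rows and times them with timeit:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: sizes and key ranges are arbitrary assumptions.
rng = np.random.default_rng(0)
big1 = pd.DataFrame({"name": rng.integers(0, 5_000, 20_000).astype(str)})
big2 = pd.DataFrame({"name": rng.integers(2_500, 7_500, 20_000).astype(str)})

def with_isin():
    # Direct membership test against the other frame's column.
    return big2[big2["name"].isin(big1["name"])]

def with_intersect1d():
    # First reduce to the sorted unique common names, then test membership.
    common = np.intersect1d(big2["name"], big1["name"])
    return big2[big2["name"].isin(common)]

# Both approaches should select exactly the same rows.
same_rows = with_isin().equals(with_intersect1d())
print("same rows:", same_rows)

# Timings vary with machine, pandas version and data shape.
print("isin:       ", timeit.timeit(with_isin, number=5))
print("intersect1d:", timeit.timeit(with_intersect1d, number=5))
```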

jezrael answered Mar 24 '23 17:03


A slightly different method that might be useful on your actual data: you could use an "inner join" (the intersection), a la SQL. This is more useful when the non-key columns are not duplicated across the two data frames (e.g. when merging two different data sets that share a common key).

df1 = pd.DataFrame({'name': pd.Series(["A", "B", "C"]), 'value': pd.Series([1., 2., 3.])})
df2 = pd.DataFrame({'name': pd.Series(["A", "C", "D"]), 'value': pd.Series([1., 3., 5.])})

# DataFrame.join matches its on= column against the index of the other frame,
# so join(df2, on='name') alone does not work here; index both frames by 'name' instead.
df1.set_index('name', inplace=True)
df2.set_index('name', inplace=True)

df1.join(df2, how='inner', rsuffix='_other')

#       value  value_other
# name                    
# A       1.0          1.0
# C       3.0          3.0

Changing how to 'outer' would give you the union of the two; 'left' keeps all df1 rows and 'right' keeps all df2 rows.
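
If you would rather not move name into the index, pd.merge accepts the key column directly via on=. A small sketch of that variant, rebuilding the frames from the question (the `_other` suffix is kept only to mirror the join example, and the `kept` dictionary is just an illustrative helper), along with the names each how option retains:

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["A", "B", "C"], "value": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"name": ["A", "C", "D"], "value": [1.0, 3.0, 5.0]})

# Inner join on the shared key column: only names present in both frames.
inner = pd.merge(df1, df2, on="name", how="inner", suffixes=("", "_other"))
print(inner)

# Names kept by each join type: inner is the intersection, outer the union,
# left keeps every df1 name and right keeps every df2 name.
kept = {
    how: pd.merge(df1, df2, on="name", how=how)["name"].tolist()
    for how in ("inner", "outer", "left", "right")
}
print(kept)
```

Unlike the join version, this leaves df1 and df2 unmodified, since nothing is set as the index in place.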

Nick T answered Mar 24 '23 19:03