For the following two dataframes:
import pandas as pd
df1 = pd.DataFrame({'name': pd.Series(["A", "B", "C"]), 'value': pd.Series([1., 2., 3.])})
name value
0 A 1.0
1 B 2.0
2 C 3.0
df2 = pd.DataFrame({'name': pd.Series(["A", "C", "D"]), 'value': pd.Series([1., 3., 5.])})
name value
0 A 1.0
1 C 3.0
2 D 5.0
I would like to keep only the rows in df2 where the value in the name column overlaps with a value in the name column of df1, i.e. produce the following dataframe:
name value
0 A 1.0
1 C 3.0
I have tried a number of approaches, but I am new to Python and pandas and, coming from R, don't yet understand the syntax. Why does this line of code not work, and what would?
df2[df2["name"] in df1["name"]]
You can use isin:
print(df2[df2["name"].isin(df1["name"])])
name value
0 A 1.0
1 C 3.0
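As for why the original line fails: Python's `in` operator is not element-wise. On a Series it asks a single yes/no question about the index labels, not the values, so it cannot produce the row-by-row boolean mask that indexing needs. A minimal sketch of the difference, using the frames from the question (the names `mask` and `result` are only for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'name': ["A", "B", "C"], 'value': [1., 2., 3.]})
df2 = pd.DataFrame({'name': ["A", "C", "D"], 'value': [1., 3., 5.]})

# `in` on a Series looks at the index labels (0, 1, 2 here), not the values.
print("A" in df1["name"])   # False: "A" is a value, not an index label
print(0 in df1["name"])     # True: 0 is an index label

# isin() tests every element and returns a boolean Series of the same length.
mask = df2["name"].isin(df1["name"])
print(mask.tolist())        # [True, True, False]

# Indexing with the boolean Series keeps only the rows where it is True.
result = df2[mask]
print(result)
```

This is roughly the pandas counterpart of R's `df2[df2$name %in% df1$name, ]`.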
Another solution, which may be faster on some data, uses numpy.intersect1d to compute the common names first:
import numpy as np
val = np.intersect1d(df2["name"], df1["name"])
print(val)
['A' 'C']
print(df2[df2["name"].isin(val)])
name value
0 A 1.0
1 C 3.0
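One thing worth knowing about this variant: `np.intersect1d` returns the sorted, unique values common to both inputs, so duplicates and the original order are dropped from `val` itself. The final `isin` filter still keeps every matching row in its original position. A small sketch with made-up data (the Series `left` and `right` are hypothetical, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical inputs with duplicates and unsorted values.
left = pd.Series(["C", "A", "C", "D"])
right = pd.Series(["A", "B", "C", "C"])

# intersect1d de-duplicates and sorts the shared values.
common = np.intersect1d(left, right)
print(common.tolist())      # ['A', 'C']

# Filtering with isin keeps all matching elements of `left`, in their original order.
kept = left[left.isin(common)]
print(kept.tolist())        # ['C', 'A', 'C']
```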
A slightly different method that might be useful on your actual data: use an "inner join" (the intersection of the keys), à la SQL. This is more useful when the two data frames carry different columns (e.g. merging two different data sets that share a common key).
df1 = pd.DataFrame({'name': pd.Series(["A", "B", "C"]), 'value': pd.Series([1., 2., 3.])})
df2 = pd.DataFrame({'name': pd.Series(["A", "C", "D"]), 'value': pd.Series([1., 3., 5.])})
# join's on= matches a column of the calling frame against the *index* of the other frame,
# so on='name' alone does not work here; instead, index both frames by 'name'.
df1.set_index('name', inplace=True)
df2.set_index('name', inplace=True)
df1.join(df2, how='inner', rsuffix='_other')
# value value_other
# name
# A 1.0 1.0
# C 3.0 3.0
Changing how to 'outer' would give you the union of the two, 'left' keeps all of the df1 rows, and 'right' all of the df2 rows.
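If you would rather not change the index, the same inner join can likely be written with `merge`, which accepts a column name through `on`. This is a sketch, not part of the approach above; passing only the `name` column of df1 is my own choice, made so that the result keeps just df2's columns without a suffixed duplicate:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ["A", "B", "C"], 'value': [1., 2., 3.]})
df2 = pd.DataFrame({'name': ["A", "C", "D"], 'value': [1., 3., 5.]})

# Inner join on the 'name' column: keep df2 rows whose name also occurs in df1.
# Only df1's 'name' column takes part, so df2's 'value' column is left as is.
result = df2.merge(df1[['name']], on='name', how='inner')
print(result)
```

Unlike `isin`, a join repeats rows when a name occurs more than once in df1, so calling `drop_duplicates()` on `df1[['name']]` first may be needed on real data.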