For the following two dataframes:
df1 = pd.DataFrame({'name': pd.Series(["A", "B", "C"]), 'value': pd.Series([1., 2., 3.])})
name value
0 A 1.0
1 B 2.0
2 C 3.0
df2 = pd.DataFrame({'name': pd.Series(["A", "C", "D"]), 'value': pd.Series([1., 3., 5.])})
name value
0 A 1.0
1 C 3.0
2 D 5.0
I would like to keep only the rows in df2 where the value in the name column overlaps with a value in the name column of df1, i.e. produce the following dataframe:
name value
0 A 1.0
1 C 3.0
I have tried a number of approaches, but I am new to Python and pandas and, coming from R, I don't yet understand the syntax. Why does this line of code not work, and what would?
df2[df2["name"] in df1["name"]]
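The in operator does not work elementwise: on a Series it tests a single key against the index labels, not the values, so it cannot build a row mask. You can use isin instead: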
print (df2[df2["name"].isin(df1["name"])])
name value
0 A 1.0
1 C 3.0
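To make the difference between in and isin concrete, here is a minimal sketch using the same df1 and df2 as in the question. The variable names mask and result are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ["A", "B", "C"], 'value': [1., 2., 3.]})
df2 = pd.DataFrame({'name': ["A", "C", "D"], 'value': [1., 3., 5.]})

# `in` on a Series checks the index labels, not the values
a_in_series = "A" in df1["name"]   # "A" is a value, not an index label
zero_in_series = 0 in df1["name"]  # 0 is an index label
print(a_in_series, zero_in_series)

# isin is elementwise: it returns one boolean per row of df2
mask = df2["name"].isin(df1["name"])
print(mask.tolist())

# Boolean indexing keeps only the rows where the mask is True
result = df2[mask]
print(result)
```

The key point is that boolean indexing needs a mask with one entry per row, which isin provides and in does not.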
Another solution, which may be faster on some data, uses numpy.intersect1d:
import numpy as np
val = np.intersect1d(df2["name"], df1["name"])
print (val)
['A' 'C']
print (df2[df2.name.isin(val)])
name value
0 A 1.0
1 C 3.0
A slightly different method that might be useful on your actual data: you could use an "inner join" (the intersection), as in SQL. This is more useful if your columns aren't duplicated in both data frames (e.g. merging two different data sets with some common key).
df1 = pd.DataFrame({'name': pd.Series(["A", "B", "C"]), 'value': pd.Series([1., 2., 3.])})
df2 = pd.DataFrame({'name': pd.Series(["A", "C", "D"]), 'value': pd.Series([1., 3., 5.])})
# DataFrame.join matches against the index of the other frame (on= only names a
# column in the calling frame), so set 'name' as the index on both frames first.
df1.set_index('name', inplace=True)
df2.set_index('name', inplace=True)
df1.join(df2, how='inner', rsuffix='_other')
# value value_other
# name
# A 1.0 1.0
# C 3.0 3.0
Changing how to outer would give you the union of the two, left for just the df1 rows, and right for just the df2 rows.
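As an alternative to setting the index before joining, pandas.merge can join directly on a column. The sketch below assumes the same df1 and df2 as above and reuses the '_other' suffix from the join example. The names inner, outer, left and right are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ["A", "B", "C"], 'value': [1., 2., 3.]})
df2 = pd.DataFrame({'name': ["A", "C", "D"], 'value': [1., 3., 5.]})

# Common arguments: join on the 'name' column, suffix df2's clashing columns
kwargs = dict(on='name', suffixes=('', '_other'))

# inner: only names present in both frames
inner = pd.merge(df1, df2, how='inner', **kwargs)
print(inner)

# outer: every name from either frame, with NaN where a side is missing
outer = pd.merge(df1, df2, how='outer', **kwargs)
print(outer)

# left: every name from df1; right: every name from df2
left = pd.merge(df1, df2, how='left', **kwargs)
right = pd.merge(df1, df2, how='right', **kwargs)
print(left)
print(right)
```

Unlike the join version, this leaves name as an ordinary column and does not modify df1 or df2.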