Finding common rows (intersection) in two Pandas dataframes

Question

Assume I have two dataframes of this format (call them df1 and df2):

+------------------------+------------------------+--------+
|        user_id         |      business_id       | rating |
+------------------------+------------------------+--------+
| rLtl8ZkDX5vH5nAx9C3q5Q | eIxSLxzIlfExI6vgAbn2JA |      4 |
| C6IOtaaYdLIT5fWd7ZYIuA | eIxSLxzIlfExI6vgAbn2JA |      5 |
| mlBC3pN9GXlUUfQi1qBBZA | KoIRdcIfh3XWxiCeV1BDmA |      3 |
+------------------------+------------------------+--------+

I'm looking to get a dataframe of all the rows that have a common user_id in df1 and df2 (i.e., if a user_id is in both df1 and df2, include the matching rows from both dataframes in the output dataframe).

I can think of many ways to approach this, but they all strike me as clunky. For example, we could find all the unique user_ids in each dataframe, create a set of each, find their intersection, filter the two dataframes with the resulting set and concatenate the two filtered dataframes.
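For reference, the set-based approach described above can be sketched roughly as follows. The sample frames and their short ids are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Hypothetical sample data in the same shape as the question's frames
df1 = pd.DataFrame({
    "user_id": ["a", "b", "c"],
    "business_id": ["x", "x", "y"],
    "rating": [4, 5, 3],
})
df2 = pd.DataFrame({
    "user_id": ["b", "c", "d"],
    "business_id": ["z", "y", "z"],
    "rating": [1, 2, 5],
})

# user_ids that appear in both frames
common_ids = set(df1["user_id"]) & set(df2["user_id"])

# Filter each frame on its own user_id column, then stack the results
common_rows = pd.concat(
    [
        df1[df1["user_id"].isin(common_ids)],
        df2[df2["user_id"].isin(common_ids)],
    ],
    ignore_index=True,
)
print(common_rows)
```

Passing ignore_index=True renumbers the combined rows; leave it out if the original row labels matter.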

Maybe that's the best approach, but I know Pandas is clever. Is there a simpler way to do this? I've looked at merge but I don't think that's what I need.

aldorath · Accepted Answer

My understanding is that this question is better answered over in this post.

But briefly, the answer to the OP with this method is simply:

s1 = pd.merge(df1, df2, how='inner', on=['user_id'])

This gives s1 with 5 columns: user_id plus the other two columns (business_id and rating) from each of df1 and df2.
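A small sketch of what that merge returns, using made-up sample frames (the ids below are hypothetical). It shows that the result is laid out side by side rather than stacked, and that a user_id repeated in both frames yields one output row per pairing.

```python
import pandas as pd

# Hypothetical sample data; user "b" appears twice in each frame
df1 = pd.DataFrame({
    "user_id": ["a", "b", "b"],
    "business_id": ["x", "x", "y"],
    "rating": [4, 5, 3],
})
df2 = pd.DataFrame({
    "user_id": ["b", "b", "c"],
    "business_id": ["y", "z", "z"],
    "rating": [1, 2, 5],
})

s1 = pd.merge(df1, df2, how="inner", on=["user_id"])

# Non-key columns that exist in both frames get the default suffixes _x and _y
print(list(s1.columns))

# 2 rows for "b" in df1 times 2 rows for "b" in df2 gives 4 merged rows
print(len(s1))
```

If the goal is the stacked output the question asks for, the merge result would still need reshaping, so whether this is the simpler route depends on which layout is wanted.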

Phillip Cloud · Answer

If I understand you correctly, you can use a combination of Series.isin() and DataFrame.append() (note that DataFrame.append() was removed in pandas 2.0; pd.concat() is its replacement):

In [80]: df1
Out[80]:
   rating  user_id
0       2  0x21abL
1       1  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
5       2  0x21abL
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
9       1  0x21abL

In [81]: df2
Out[81]:
   rating      user_id
0       2      0x1d14L
1       1    0xdbdcad7
2       1      0x21abL
3       3      0x21abL
4       3      0x21abL
5       1  0x5734a81e2
6       2      0x1d14L
7       0       0xdafL
8       0      0x1d14L
9       4  0x5734a81e2

In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)

In [83]: ind
Out[83]:
0     True
1    False
2     True
3     True
4     True
5    False
6     True
7     True
8     True
9    False
Name: user_id, dtype: bool

In [84]: df1[ind].append(df2[ind])
Out[84]:
   rating  user_id
0       2  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
0       2  0x1d14L
2       1  0x21abL
3       3  0x21abL
4       3  0x21abL
6       2  0x1d14L
7       0   0xdafL
8       0  0x1d14L

This is essentially the algorithm you described as "clunky", using idiomatic pandas methods. Note the duplicate row indices. Also, note that this won't give you the expected output if df1 and df2 have no overlapping row indices, i.e., if

In [93]: df1.index & df2.index
Out[93]: Int64Index([], dtype='int64')

In fact, it won't give the expected output if their row indices are not equal.
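One way around the index dependence, sketched here under the assumption that a separate mask per frame is acceptable, is to avoid combining the two boolean Series at all. The sample data is hypothetical and deliberately uses non-overlapping row indices; pd.concat stands in for DataFrame.append, which no longer exists in pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical frames whose row indices do not overlap
df1 = pd.DataFrame(
    {"rating": [2, 1, 4], "user_id": ["u1", "u2", "u3"]},
    index=[10, 11, 12],
)
df2 = pd.DataFrame(
    {"rating": [3, 0], "user_id": ["u3", "u9"]},
    index=[0, 1],
)

# Each mask is aligned to the frame it filters, so row labels never interact
mask1 = df1["user_id"].isin(df2["user_id"])
mask2 = df2["user_id"].isin(df1["user_id"])

common = pd.concat([df1[mask1], df2[mask2]])
print(common)
```

The original row labels are kept, so duplicates are still possible in the output; pass ignore_index=True to pd.concat to renumber them.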

Tags: python, pandas, dataframe, intersect

David Chouinard
