
Finding common rows (intersection) in two Pandas dataframes

Assume I have two dataframes of this format (call them df1 and df2):

+------------------------+------------------------+--------+
|        user_id         |      business_id       | rating |
+------------------------+------------------------+--------+
| rLtl8ZkDX5vH5nAx9C3q5Q | eIxSLxzIlfExI6vgAbn2JA |      4 |
| C6IOtaaYdLIT5fWd7ZYIuA | eIxSLxzIlfExI6vgAbn2JA |      5 |
| mlBC3pN9GXlUUfQi1qBBZA | KoIRdcIfh3XWxiCeV1BDmA |      3 |
+------------------------+------------------------+--------+

I'm looking to get a dataframe of all the rows that have a common user_id in df1 and df2 (i.e., if a user_id is in both df1 and df2, include both rows in the output dataframe).

I can think of many ways to approach this, but they all strike me as clunky. For example, we could find all the unique user_ids in each dataframe, create a set of each, find their intersection, filter the two dataframes with the resulting set and concatenate the two filtered dataframes.

Maybe that's the best approach, but I know Pandas is clever. Is there a simpler way to do this? I've looked at merge but I don't think that's what I need.

David Chouinard asked Oct 27 '13 14:10


2 Answers

My understanding is that this question is better answered over in this post.

But briefly, the answer to the OP with this method is simply:

s1 = pd.merge(df1, df2, how='inner', on=['user_id']) 

This gives s1 with five columns: user_id plus the other two columns from each of df1 and df2.
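To make the shape of that result concrete, here is a minimal sketch of the merge on made-up data. The shortened IDs and the ratings are hypothetical stand-ins for the question's format, not data from the original post.

```python
import pandas as pd

# Hypothetical sample data in the question's format (IDs shortened for readability).
df1 = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "business_id": ["b1", "b1", "b2"],
    "rating": [4, 5, 3],
})
df2 = pd.DataFrame({
    "user_id": ["u2", "u3", "u4"],
    "business_id": ["b3", "b4", "b4"],
    "rating": [2, 1, 5],
})

# Inner join on user_id: one output row per matching (df1 row, df2 row) pair.
s1 = pd.merge(df1, df2, how="inner", on=["user_id"])

# Overlapping non-key columns get the default suffixes "_x" (df1) and "_y" (df2).
print(list(s1.columns))
print(s1)
```

Note that the merge returns the matches side by side in one wide row, not as stacked rows from each dataframe, and a user with m rows in df1 and n rows in df2 appears m * n times. If the goal is the original rows stacked on top of each other, filtering each dataframe with isin() may be closer to what the question asks for.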

aldorath answered Oct 04 '22 20:10


If I understand you correctly, you can use a combination of Series.isin() and DataFrame.append():

In [80]: df1
Out[80]:
   rating  user_id
0       2  0x21abL
1       1  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
5       2  0x21abL
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
9       1  0x21abL

In [81]: df2
Out[81]:
   rating      user_id
0       2      0x1d14L
1       1    0xdbdcad7
2       1      0x21abL
3       3      0x21abL
4       3      0x21abL
5       1  0x5734a81e2
6       2      0x1d14L
7       0       0xdafL
8       0      0x1d14L
9       4  0x5734a81e2

In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)

In [83]: ind
Out[83]:
0     True
1    False
2     True
3     True
4     True
5    False
6     True
7     True
8     True
9    False
Name: user_id, dtype: bool

In [84]: df1[ind].append(df2[ind])
Out[84]:
   rating  user_id
0       2  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
0       2  0x1d14L
2       1  0x21abL
3       3  0x21abL
4       3  0x21abL
6       2  0x1d14L
7       0   0xdafL
8       0  0x1d14L

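This is essentially the algorithm you described as "clunky", using idiomatic pandas methods. Note the duplicate row indices in the result. Also note that the two isin() masks are aligned on row index when they are combined with &, and the single mask ind is then applied to both dataframes, so this won't give you the expected output if df1 and df2 have no overlapping row indices, i.e., if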

In [93]: df1.index & df2.index
Out[93]: Int64Index([], dtype='int64')

In fact, it won't give the expected output if their row indices are not equal.
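One way to sidestep the index caveat is to build a separate mask for each dataframe and apply each mask only to the dataframe it came from. The sketch below is a variation on the session above, not the answerer's code. The sample data is made up, with deliberately non-matching row indices. It also uses pd.concat instead of DataFrame.append, which was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical data; the two frames deliberately have different row indices.
df1 = pd.DataFrame(
    {"user_id": ["a", "b", "a", "c"], "rating": [2, 1, 0, 4]},
    index=[10, 11, 12, 13],
)
df2 = pd.DataFrame(
    {"user_id": ["c", "d", "a"], "rating": [3, 1, 2]},
    index=[0, 1, 2],
)

# One mask per frame, each indexed like the frame it will filter,
# so no cross-frame index alignment is involved.
mask1 = df1["user_id"].isin(df2["user_id"])
mask2 = df2["user_id"].isin(df1["user_id"])

# Stack the filtered rows; ignore_index=True avoids duplicate row labels.
result = pd.concat([df1[mask1], df2[mask2]], ignore_index=True)
print(result)
```

Since isin() tests membership by value, the outcome here does not depend on how either dataframe is indexed. Drop ignore_index=True if the original row labels should be preserved.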

Phillip Cloud answered Oct 04 '22 20:10