Assume I have two dataframes of this format (call them df1 and df2):
    +------------------------+------------------------+--------+
    |        user_id         |      business_id       | rating |
    +------------------------+------------------------+--------+
    | rLtl8ZkDX5vH5nAx9C3q5Q | eIxSLxzIlfExI6vgAbn2JA |      4 |
    | C6IOtaaYdLIT5fWd7ZYIuA | eIxSLxzIlfExI6vgAbn2JA |      5 |
    | mlBC3pN9GXlUUfQi1qBBZA | KoIRdcIfh3XWxiCeV1BDmA |      3 |
    +------------------------+------------------------+--------+
I'm looking to get a dataframe of all the rows whose user_id appears in both df1 and df2 (i.e., if a user_id is in both df1 and df2, include the rows from both in the output dataframe).
I can think of many ways to approach this, but they all strike me as clunky. For example, we could find all the unique user_ids in each dataframe, create a set of each, find their intersection, filter the two dataframes with the resulting set and concatenate the two filtered dataframes.
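For reference, the set-based approach described above can be sketched roughly as follows. The sample frames and their shortened IDs (u1, b1, and so on) are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data in the question's format (IDs shortened).
df1 = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "business_id": ["b1", "b1", "b2"],
    "rating": [4, 5, 3],
})
df2 = pd.DataFrame({
    "user_id": ["u2", "u3", "u4"],
    "business_id": ["b3", "b4", "b4"],
    "rating": [2, 1, 5],
})

# Unique user_ids present in both frames.
common = set(df1["user_id"]) & set(df2["user_id"])

# Keep only the rows whose user_id is in the intersection.
filtered1 = df1[df1["user_id"].isin(common)]
filtered2 = df2[df2["user_id"].isin(common)]

# Stack the two filtered frames; ignore_index renumbers the rows.
result = pd.concat([filtered1, filtered2], ignore_index=True)
print(result)
```

Here u1 and u4 are dropped because each appears in only one frame, and the remaining rows from both frames are stacked on top of each other.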
Maybe that's the best approach, but I know Pandas is clever. Is there a simpler way to do this? I've looked at merge but I don't think that's what I need.
My understanding is that this question is better answered over in this post.
But briefly, the answer to the OP with this method is simply:
s1 = pd.merge(df1, df2, how='inner', on=['user_id'])
Which gives s1 with 5 columns: user_id and the other two columns from each of df1 and df2.
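To make the shape of that result concrete, here is a small sketch on made-up data (the shortened IDs are hypothetical). Worth noting: an inner merge puts the matching rows side by side rather than stacking them, and a user_id that occurs several times yields one output row per matching pair.

```python
import pandas as pd

# Hypothetical sample data in the question's format (IDs shortened).
df1 = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "business_id": ["b1", "b1", "b2"],
    "rating": [4, 5, 3],
})
df2 = pd.DataFrame({
    "user_id": ["u2", "u3", "u3"],
    "business_id": ["b3", "b4", "b5"],
    "rating": [2, 1, 5],
})

s1 = pd.merge(df1, df2, how="inner", on=["user_id"])

# Overlapping non-key columns get the default _x / _y suffixes.
print(list(s1.columns))
# u2 matches once, u3 matches twice (one df1 row times two df2 rows).
print(len(s1))
```

If you want the original three-column layout with rows from both frames stacked, the merge result is not quite that, which may be why it looked like the wrong tool at first glance.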
If I understand you correctly, you can use a combination of Series.isin() and DataFrame.append():
    In [80]: df1
    Out[80]:
       rating  user_id
    0       2  0x21abL
    1       1  0x21abL
    2       1   0xdafL
    3       0  0x21abL
    4       4  0x1d14L
    5       2  0x21abL
    6       1  0x21abL
    7       0   0xdafL
    8       4  0x1d14L
    9       1  0x21abL

    In [81]: df2
    Out[81]:
       rating      user_id
    0       2      0x1d14L
    1       1    0xdbdcad7
    2       1      0x21abL
    3       3      0x21abL
    4       3      0x21abL
    5       1  0x5734a81e2
    6       2      0x1d14L
    7       0       0xdafL
    8       0      0x1d14L
    9       4  0x5734a81e2

    In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)

    In [83]: ind
    Out[83]:
    0     True
    1    False
    2     True
    3     True
    4     True
    5    False
    6     True
    7     True
    8     True
    9    False
    Name: user_id, dtype: bool

    In [84]: df1[ind].append(df2[ind])
    Out[84]:
       rating  user_id
    0       2  0x21abL
    2       1   0xdafL
    3       0  0x21abL
    4       4  0x1d14L
    6       1  0x21abL
    7       0   0xdafL
    8       4  0x1d14L
    0       2  0x1d14L
    2       1  0x21abL
    3       3  0x21abL
    4       3  0x21abL
    6       2  0x1d14L
    7       0   0xdafL
    8       0  0x1d14L
This is essentially the algorithm you described as "clunky", using idiomatic pandas methods. Note the duplicate row indices. Also, note that this won't give you the expected output if df1 and df2 have no overlapping row indices, i.e., if
    In [93]: df1.index & df2.index
    Out[93]: Int64Index([], dtype='int64')
In fact, it won't give the expected output if their row indices are not equal.
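One way to sidestep the index-alignment problem, as a sketch rather than a definitive fix, is to build a separate boolean mask for each frame instead of combining them with &. Each mask then lines up with its own frame no matter what the other frame's index looks like. The sample data below is made up, and pd.concat() stands in for DataFrame.append(), which was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical data; df2 deliberately has a non-overlapping index.
df1 = pd.DataFrame({
    "rating": [2, 1, 1, 4],
    "user_id": ["a", "a", "b", "c"],
})
df2 = pd.DataFrame(
    {"rating": [2, 1, 3], "user_id": ["c", "d", "a"]},
    index=[10, 11, 12],
)

# One mask per frame, each aligned with the frame it filters.
mask1 = df1["user_id"].isin(df2["user_id"])
mask2 = df2["user_id"].isin(df1["user_id"])

# Stack the filtered frames; the original row labels are kept.
result = pd.concat([df1[mask1], df2[mask2]])
print(result)
```

Pass ignore_index=True to pd.concat() if you would rather have fresh row labels than the original ones.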