Say I have two DataFrames, one longer than the other, that I want to join on a specific column, as in the following example:
A = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10], 'col3': [11, 12, 13, 14, 15]})
B = pd.DataFrame({'col1': [1, 3, 5], 'col2': [16, 17, 18], 'col4': [19, 20, 21]})
Then I join them with:
pd.merge(A, B, on='col1', how='outer')
And get, as expected:
   col1  col2_x  col3  col2_y  col4
0     1       6    11      16    19
1     2       7    12     NaN   NaN
2     3       8    13      17    20
3     4       9    14     NaN   NaN
4     5      10    15      18    21
5 rows × 5 columns
However, I have two DataFrames that I'm trying to merge, with 28,011 and 15,676 rows, respectively. Merging them the same way as above, I would expect to get back a DataFrame with 28,011 rows and NaN in those cells where df2 had no observations. What happens instead is this:
len(pd.merge(df1, df2, on='col1', how='outer'))
51881
How is this possible? The column I'm merging on is a unique identifier, and the same operation goes through without problems in Stata. What am I missing here?
Sounds like you want a left join.
Try:
pd.merge(df1, df2, on='col1', how='left')
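As for why the row count grows: 51,881 is more than 28,011 + 15,676 = 43,687, and an outer join cannot exceed the sum of both lengths if the key is unique in at least one frame. So col1 most likely contains repeated values in both df1 and df2 (repeated NaN keys count too, as far as I know, since pandas matches them against each other). Note that a left join will still return more than 28,011 rows if df2 has repeated keys. Here is a rough sketch of how you could check this; the frames `left` and `right` and the columns `a` and `b` are made-up stand-ins for your data, not something from the question:

```python
import pandas as pd

# Hypothetical toy frames: the key 1 appears twice in each of them.
left = pd.DataFrame({'col1': [1, 1, 2, 3], 'a': [10, 11, 12, 13]})
right = pd.DataFrame({'col1': [1, 1, 3, 4], 'b': [20, 21, 22, 23]})

# Number of rows whose key value occurs more than once in that frame.
left_dupes = int(left['col1'].duplicated(keep=False).sum())
right_dupes = int(right['col1'].duplicated(keep=False).sum())

# Each repeated key on the left pairs with every matching row on the right,
# so key 1 alone contributes 2 * 2 = 4 rows to the result.
# indicator=True adds a '_merge' column showing where each row came from.
merged = pd.merge(left, right, on='col1', how='outer', indicator=True)

# validate='one_to_one' makes pandas raise instead of silently multiplying rows.
try:
    pd.merge(left, right, on='col1', how='left', validate='one_to_one')
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False

print(left_dupes, right_dupes, len(merged), keys_unique)
```

If the duplicate counts come back non-zero on your real data, you would need to decide whether to drop or aggregate the repeated keys before merging; the `validate` argument needs a reasonably recent pandas.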