
Pandas: DataFrame too long after merge

Say I have two DataFrames, one longer than the other, that I want to join on a specific column, as in the following example:

import pandas as pd

A = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10], 'col3': [11, 12, 13, 14, 15]})

B = pd.DataFrame({'col1': [1, 3, 5], 'col2': [16, 17, 18], 'col4': [19, 20, 21]})

Then I join them with:

pd.merge(A, B, on='col1', how='outer')

And get, as expected:

   col1  col2_x  col3  col2_y  col4
0     1       6    11      16    19
1     2       7    12     NaN   NaN
2     3       8    13      17    20
3     4       9    14     NaN   NaN
4     5      10    15      18    21

5 rows × 5 columns

However, I have two DataFrames that I'm trying to merge, with 28,011 and 15,676 rows, respectively. Merging them the same way as above, I would expect to get back a DataFrame with 28,011 rows and NaN in those cells where df2 had no observations. What happens instead is this:

len(pd.merge(df1, df2, on='col1', how='outer'))
  51881

How is this possible? The column I'm merging on is a unique identifier, and the same operation goes through without problems in Stata. What am I missing here?

Nils Gudat asked Oct 21 '22 00:10


1 Answer

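Sounds like you want a left join. An outer join keeps every key that appears in either DataFrame, so its result can have more rows than df1; a left join keeps only the keys found in df1.

Try: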

pd.merge(df1, df2, left_on='col1', right_on='col1', how='left')
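
One more thing worth checking: 51,881 is more than 28,011 + 15,676 = 43,687, which an outer join cannot produce when the key is unique in both frames, so at least one frame must repeat some values of col1. A left join would also return more than 28,011 rows if df2 repeats keys. Below is a minimal sketch of how to check for this, using small made-up frames in place of df1 and df2 (the data and the columns a and b are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical stand-ins for df1 and df2; note the repeated key 3 in df2.
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'a': [10, 20, 30, 40, 50]})
df2 = pd.DataFrame({'col1': [1, 3, 3, 6], 'b': [100, 300, 301, 600]})

# Count rows whose key value has already appeared earlier in the same frame.
dup_left = int(df1['col1'].duplicated().sum())
dup_right = int(df2['col1'].duplicated().sum())

# Each repeated key in df2 produces one extra row per matching row in df1.
outer = pd.merge(df1, df2, on='col1', how='outer')
left = pd.merge(df1, df2, on='col1', how='left')

# validate= makes pandas raise MergeError if the keys are not unique as claimed.
try:
    pd.merge(df1, df2, on='col1', how='left', validate='one_to_one')
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False

# Keeping only the first row per key in df2 brings the result back to len(df1).
left_dedup = pd.merge(df1, df2.drop_duplicates('col1'), on='col1', how='left')

print(dup_left, dup_right, len(outer), len(left), keys_unique, len(left_dedup))
# prints: 0 1 7 6 False 5
```

If the duplicated counts on the real data are nonzero, whether to drop, aggregate, or keep the repeated rows depends on what they represent; drop_duplicates is only one possible choice.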
Liam Foley answered Oct 23 '22 10:10