I have a large data frame df
and a small data frame df_right
with 2 columns a
and b
. I want to do a simple left join / lookup on a
without copying df
.
I come up with this code but I am not sure how robust it is:
dtmp = pd.merge(df[['a']], df_right, on = 'a', how = "left") #one col left join
df['b'] = dtmp['b'].values
I know it certainly fails when there are duplicated keys: pandas left join - why more results?
Is there better way to do this?
Related:
Outer merging two data frames in place in pandas
What are the exact downsides of copy=False in DataFrame.merge()?
You are almost there. There are 4 cases to consider:
df
and df_right
do not have duplicated keysdf
has duplicated keysdf_right
has duplicated keysdf
and df_right
have duplicated keysYour code fails in case 3 & 4 since the merging extends the number of row count in df
. In order to make it work, you need to choose what information to drop in df_right
prior to merging. The purpose of this is to enforce any merging scheme to be either case 1 or 2.
For example, if you wish to keep "first" values for each duplicated key in df_right
, the following code works for all 4 cases above.
dtmp = pd.merge(df[['a']], df_right.drop_duplicates('a', keep='first'), on='a', how='left')
df['b'] = dtmp['b'].values
Alternatively, if column 'b'
of df_right
consists of numeric values and you wish to have summary statistic:
dtmp = pd.merge(df[['a']], df_right.groupby('a').mean().reset_index(drop=False), on='a', how='left')
df['b'] = dtmp['b'].values
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With