I've reviewed the pandas documentation on merge but have a question about efficiently overriding values in a 'left' merge. I can do this simply for one pair of columns (as seen here), but it becomes cluttered with multiple pairs.
If I take the following dataframes:
a = pd.DataFrame({
'id': [0,1,2,3,4,5,6,7,8,9],
'val': [100,100,100,100,100,100,100,100,100,100]
})
b = pd.DataFrame({
'id':[0,2,7],
'val': [500, 500, 500]
})
I can merge them:
df = a.merge(b, on=['id'], how='left', suffixes=('','_y'))
to get
id val val_y
0 0 100 500.0
1 1 100 NaN
2 2 100 500.0
3 3 100 NaN
4 4 100 NaN
5 5 100 NaN
6 6 100 NaN
7 7 100 500.0
8 8 100 NaN
9 9 100 NaN
I want to keep left values where no right value exists, but where possible overwrite with the right values.
My desired outcome is:
id val
0 0 500.0
1 1 100.0
2 2 500.0
3 3 100.0
4 4 100.0
5 5 100.0
6 6 100.0
7 7 500.0
8 8 100.0
9 9 100.0
I know I can accomplish this with a few lines of code:
df.loc[df.val_y.notnull(), 'val'] = df[df.val_y.notnull()].val_y
df = df.drop(['val_y'], axis = 1)
Or I can use the logic from this question.
But this becomes cluttered when there are multiple column pairings where I want to apply this logic.
For example, using a and b below:
a = pd.DataFrame({
'id': [0,1,2,3,4,5,6,7,8,9],
'val': [100,100,100,100,100,100,100,100,100,100],
'val_2':[200, 200, 200, 200, 200, 200, 200, 200, 200, 200]
})
b = pd.DataFrame({
'id':[0,2,7],
'val': [500, 500, 500],
'val_2': [500,500,500]
})
Is there a quicker, cleaner way to get my desired outcome?
I'd do this using set_index and update:
u = a.set_index('id')
u.update(b.set_index('id')) # Update a's values with b's values
u.reset_index()
id val
0 0 500.0
1 1 100.0
2 2 500.0
3 3 100.0
4 4 100.0
5 5 100.0
6 6 100.0
7 7 500.0
8 8 100.0
9 9 100.0
The update is aligned on the index. For this reason, I set "id" to be the index in both DataFrames before performing the update step.
Note that the "id" column must be unique.
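Since the question asks about several column pairs, here is a minimal sketch of the same set_index and update approach applied to the two-column frames from the question. The variable name `result` and the `is_unique` checks are my additions for illustration; update works in place and aligns on both the index and the column labels, so every shared column is handled in one call.

```python
import pandas as pd

a = pd.DataFrame({
    'id': list(range(10)),
    'val': [100] * 10,
    'val_2': [200] * 10,
})
b = pd.DataFrame({
    'id': [0, 2, 7],
    'val': [500, 500, 500],
    'val_2': [500, 500, 500],
})

# Optional sanity check (an addition, not part of the original answer):
# the approach relies on "id" being unique in both frames.
assert a['id'].is_unique and b['id'].is_unique

u = a.set_index('id')
# update() modifies u in place and returns None; non-NaN values from b
# overwrite the matching cells of u for every shared column.
u.update(b.set_index('id'))
result = u.reset_index()
print(result)
```

Depending on the pandas version, the updated columns may come back as floats rather than integers.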
Another option is using concat and drop_duplicates:
pd.concat([b, a]).drop_duplicates('id').sort_values('id')
id val
0 0 500
1 1 100
1 2 500
3 3 100
4 4 100
5 5 100
6 6 100
2 7 500
8 8 100
9 9 100
Since b overrides a, b must come first in the concat step.
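As a rough sketch, the same concat and drop_duplicates idea carries over to the two-column frames from the question without any extra code per column. The trailing `reset_index(drop=True)` and the name `result` are my additions, used only to tidy up the repeated index labels.

```python
import pandas as pd

a = pd.DataFrame({
    'id': list(range(10)),
    'val': [100] * 10,
    'val_2': [200] * 10,
})
b = pd.DataFrame({
    'id': [0, 2, 7],
    'val': [500, 500, 500],
    'val_2': [500, 500, 500],
})

result = (
    pd.concat([b, a])             # b first, so its rows are seen first
      .drop_duplicates('id')      # default keep='first' keeps b's row per id
      .sort_values('id')
      .reset_index(drop=True)     # replace the repeated labels with 0..n-1
)
print(result)
```

Note that this replaces whole rows rather than individual cells: if b lacked one of a's columns, the rows taken from b would hold NaN in that column.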
numpy searchsorted and assign
# Assumes a.id is sorted ascending and every id in b also appears in a
a.iloc[np.searchsorted(a.id, b.id), 1] = b.val.values
a
id val
0 0 500
1 1 100
2 2 500
3 3 100
4 4 100
5 5 100
6 6 100
7 7 500
8 8 100
9 9 100
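A possible extension of the searchsorted approach to several columns is sketched below. The names `cols`, `out`, `a_ids`, `b_ids`, and `pos`, as well as the guard that raises `ValueError`, are assumptions of mine for illustration. The sketch works on a copy so that a is left unchanged.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({
    'id': list(range(10)),
    'val': [100] * 10,
    'val_2': [200] * 10,
})
b = pd.DataFrame({
    'id': [0, 2, 7],
    'val': [500, 500, 500],
    'val_2': [500, 500, 500],
})

cols = ['val', 'val_2']          # columns to overwrite (illustrative choice)
a_ids = a['id'].to_numpy()
b_ids = b['id'].to_numpy()

# Row positions in a where each id of b would be inserted.
# These are the matching rows only if a_ids is sorted ascending
# and every id in b is present in a.
pos = np.searchsorted(a_ids, b_ids)

# Guard against ids in b that are missing from a (clip avoids an IndexError).
if not np.array_equal(a_ids[np.clip(pos, 0, len(a_ids) - 1)], b_ids):
    raise ValueError("every id in b must also appear in a")

out = a.copy()
# Positional assignment: rows from pos, columns looked up by name.
out.iloc[pos, out.columns.get_indexer(cols)] = b[cols].to_numpy()
print(out)
```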
dict
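# Works only for two-column frames: each row becomes one (id, val) pair
d = dict(a.values)
d.update(dict(b.values))  # b's values overwrite a's for matching ids
pd.DataFrame(dict(zip(a, zip(*d.items()))))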
id val
0 0 500
1 1 100
2 2 500
3 3 100
4 4 100
5 5 100
6 6 100
7 7 500
8 8 100
9 9 100