I have 2 pandas data frames - df_current_data, df_new_data.
my goal is to apply a merge (not a pandas merge function, merge like 'update\insert'). The check for a match is by key columns.
my result need to built by 3 optional rows-types.
rows which exists in df_current_data but not exists in df_new_data - will insert "as is" to the result.
rows which exists in df_new_data but not exists in df_current_data - will insert "as is" to the result.
rows which exists in df_new_data and exists in df_current_data - the result need to take the rows from df_new_data.
This is a classic merge- upsert action.
example:
# rows 0,1 are in current and not in new (check by index1 and index2)
# row 2 is common
In [41]: df_current_source
Out[41]: A index1 index2
0 1 1 4
1 2 2 5
2 3 3 6
# rows 0,2 are in new and not in current (check by index1 and index2)
# row 1 is common
In [42]: df_new_source
Out[42]: A index1 index2
0 4 2 7
1 5 3 6
2 6 4 5
# the result has 2 rows that only in current (rows 0,1)
# the result has 2 rows that only in new (rows 3,4)
# the result has one row that exists in both current and new (row 2 - index1 = 3, index2 = 6) - so the value of the column A is from the new and not from the current (5 instead of 2)
In [43]: df_result
Out[43]: A index1 index2
0 1 1 4
1 2 2 5
2 5 3 6
3 4 2 7
4 6 4 5
That's what I did:
# left join from source to new
df = df_current_source.merge(df_new_source, how='left', left_on=p_new_keys,
right_on=p_curr_keys, indicator=True)
# take only the rows that exists in the current and not exists in the source
df_only_current = df[df['_merge'] == 'left_only']
# merge new data into current data
df_result = pd.concat([df_only_current, df_new_source])
another option is with isin function:
df_result = pd.concat([df_current_source[~df_current_source[p_key_col_name]\
.isin(df_new_source[p_key_col_name])], df_new_source])
The problem is that if I have more than 1 key-column I can't use the isin, I need merge.
Assuming that the current is much bigger from the new, I guess the best way is to directly update the matching rows of current by the rows of new, and append the new rows of the "new" data frame into the current.
But I'm not sure how to do that..
Thanks a lot.
indicator=True
as part of merge
:df_out = df_current_source.merge(df_new_source,
on=['index1', 'index2'],
how='outer', indicator=True)
df_out['A'] = np.where(df_out['_merge'] == 'both',
df_out['A_y'],
df_out.A_x.add(df_out.A_y, fill_value=0)).astype(int)
df_out[['A', 'index1', 'index2']]
Output:
A index1 index2
0 1 1 4
1 2 2 5
2 5 3 6
3 4 2 7
4 6 4 5
combined_first
with set_index
df_new_source.set_index(['index1', 'index2'])\
.combine_first(df_current_source.set_index(['index1', 'index2']))\
.reset_index()\
.astype(int)
Output:
index1 index2 A
0 1 4 1
1 2 5 2
2 2 7 4
3 3 6 5
4 4 5 6
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With