Better way for merge (update\insert) pandas dataframes

Question

I have 2 pandas data frames - df_current_data, df_new_data.

my goal is to apply a merge (not a pandas merge function, merge like 'update\insert'). The check for a match is by key columns.

my result need to built by 3 optional rows-types.

rows which exists in df_current_data but not exists in df_new_data - will insert "as is" to the result.
rows which exists in df_new_data but not exists in df_current_data - will insert "as is" to the result.
rows which exists in df_new_data and exists in df_current_data - the result need to take the rows from df_new_data.

This is a classic merge- upsert action.

example:

# rows 0,1 are in current and not in new (check by index1 and index2)
# row 2 is common
In [41]: df_current_source
Out[41]:    A  index1  index2
         0  1       1       4
         1  2       2       5
         2  3       3       6

# rows 0,2 are in new and not in current (check by index1 and index2)
# row 1 is common
In [42]: df_new_source
Out[42]:    A  index1  index2
         0  4       2       7
         1  5       3       6
         2  6       4       5

# the result has 2 rows that only in current (rows 0,1)
# the result has 2 rows that only in new (rows 3,4)
# the result has one row that exists in both current and new (row 2 - index1 = 3, index2 = 6) - so the value of the column A is from the new and not from the current (5 instead of 2)

In [43]: df_result
Out[43]:    A  index1  index2
         0  1       1       4
         1  2       2       5
         2  5       3       6
         3  4       2       7
         4  6       4       5

That's what I did:

# left join from source to new
df = df_current_source.merge(df_new_source, how='left', left_on=p_new_keys, 
right_on=p_curr_keys, indicator=True)

# take only the rows that exists in the current and not exists in the source
df_only_current = df[df['_merge'] == 'left_only']

# merge new data into current data
df_result = pd.concat([df_only_current, df_new_source])

another option is with isin function:

df_result = pd.concat([df_current_source[~df_current_source[p_key_col_name]\

.isin(df_new_source[p_key_col_name])], df_new_source])

The problem is that if I have more than 1 key-column I can't use the isin, I need merge.

Assuming that the current is much bigger from the new, I guess the best way is to directly update the matching rows of current by the rows of new, and append the new rows of the "new" data frame into the current.

But I'm not sure how to do that..

Thanks a lot.

Scott Boston · Accepted Answer

Option 1: use `indicator=True` as part of `merge`:

df_out = df_current_source.merge(df_new_source, 
                                 on=['index1', 'index2'], 
                                 how='outer', indicator=True)

df_out['A'] = np.where(df_out['_merge'] == 'both',
                       df_out['A_y'],
                       df_out.A_x.add(df_out.A_y, fill_value=0)).astype(int)

df_out[['A', 'index1', 'index2']]

Output:

   A  index1  index2
0  1       1       4
1  2       2       5
2  5       3       6
3  4       2       7
4  6       4       5

Option 2: use `combined_first` with `set_index`

df_new_source.set_index(['index1', 'index2'])\
             .combine_first(df_current_source.set_index(['index1', 'index2']))\
             .reset_index()\
             .astype(int)

Output:

   index1  index2  A
0       1       4  1
1       2       5  2
2       2       7  4
3       3       6  5
4       4       5  6

Better way for merge (update\insert) pandas dataframes

Tags:

python

merge

pandas

dataframe

user2671057

1 Answers

Option 1: use `indicator=True` as part of `merge`:

Option 2: use `combined_first` with `set_index`

Scott Boston

Recent Activity

Donate For Us

Better way for merge (update\insert) pandas dataframes

Tags:

python

merge

pandas

dataframe

user2671057

1 Answers

Option 1: use indicator=True as part of merge:

Option 2: use combined_first with set_index

Scott Boston

Related questions

Recent Activity

Donate For Us

Option 1: use `indicator=True` as part of `merge`:

Option 2: use `combined_first` with `set_index`