Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Better way for merge (update\insert) pandas dataframes

I have 2 pandas data frames - df_current_data, df_new_data.

my goal is to apply a merge (not a pandas merge function, merge like 'update\insert'). The check for a match is by key columns.

my result need to built by 3 optional rows-types.

  1. rows which exists in df_current_data but not exists in df_new_data - will insert "as is" to the result.

  2. rows which exists in df_new_data but not exists in df_current_data - will insert "as is" to the result.

  3. rows which exists in df_new_data and exists in df_current_data - the result need to take the rows from df_new_data.

This is a classic merge- upsert action.

example:

# rows 0,1 are in current and not in new (check by index1 and index2)
# row 2 is common
In [41]: df_current_source
Out[41]:    A  index1  index2
         0  1       1       4
         1  2       2       5
         2  3       3       6

# rows 0,2 are in new and not in current (check by index1 and index2)
# row 1 is common
In [42]: df_new_source
Out[42]:    A  index1  index2
         0  4       2       7
         1  5       3       6
         2  6       4       5

# the result has 2 rows that only in current (rows 0,1)
# the result has 2 rows that only in new (rows 3,4)
# the result has one row that exists in both current and new (row 2 - index1 = 3, index2 = 6) - so the value of the column A is from the new and not from the current (5 instead of 2)

In [43]: df_result
Out[43]:    A  index1  index2
         0  1       1       4
         1  2       2       5
         2  5       3       6
         3  4       2       7
         4  6       4       5

That's what I did:

# left join from source to new
df = df_current_source.merge(df_new_source, how='left', left_on=p_new_keys, 
right_on=p_curr_keys, indicator=True)

# take only the rows that exists in the current and not exists in the source
df_only_current = df[df['_merge'] == 'left_only']

# merge new data into current data
df_result = pd.concat([df_only_current, df_new_source])

another option is with isin function:

df_result = pd.concat([df_current_source[~df_current_source[p_key_col_name]\

.isin(df_new_source[p_key_col_name])], df_new_source])

The problem is that if I have more than 1 key-column I can't use the isin, I need merge.

Assuming that the current is much bigger from the new, I guess the best way is to directly update the matching rows of current by the rows of new, and append the new rows of the "new" data frame into the current.

But I'm not sure how to do that..

Thanks a lot.

like image 881
user2671057 Avatar asked Aug 21 '17 11:08

user2671057


1 Answers

Option 1: use indicator=True as part of merge:

df_out = df_current_source.merge(df_new_source, 
                                 on=['index1', 'index2'], 
                                 how='outer', indicator=True)

df_out['A'] = np.where(df_out['_merge'] == 'both',
                       df_out['A_y'],
                       df_out.A_x.add(df_out.A_y, fill_value=0)).astype(int)

df_out[['A', 'index1', 'index2']]

Output:

   A  index1  index2
0  1       1       4
1  2       2       5
2  5       3       6
3  4       2       7
4  6       4       5

Option 2: use combined_first with set_index

df_new_source.set_index(['index1', 'index2'])\
             .combine_first(df_current_source.set_index(['index1', 'index2']))\
             .reset_index()\
             .astype(int)

Output:

   index1  index2  A
0       1       4  1
1       2       5  2
2       2       7  4
3       3       6  5
4       4       5  6
like image 160
Scott Boston Avatar answered Oct 03 '22 08:10

Scott Boston