
join or merge with overwrite in pandas

Tags: python, pandas

I want to perform a join/merge/append operation on a dataframe with datetime index.

Let's say I have df1 and I want to add df2 to it. df2 can have fewer or more columns than df1, and its index can overlap with that of df1. For all rows where the indexes match, if df2 has the same column as df1, I want the values of df1 to be overwritten with those from df2.

How can I obtain the desired result?

saroele asked Mar 20 '12 13:03



2 Answers

How about: df2.combine_first(df1)?

    In [33]: df2
    Out[33]:
                       A         B         C         D
    2000-01-03  0.638998  1.277361  0.193649  0.345063
    2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726
    2000-01-05  0.435507 -0.025162 -1.112890  0.324111
    2000-01-06 -0.210756 -1.027164  0.036664  0.884715
    2000-01-07 -0.821631 -0.700394 -0.706505  1.193341
    2000-01-10  1.015447 -0.909930  0.027548  0.258471
    2000-01-11 -0.497239 -0.979071 -0.461560  0.447598

    In [34]: df1
    Out[34]:
                       A         B         C
    2000-01-03  2.288863  0.188175 -0.040928
    2000-01-04  0.159107 -0.666861 -0.551628
    2000-01-05 -0.356838 -0.231036 -1.211446
    2000-01-06 -0.866475  1.113018 -0.001483
    2000-01-07  0.303269  0.021034  0.471715
    2000-01-10  1.149815  0.686696 -1.230991
    2000-01-11 -1.296118 -0.172950 -0.603887
    2000-01-12 -1.034574 -0.523238  0.626968
    2000-01-13 -0.193280  1.857499 -0.046383
    2000-01-14 -1.043492 -0.820525  0.868685

    In [35]: df2.comb
    df2.combine        df2.combineAdd     df2.combine_first  df2.combineMult

    In [35]: df2.combine_first(df1)
    Out[35]:
                       A         B         C         D
    2000-01-03  0.638998  1.277361  0.193649  0.345063
    2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726
    2000-01-05  0.435507 -0.025162 -1.112890  0.324111
    2000-01-06 -0.210756 -1.027164  0.036664  0.884715
    2000-01-07 -0.821631 -0.700394 -0.706505  1.193341
    2000-01-10  1.015447 -0.909930  0.027548  0.258471
    2000-01-11 -0.497239 -0.979071 -0.461560  0.447598
    2000-01-12 -1.034574 -0.523238  0.626968       NaN
    2000-01-13 -0.193280  1.857499 -0.046383       NaN
    2000-01-14 -1.043492 -0.820525  0.868685       NaN

Note that it takes the values from df1 for indices that do not overlap with df2. If this doesn't do exactly what you want I would be willing to improve this function / add options to it.
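The behaviour described above can be checked on a smaller, deterministic pair of frames. This is only a minimal sketch: the column names, dates, and values are made up for illustration and are not taken from the session above. It also shows one caveat worth knowing: where df2 holds NaN at an overlapping label, combine_first keeps the df1 value instead of overwriting it with NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data: df1 covers Jan 3-6, df2 covers Jan 5-7.
idx1 = pd.date_range("2000-01-03", periods=4, freq="D")
idx2 = pd.date_range("2000-01-05", periods=3, freq="D")

df1 = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=idx1,
)
# df2 shares column A with df1, adds a new column D,
# and has a NaN in A on an overlapping date (Jan 6).
df2 = pd.DataFrame(
    {"A": [300.0, np.nan, 500.0], "D": [7.0, 8.0, 9.0]},
    index=idx2,
)

# Values from df2 win wherever df2 has a non-NaN value;
# everything else is filled in from df1.
result = df2.combine_first(df1)
print(result)
```

The result has the union of both indexes and both column sets. Cells that exist in neither frame (for example column D on Jan 3) come out as NaN.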

Wes McKinney answered Sep 26 '22 08:09


For a merge like this, the update method of a DataFrame is useful.

Taking the examples from the documentation:

    import pandas as pd
    import numpy as np

    df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, 2.1, np.nan],
                       [np.nan, 7., np.nan]])
    df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4]],
                       index=[1, 2])

Data before the update:

    >>> df1
         0    1    2
    0  NaN  3.0  5.0
    1 -4.6  2.1  NaN
    2  NaN  7.0  NaN
    >>>
    >>> df2
          0    1    2
    1 -42.6  NaN -8.2
    2  -5.0  1.6  4.0

Let's update df1 with data from df2:

    df1.update(df2)

Data after the update:

    >>> df1
          0    1    2
    0   NaN  3.0  5.0
    1 -42.6  2.1 -8.2
    2  -5.0  1.6  4.0

Remarks:

  • This is an in-place operation: it modifies the DataFrame on which update is called and returns None.
  • Non-NaN values in df1 are not overwritten by NaN values in df2.
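One more point matters for the original question: update only touches rows and columns that already exist in the calling DataFrame, so on its own it will not add the extra rows or columns of df2. A possible workaround, sketched below with made-up datetime-indexed data, is to reindex df1 to the union of both indexes and column sets before calling update. The variable names plain and merged are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: df1 covers Jan 3-5, df2 covers Jan 4-6.
df1 = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]},
    index=pd.date_range("2000-01-03", periods=3, freq="D"),
)
df2 = pd.DataFrame(
    {"A": [200.0, np.nan, 400.0], "D": [7.0, 8.0, 9.0]},
    index=pd.date_range("2000-01-04", periods=3, freq="D"),
)

# update() alone: only labels already present in df1 are touched,
# so column D and the row for Jan 6 are ignored.
plain = df1.copy()
plain.update(df2)

# Enlarge df1 to the union of both indexes and column sets first,
# then let update() overwrite with the non-NaN values of df2.
merged = df1.reindex(
    index=df1.index.union(df2.index),
    columns=df1.columns.union(df2.columns),
)
merged.update(df2)

print(plain)
print(merged)
```

Working on a copy (plain) or on the reindexed frame (merged) leaves the original df1 unchanged, which can be convenient because update modifies its caller in place.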
Nicolás Ozimica answered Sep 22 '22 08:09