I want to perform a join/merge/append operation on a dataframe with datetime index.
Let's say I have df1 and I want to add df2 to it. df2 can have fewer or more columns, and overlapping indexes. For all rows where the indexes match, if df2 has the same column as df1, I want the values of df1 to be overwritten with those from df2.
How can I obtain the desired result?
A merge can be slightly faster than a join. The difference per call is small, but over 4000 iterations it adds up to minutes.
Pandas join vs merge: both can be used to combine two DataFrames. join() combines them on their indexes, whereas merge() is more versatile: it is primarily used to specify the columns you want to join on, and it also supports joining on indexes or on a combination of index and columns.
Merge can be used in cases where the left and right key columns are not unique and therefore cannot serve as an index. A merge is also just as efficient as a join as long as the merge is done on indexes where possible.
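To make the join/merge distinction above concrete, here is a minimal sketch. The frames `left` and `right` and the column name `key` are made up for illustration; they are not from the question.

```python
import pandas as pd

# Hypothetical example frames sharing a "key" column.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10, 20, 40]})

# merge: join on a named column (inner join by default).
merged = left.merge(right, on="key")

# join: join on the index (left join by default).
joined = left.set_index("key").join(right.set_index("key"))

# merge can also join on the indexes when asked to.
merged_on_index = left.set_index("key").merge(
    right.set_index("key"), left_index=True, right_index=True, how="left"
)

print(merged)
print(joined)
print(merged_on_index)
```

Note the different defaults: `merge` keeps only the keys present in both frames, while `join` keeps every row of the calling frame and fills the missing side with NaN.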
How about: df2.combine_first(df1)?
In [33]: df2
Out[33]:
                   A         B         C         D
2000-01-03  0.638998  1.277361  0.193649  0.345063
2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726
2000-01-05  0.435507 -0.025162 -1.112890  0.324111
2000-01-06 -0.210756 -1.027164  0.036664  0.884715
2000-01-07 -0.821631 -0.700394 -0.706505  1.193341
2000-01-10  1.015447 -0.909930  0.027548  0.258471
2000-01-11 -0.497239 -0.979071 -0.461560  0.447598

In [34]: df1
Out[34]:
                   A         B         C
2000-01-03  2.288863  0.188175 -0.040928
2000-01-04  0.159107 -0.666861 -0.551628
2000-01-05 -0.356838 -0.231036 -1.211446
2000-01-06 -0.866475  1.113018 -0.001483
2000-01-07  0.303269  0.021034  0.471715
2000-01-10  1.149815  0.686696 -1.230991
2000-01-11 -1.296118 -0.172950 -0.603887
2000-01-12 -1.034574 -0.523238  0.626968
2000-01-13 -0.193280  1.857499 -0.046383
2000-01-14 -1.043492 -0.820525  0.868685

In [35]: df2.combine_first(df1)
Out[35]:
                   A         B         C         D
2000-01-03  0.638998  1.277361  0.193649  0.345063
2000-01-04 -0.816756 -1.711666 -1.155077 -0.678726
2000-01-05  0.435507 -0.025162 -1.112890  0.324111
2000-01-06 -0.210756 -1.027164  0.036664  0.884715
2000-01-07 -0.821631 -0.700394 -0.706505  1.193341
2000-01-10  1.015447 -0.909930  0.027548  0.258471
2000-01-11 -0.497239 -0.979071 -0.461560  0.447598
2000-01-12 -1.034574 -0.523238  0.626968       NaN
2000-01-13 -0.193280  1.857499 -0.046383       NaN
2000-01-14 -1.043492 -0.820525  0.868685       NaN
Note that it takes the values from df1 for indices that do not overlap with df2. If this doesn't do exactly what you want, I would be willing to improve this function or add options to it.
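One caveat worth checking against your data: combine_first only prefers df2 where df2 has a non-null value, so a NaN in df2 is filled from df1 rather than overwriting it. The small frames below are invented for illustration (they are not the ones printed above) and show this on a datetime index.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with partially overlapping datetime indexes.
idx1 = pd.date_range("2000-01-03", periods=3, freq="D")  # Jan 03, 04, 05
idx2 = pd.date_range("2000-01-04", periods=3, freq="D")  # Jan 04, 05, 06

df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]}, index=idx1)
df2 = pd.DataFrame({"A": [np.nan, 200.0, 300.0], "D": [7.0, 8.0, 9.0]}, index=idx2)

# Values from df2 win wherever df2 is non-null; df1 fills the rest.
result = df2.combine_first(df1)
print(result)
```

Here the result covers the union of both indexes and columns. On 2000-01-04, column A keeps df1's value because df2 holds NaN there; on 2000-01-05 it takes df2's value.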
For a merge like this, the update method of a DataFrame is useful.
Taking the examples from the documentation:
import pandas as pd
import numpy as np

df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, 2.1, np.nan], [np.nan, 7., np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4]], index=[1, 2])
Data before the update:
>>> df1
     0    1    2
0  NaN  3.0  5.0
1 -4.6  2.1  NaN
2  NaN  7.0  NaN
>>>
>>> df2
      0    1    2
1 -42.6  NaN -8.2
2  -5.0  1.6  4.0
Let's update df1 with data from df2:
df1.update(df2)
Data after the update:
>>> df1
      0    1    2
0   NaN  3.0  5.0
1 -42.6  2.1 -8.2
2  -5.0  1.6  4.0
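Remarks:
update works in place: it modifies the DataFrame it is called on (here df1) and returns None.
Non-NaN values in df1 are not overwritten with NaN values in df2.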
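One more point that matters for the original question: update aligns on the calling frame's labels only, so rows or columns that exist only in df2 are not added. A possible workaround, sketched below, is to reindex df1 to the union of both frames first and then update. The helper name `overwrite_with` and the sample frames are hypothetical, and the sketch assumes float data.

```python
import numpy as np
import pandas as pd

def overwrite_with(df1, df2):
    """Hypothetical helper: return a copy of df1 extended to the union of
    both frames' labels, with df2's non-NaN values overwriting df1's."""
    rows = df1.index.union(df2.index)
    cols = df1.columns.union(df2.columns)
    out = df1.reindex(index=rows, columns=cols)  # new frame; df1 is untouched
    out.update(df2)                              # in place on the new frame
    return out

# Made-up frames with partially overlapping datetime indexes.
idx1 = pd.date_range("2000-01-03", periods=3, freq="D")
idx2 = pd.date_range("2000-01-04", periods=3, freq="D")
df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]}, index=idx1)
df2 = pd.DataFrame({"A": [np.nan, 200.0, 300.0], "D": [7.0, 8.0, 9.0]}, index=idx2)

# Plain update: only labels already in df1 are touched.
plain = df1.copy()
plain.update(df2)

# Reindex first, then update: new rows and columns from df2 are kept.
out = overwrite_with(df1, df2)
print(plain)
print(out)
```

With these inputs, `plain` keeps df1's three rows and two columns, while `out` also gains the 2000-01-06 row and column D. In both cases the NaN in df2 leaves df1's value in place.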