I have two dataframes df1
and df2
that that were computed from the same source, but with different methods, thus most of the values are same, with some differences. Now, I want to update df1
based on values in df2
.
For example:
df1 = pd.DataFrame({'name':['john','deb','john','deb'], 'col1':[490,500,425,678], 'col2':[456,625,578,789],'col3':['TN','OK','OK','NY']})
name col1 col2 col3
john 490 456 TN
deb 500 625 OK
john 425 578 OK
deb 678 789 NY
df2 = pd.DataFrame({'name':['deb','john','deb','john','deb'], 'col1':[400,490,500,425,678], 'col2':[225,456,625,578,789],'col3':['TN','TN','OK','OK','NY']})
name col1 col2 col3
deb 400 225 TN
john 490 456 TN
deb 500 625 OK
john 425 578 OK
deb 678 789 NY
So, in this case .append
should append only the first row from df2
to df1
. So, only if there is a new row in df2
that is not present in df1
(based on name and col3
) that column will be added/updated, else it wont be.
This almost seems like something that concat
should do.
append() function is used to append rows of other dataframe to the end of the given dataframe, returning a new dataframe object. Columns not in the original dataframes are added as new columns and the new cells are populated with NaN value.
Add multiple rows to pandas dataframe We can pass a list of series too in the dataframe. append() for appending multiple rows in dataframe. For example, we can create a list of series with same column names as dataframe i.e. Now pass this list of series to the append() function i.e.
By using df. loc[index]=list you can append a list as a row to the DataFrame at a specified Index, In order to add at the end get the index of the last record using len(df) function. The below example adds the list ["Hyperion",27000,"60days",2000] to the end of the pandas DataFrame. Yields below output.
In pandas you can add/append a new column to the existing DataFrame using DataFrame. insert() method, this method updates the existing DataFrame with a new column. DataFrame. assign() is also used to insert a new column however, this method returns a new Dataframe after adding a new column.
There are two ways of acheiving your result.
I will show you both.
Concat then Drop
This should be more CPU friendly
df3 = pd.concat([df1,df2])
df3.drop_duplicates(subset=['name', 'col3'], inplace=True, keep='last')
This method is possibly more memory intensive than an outer join because at one point you are holding df1
, df2
and the result of the concatination of both [df1, df2]
(df3
) in memory.
Outer join then Drop
This should be more memory friendly.
df3 = df1.merge(df2, on=list(df1), how='outer')
df3.drop_duplicates(subset=['name', 'col3'], inplace=True, keep='last')
Doing an outer
join will make sure you get all entries from both dataframes, but df3
will be smaller than in the case where we use concat
.
The keyword keep='last'
used to be take_last=True
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With