I have two dataframes df1 and df2 that that were computed from the same source, but with different methods, thus most of the values are same, with some differences. Now, I want to update df1 based on values in df2.
For example:
df1 = pd.DataFrame({'name':['john','deb','john','deb'], 'col1':[490,500,425,678], 'col2':[456,625,578,789],'col3':['TN','OK','OK','NY']})
 name col1 col2 col3
 john  490  456  TN
 deb   500  625  OK
 john  425  578  OK
 deb   678  789  NY
df2 = pd.DataFrame({'name':['deb','john','deb','john','deb'], 'col1':[400,490,500,425,678], 'col2':[225,456,625,578,789],'col3':['TN','TN','OK','OK','NY']})
 name col1 col2 col3
  deb  400  225  TN
 john  490  456  TN
  deb  500  625  OK
 john  425  578  OK
 deb   678  789  NY
So, in this case .append should append only the first row from df2 to df1. So, only if there is a new row in df2 that is not present in df1 (based on name and col3) that column will be added/updated, else it wont be.
This almost seems like something that concat should do.
append() function is used to append rows of other dataframe to the end of the given dataframe, returning a new dataframe object. Columns not in the original dataframes are added as new columns and the new cells are populated with NaN value.
Add multiple rows to pandas dataframe We can pass a list of series too in the dataframe. append() for appending multiple rows in dataframe. For example, we can create a list of series with same column names as dataframe i.e. Now pass this list of series to the append() function i.e.
By using df. loc[index]=list you can append a list as a row to the DataFrame at a specified Index, In order to add at the end get the index of the last record using len(df) function. The below example adds the list ["Hyperion",27000,"60days",2000] to the end of the pandas DataFrame. Yields below output.
In pandas you can add/append a new column to the existing DataFrame using DataFrame. insert() method, this method updates the existing DataFrame with a new column. DataFrame. assign() is also used to insert a new column however, this method returns a new Dataframe after adding a new column.
There are two ways of acheiving your result.
I will show you both.
Concat then Drop
This should be more CPU friendly
df3 = pd.concat([df1,df2])
df3.drop_duplicates(subset=['name', 'col3'], inplace=True, keep='last')
This method is possibly more memory intensive than an outer join because at one point you are holding df1, df2 and the result of the concatination of both [df1, df2] (df3) in memory.
Outer join then Drop
This should be more memory friendly.
df3 = df1.merge(df2, on=list(df1), how='outer')
df3.drop_duplicates(subset=['name', 'col3'], inplace=True, keep='last')
Doing an outer join will make sure you get all entries from both dataframes, but df3 will be smaller than in the case where we use concat.
The keyword keep='last' used to be take_last=True
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With