 

Add new rows to a pandas dataframe

I have two dataframes, df1 and df2, that were computed from the same source but with different methods, so most of the values are the same, with some differences. Now I want to update df1 based on the values in df2.

For example:

import pandas as pd

df1 = pd.DataFrame({'name':['john','deb','john','deb'], 'col1':[490,500,425,678], 'col2':[456,625,578,789],'col3':['TN','OK','OK','NY']})
 name col1 col2 col3
 john  490  456   TN
  deb  500  625   OK
 john  425  578   OK
  deb  678  789   NY

df2 = pd.DataFrame({'name':['deb','john','deb','john','deb'], 'col1':[400,490,500,425,678], 'col2':[225,456,625,578,789],'col3':['TN','TN','OK','OK','NY']})
 name col1 col2 col3
  deb  400  225   TN
 john  490  456   TN
  deb  500  625   OK
 john  425  578   OK
  deb  678  789   NY

So, in this case, .append should add only the first row of df2 to df1. That is, a row from df2 should be added only if it is not already present in df1 (based on name and col3); otherwise it should not be.

This almost seems like something that concat should do.

msakya asked Mar 25 '14 23:03


1 Answer

There are two ways of achieving your result.

  1. Concat both dataframes, then drop duplicates
  2. Using an outer join/merge, then drop duplicates

I will show you both.

Concat then Drop

This should be more CPU-friendly.

df3 = pd.concat([df1,df2])
df3.drop_duplicates(subset=['name', 'col3'], inplace=True, keep='last')

This method is possibly more memory-intensive than an outer join, because at one point you are holding df1, df2, and their concatenation (df3) in memory.
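As a minimal sketch, the concat-then-drop approach can be run end to end on the question's example data. The keep='first' choice here is an assumption about the intended result: it keeps df1's row whenever both frames share a (name, col3) pair, so only the new (deb, TN) row from df2 is added. With keep='last', the rows from df2 would win instead. The names stacked and concat_result are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['john', 'deb', 'john', 'deb'],
    'col1': [490, 500, 425, 678],
    'col2': [456, 625, 578, 789],
    'col3': ['TN', 'OK', 'OK', 'NY'],
})
df2 = pd.DataFrame({
    'name': ['deb', 'john', 'deb', 'john', 'deb'],
    'col1': [400, 490, 500, 425, 678],
    'col2': [225, 456, 625, 578, 789],
    'col3': ['TN', 'TN', 'OK', 'OK', 'NY'],
})

# Stack df1 on top of df2, renumbering the index so it stays unique.
stacked = pd.concat([df1, df2], ignore_index=True)

# keep='first' retains df1's row whenever a (name, col3) pair occurs in
# both frames, so only pairs that are new in df2 end up being added.
concat_result = (
    stacked
    .drop_duplicates(subset=['name', 'col3'], keep='first')
    .reset_index(drop=True)
)

print(concat_result)
```

Assigning the result of drop_duplicates instead of using inplace=True leaves the stacked frame untouched, which can make it easier to inspect what was dropped.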

Outer join then Drop

This should be more memory friendly.

df3 = df1.merge(df2, on=list(df1), how='outer')
df3.drop_duplicates(subset=['name', 'col3'], inplace=True, keep='last')

An outer join keeps every entry from both dataframes, but because it joins on all columns, rows that are identical in df1 and df2 appear only once, so df3 is smaller than the concat result before the duplicates are dropped.
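The outer-join variant can be sketched the same way on the question's data. One caveat worth checking on real data: if the two frames hold different col1 or col2 values for the same (name, col3) pair, the join keeps both rows, and which one survives drop_duplicates then depends on the row order of the merged frame. In this example every df1 row also appears unchanged in df2, so the question does not arise. The names merged, merge_result, merged_rows and df2_rows are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['john', 'deb', 'john', 'deb'],
    'col1': [490, 500, 425, 678],
    'col2': [456, 625, 578, 789],
    'col3': ['TN', 'OK', 'OK', 'NY'],
})
df2 = pd.DataFrame({
    'name': ['deb', 'john', 'deb', 'john', 'deb'],
    'col1': [400, 490, 500, 425, 678],
    'col2': [225, 456, 625, 578, 789],
    'col3': ['TN', 'TN', 'OK', 'OK', 'NY'],
})

# Join on every column: a row present in both frames appears only once.
merged = df1.merge(df2, on=list(df1.columns), how='outer')

# Collapse any remaining rows that share the same (name, col3) pair.
merge_result = merged.drop_duplicates(subset=['name', 'col3'], keep='last')

# Compare as sets of rows, so the check does not rely on the row order
# produced by the merge.
merged_rows = set(map(tuple, merge_result.values.tolist()))
df2_rows = set(map(tuple, df2.values.tolist()))
print(merged_rows == df2_rows)
```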

Note for older pandas versions:

Before pandas 0.17, the keyword keep='last' was spelled take_last=True.

firelynx answered Oct 23 '22 20:10
