pandas DataFrame concat / update ("upsert")?

Tags: python, pandas

I am looking for an elegant way to append all the rows from one DataFrame to another DataFrame (both DataFrames having the same index and column structure), but in cases where the same index value appears in both DataFrames, use the row from the second DataFrame.

So, for example, if I start with:

df1:

                    A      B
    date
    '2015-10-01'  'A1'   'B1'
    '2015-10-02'  'A2'   'B2'
    '2015-10-03'  'A3'   'B3'

df2:

    date            A      B
    '2015-10-02'  'a1'   'b1'
    '2015-10-03'  'a2'   'b2'
    '2015-10-04'  'a3'   'b3'

I would like the result to be:

                    A      B
    date
    '2015-10-01'  'A1'   'B1'
    '2015-10-02'  'a1'   'b1'
    '2015-10-03'  'a2'   'b2'
    '2015-10-04'  'a3'   'b3'

This is analogous to what I think is called an "upsert" in some SQL systems: a combination of update and insert, in the sense that each row from df2 is either (a) used to update an existing row in df1 if the row key already exists in df1, or (b) inserted into df1 at the end if the row key does not already exist.

I have come up with the following:

    # The parentheses let the method chain span several lines.
    (
        pd.concat([df1, df2])  # concat the two DataFrames
        .reset_index()         # turn 'date' into a regular column
        .groupby('date')       # group rows by values in the 'date' column
        .tail(1)               # take the last row in each group
        .set_index('date')     # restore 'date' as the index
    )

This seems to work, but it relies on the rows in each groupby group always keeping the same order as in the original DataFrames, which I haven't verified, and it seems displeasingly convoluted.
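For anyone who wants to try this, here is a minimal runnable sketch of the approach above. The way df1 and df2 are constructed is an assumption (dates kept as plain strings in an index named 'date'); the question only shows the printed frames. The names `via_groupby`, `combined` and `via_duplicated` are made up for illustration. The second variant, which filters on `Index.duplicated(keep="last")`, is not from the question; it is one possible way to keep the last row per index value without the reset_index / groupby round trip.

```python
import pandas as pd

# Example frames from the question (construction details are assumed).
df1 = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["2015-10-01", "2015-10-02", "2015-10-03"], name="date"),
)
df2 = pd.DataFrame(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
    index=pd.Index(["2015-10-02", "2015-10-03", "2015-10-04"], name="date"),
)

# The groupby-based approach from the question.
via_groupby = (
    pd.concat([df1, df2])
    .reset_index()
    .groupby("date")
    .tail(1)
    .set_index("date")
)

# Index-based variant: drop every occurrence of an index value except the last.
combined = pd.concat([df1, df2])
via_duplicated = combined[~combined.index.duplicated(keep="last")]

print(via_groupby)
print(via_duplicated)
```

On this data both variants keep the df2 row wherever a date appears in both frames. Both depend on df2 coming after df1 in the list passed to `pd.concat`.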

Does anyone have any ideas for a more straightforward solution?

asked Oct 07 '15 20:10 by embeepea


1 Answer

One solution is to concatenate df1 with the new rows in df2 (i.e. those whose index value is not already in df1), then update the values with those from df2.

    df = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
    df.update(df2)

    >>> df
                 A   B
    2015-10-01  A1  B1
    2015-10-02  a1  b1
    2015-10-03  a2  b2
    2015-10-04  a3  b3
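A self-contained sketch of the two steps, for anyone who wants to run it. The frame construction is an assumption based on the question's example, and `new_rows` is just an illustrative name. One thing to be aware of: `DataFrame.update` works in place and only uses non-NA values from the other frame, so a NaN in df2 would not overwrite an existing value in df1.

```python
import pandas as pd

# Example frames from the question (construction details are assumed).
df1 = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["2015-10-01", "2015-10-02", "2015-10-03"], name="date"),
)
df2 = pd.DataFrame(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
    index=pd.Index(["2015-10-02", "2015-10-03", "2015-10-04"], name="date"),
)

# Step 1 ("insert"): append only the df2 rows whose index is not in df1.
new_rows = df2[~df2.index.isin(df1.index)]
df = pd.concat([df1, new_rows])

# Step 2 ("update"): overwrite matching rows in place with df2's values.
df.update(df2)

print(df)
```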

EDIT: Per the suggestion of @chrisb, this can further be simplified as follows:

pd.concat([df1[~df1.index.isin(df2.index)], df2]) 
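Wrapped up as a small helper, the one-liner might look like the sketch below. The function name `upsert` and its parameter names are hypothetical (pandas has no built-in of that name), and the frame construction is assumed from the question's example.

```python
import pandas as pd


def upsert(old: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rows of `old` not present in `new`, followed by all of `new`."""
    kept = old[~old.index.isin(new.index)]  # rows of old that new does not replace
    return pd.concat([kept, new])


# Example frames from the question (construction details are assumed).
df1 = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["2015-10-01", "2015-10-02", "2015-10-03"], name="date"),
)
df2 = pd.DataFrame(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
    index=pd.Index(["2015-10-02", "2015-10-03", "2015-10-04"], name="date"),
)

result = upsert(df1, df2)
print(result)
```

Note that the result lists the surviving df1 rows first and then every df2 row. That happens to be date order for this example; if df1 contained dates later than some in df2, a trailing `.sort_index()` would probably be needed to restore index order.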

Thanks Chris!

answered Sep 21 '22 03:09 by Alexander