Pandas/Python: How to concatenate two dataframes without duplicates?

I'd like to concatenate two dataframes A and B into a new one without duplicate rows (if a row in B already exists in A, don't add it):

Dataframe A:          Dataframe B:

   I  II                 I  II
0  1   2              0  5   6
1  3   1              1  3   1

New Dataframe:

   I  II
0  1   2
1  3   1
2  5   6

How can I do this?

asked Jan 23 '14 by MJP


People also ask

How do I avoid duplicates in a pandas merge?

Use the merge() function to join the two data frames with an inner join. Then add a suffix such as 'remove' to the newly joined columns that have the same name in both data frames, and use the drop() function to remove the columns carrying that suffix. This ensures that identical columns don't end up duplicated in the new dataframe.
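A minimal sketch of that suffix trick; the frame contents and the '_remove' suffix name are illustrative assumptions, not from the question:

import pandas as pd

left = pd.DataFrame({'key': [1, 2], 'val': ['a', 'b']})
right = pd.DataFrame({'key': [1, 2], 'val': ['a', 'b']})

# Inner join; clashing column names on the right side get the '_remove' suffix.
merged = left.merge(right, on='key', how='inner', suffixes=('', '_remove'))

# Drop every column that picked up the suffix, leaving one copy of each clashing column.
merged = merged.drop(columns=[c for c in merged.columns if c.endswith('_remove')])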

How do I get rid of duplicates in a pandas DataFrame?

Consider a dataset containing ramen ratings. By default, drop_duplicates() removes duplicate rows based on all columns. To remove duplicates based on specific column(s), use the subset parameter. To remove duplicates and keep the last occurrences, use the keep parameter.
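For example (a small made-up frame standing in for the ramen data):

import pandas as pd

df = pd.DataFrame({'brand': ['A', 'A', 'B'], 'rating': [5, 5, 3]})

df.drop_duplicates()                    # compares all columns (default)
df.drop_duplicates(subset=['brand'])    # compares only the 'brand' column
df.drop_duplicates(keep='last')         # keeps the last occurrence instead of the first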

How do you concatenate two different DataFrames in Python?

The concat() function in pandas is used to append either columns or rows from one DataFrame to another. The concat() function does all the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.
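A short sketch, reusing the question's column names as an assumption:

import pandas as pd

df_a = pd.DataFrame({'I': [1, 3], 'II': [2, 1]})
df_b = pd.DataFrame({'I': [5, 3], 'II': [6, 1]})

rows = pd.concat([df_a, df_b])                 # stack as extra rows (axis=0, the default)
cols = pd.concat([df_a, df_b], axis=1)         # place side by side as extra columns
both = pd.concat([df_a, df_b], join='inner')   # 'inner' keeps only columns present in both frames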

What is the difference between pandas concat and merge?

The concat() function concatenates dataframes along rows or columns; we can think of it as stacking up multiple dataframes. merge() combines dataframes based on values in shared columns. merge() offers more flexibility than concat() because it allows combinations based on a matching condition.
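To make the contrast concrete (hypothetical frames):

import pandas as pd

sales = pd.DataFrame({'id': [1, 2], 'amount': [10, 20]})
names = pd.DataFrame({'id': [1, 2], 'name': ['x', 'y']})

stacked = pd.concat([sales, names])   # simply stacks the rows; no matching is done
joined = sales.merge(names, on='id')  # combines rows whose 'id' values match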


2 Answers

The simplest way is to just do the concatenation, and then drop duplicates.

>>> df1
   A  B
0  1  2
1  3  1
>>> df2
   A  B
0  5  6
1  3  1
>>> pandas.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
   A  B
0  1  2
1  3  1
2  5  6

The reset_index(drop=True) is to fix up the index after the concat() and drop_duplicates(). Without it you will have an index of [0,1,0] instead of [0,1,2]. This could cause problems for further operations on this dataframe down the road if it isn't reset right away.
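A quick check of that index behaviour, assuming two small frames shaped like the ones above:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 3], 'B': [2, 1]})
df2 = pd.DataFrame({'A': [5, 3], 'B': [6, 1]})

combined = pd.concat([df1, df2]).drop_duplicates()
list(combined.index)   # [0, 1, 0] -- the original labels survive the concat

combined = combined.reset_index(drop=True)
list(combined.index)   # [0, 1, 2]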

answered Sep 22 '22 by Ryan G


If DataFrame A already contains a duplicate row, then concatenating and dropping duplicate rows will also remove rows from DataFrame A that you might want to keep.

In this case, you will need to create a new column with a cumulative count and then drop duplicates. It all depends on your use case, but this is common with time-series data.

Here is an example:

df_1 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 34},
])

df_2 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 14},
])

df_1['count'] = df_1.groupby(['date', 'id', 'value']).cumcount()
df_2['count'] = df_2.groupby(['date', 'id', 'value']).cumcount()

df_tot = pd.concat([df_1, df_2], ignore_index=False)
df_tot = df_tot.drop_duplicates()
df_tot = df_tot.drop(['count'], axis=1)

>>> df_tot
         date  id  value
0  11/20/2015   4     24
1  11/20/2015   4     24
2  11/20/2015   6     34
1  11/20/2015   6     14
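For contrast, a plain concat plus drop_duplicates on the same data (with the helper column removed first) would collapse the legitimate duplicate in df_1. This sketch reuses df_1 and df_2 from the code above:

naive = pd.concat([df_1.drop(columns='count'), df_2.drop(columns='count')])
naive = naive.drop_duplicates()
# Only three rows remain -- the second (id=4, value=24) row from df_1 is gone.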
answered Sep 23 '22 by marwan