I'd like to concatenate two dataframes A and B into a new one without duplicate rows (if a row in B already exists in A, don't add it):
Dataframe A:        Dataframe B:
   I  II               I  II
0  1   2            0  5   6
1  3   1            1  3   1
New Dataframe:
   I  II
0  1   2
1  3   1
2  5   6
How can I do this?
One option is to use the merge() function to join the two DataFrames with an inner join, passing a suffix such as '_remove' for newly joined columns that share a name in both DataFrames. Then use the drop() function to remove the columns carrying that suffix, which ensures duplicated columns don't end up in the new DataFrame.
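A minimal sketch of that merge-and-drop pattern; the example frames, the 'key' column, and the '_remove' suffix are illustrative assumptions, not part of the original question:

import pandas as pd

df1 = pd.DataFrame({'key': [1, 2], 'val': [10, 20]})
df2 = pd.DataFrame({'key': [1, 2], 'val': [30, 40]})

# Inner join on 'key'; overlapping non-key columns coming from df2
# pick up the '_remove' suffix while df1's columns keep their names.
merged = df1.merge(df2, on='key', how='inner', suffixes=('', '_remove'))

# Drop every column that carries the suffix.
merged = merged.drop(columns=[c for c in merged.columns if c.endswith('_remove')])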
By default, drop_duplicates() removes duplicate rows based on all columns. To remove duplicates based on specific column(s), use the subset parameter. To remove duplicates but keep the last occurrence of each row, use keep='last'.
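A quick illustration of those options, using a small made-up frame:

import pandas as pd

df = pd.DataFrame({'brand': ['A', 'A', 'B'], 'rating': [5, 4, 5]})

df.drop_duplicates()                               # rows must match on all columns
df.drop_duplicates(subset=['brand'])               # dedupe on 'brand' only, keeps first
df.drop_duplicates(subset=['brand'], keep='last')  # dedupe on 'brand', keeps last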
The concat() function in pandas appends either rows or columns from one DataFrame to another. It does all the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.
concat() concatenates DataFrames along rows or columns; we can think of it as stacking up multiple DataFrames. merge() combines DataFrames based on values in shared columns, and it offers more flexibility than concat() because it allows combinations based on a condition.
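A side-by-side sketch of the two, reusing the A and B frames from the question (the names a and b are mine):

import pandas as pd

a = pd.DataFrame({'I': [1, 3], 'II': [2, 1]})
b = pd.DataFrame({'I': [5, 3], 'II': [6, 1]})

# concat stacks the frames positionally along an axis (rows here).
stacked = pd.concat([a, b])

# merge matches rows by values in shared columns -- an inner join on both.
matched = a.merge(b, on=['I', 'II'], how='inner')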
The simplest way is to just do the concatenation, and then drop duplicates.
>>> df1
   A  B
0  1  2
1  3  1
>>> df2
   A  B
0  5  6
1  3  1
>>> pandas.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
   A  B
0  1  2
1  3  1
2  5  6
The reset_index(drop=True) is there to fix up the index after the concat() and drop_duplicates(). Without it you will have an index of [0, 1, 0] instead of [0, 1, 2], which could cause problems for further operations on this dataframe down the road if it isn't reset right away.
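To see the stale index directly, here is a small sketch rebuilding df1 and df2 from above:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 3], 'B': [2, 1]})
df2 = pd.DataFrame({'A': [5, 3], 'B': [6, 1]})

combined = pd.concat([df1, df2]).drop_duplicates()
print(combined.index.tolist())   # [0, 1, 0] -- duplicate labels survive

combined = combined.reset_index(drop=True)
print(combined.index.tolist())   # [0, 1, 2]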
If DataFrame A already contains a duplicate row, then concatenating and dropping duplicates will remove rows from DataFrame A that you might want to keep.
In that case you need to create a new column with a cumulative count before dropping duplicates. It all depends on your use case, but this is common with time-series data.
Here is an example:
import pandas as pd

df_1 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 34},
])
df_2 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 14},
])

# Number repeated (date, id, value) rows within each frame so that a
# legitimate duplicate inside df_1 gets a distinct count and survives dedup.
df_1['count'] = df_1.groupby(['date', 'id', 'value']).cumcount()
df_2['count'] = df_2.groupby(['date', 'id', 'value']).cumcount()

df_tot = pd.concat([df_1, df_2], ignore_index=False)
df_tot = df_tot.drop_duplicates()
df_tot = df_tot.drop(['count'], axis=1)

>>> df_tot
         date  id  value
0  11/20/2015   4     24
1  11/20/2015   4     24
2  11/20/2015   6     34
1  11/20/2015   6     14