
Duplicated rows when merging dataframes in Python

I am currently merging two dataframes with an inner join. However, after merging, all the rows are duplicated, even though the columns I merged on contain the same values.

Specifically, I have the following code.

merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')

Here are the two dataframes and the results.

df1

          email_address    name   surname
0  [email protected]    john     smith
1  [email protected]    john     smith
2       [email protected]   elvis   presley

df2

          email_address    street  city
0  [email protected]   street1    NY
1  [email protected]   street1    NY
2       [email protected]   street2    LA

merged_df

          email_address    name   surname    street  city
0  [email protected]    john     smith   street1    NY
1  [email protected]    john     smith   street1    NY
2  [email protected]    john     smith   street1    NY
3  [email protected]    john     smith   street1    NY
4       [email protected]   elvis   presley   street2    LA
5       [email protected]   elvis   presley   street2    LA

My question is, shouldn't the result look like this instead?

This is what I would like merged_df to look like.

          email_address    name   surname    street  city
0  [email protected]    john     smith   street1    NY
1  [email protected]    john     smith   street1    NY
2       [email protected]   elvis   presley   street2    LA

Are there any ways I can achieve this?

Roberto Bertinetti asked Aug 18 '16 13:08




1 Answer

list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1, list_2_nodups, on=['email_address'])


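The duplicate rows are expected. Each john smith row in list_1 matches each john smith row in list_2, so two rows on each side produce 2 × 2 = 4 rows in the merged result. To avoid that, I had to drop the duplicates from one of the lists; I chose list_2. (list_1 and list_2 here play the role of df1 and df2 from the question.)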
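To make the row multiplication concrete, here is a minimal sketch that rebuilds the two frames from the tables as printed in the question. The email addresses are obscured there, so john@example.com and elvis@example.com are placeholders I made up. With these three-row frames elvis appears once on each side, so the unfiltered merge gives 5 rows here; the row counts in the question's own output may reflect data not shown. The last part uses the validate argument of pd.merge as an optional check that the right-hand keys are unique; that check is my addition, not something the question or the answer above used.

```python
import pandas as pd
from pandas.errors import MergeError

# Placeholder addresses (assumption): the real ones are obscured in the question.
df1 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "name": ["john", "john", "elvis"],
    "surname": ["smith", "smith", "presley"],
})
df2 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "street": ["street1", "street1", "street2"],
    "city": ["NY", "NY", "LA"],
})

# A merge emits every matching pair of rows: john 2 x 2 = 4, elvis 1 x 1 = 1.
merged_all = pd.merge(df1, df2, on=["email_address"], how="inner")

# Dropping exact duplicate rows from one side leaves one address row per email,
# so each row of df1 matches exactly once.
df2_nodups = df2.drop_duplicates()
merged_df = pd.merge(df1, df2_nodups, on=["email_address"], how="inner")

print(len(merged_all))  # 5
print(len(merged_df))   # 3

# Optional check: validate="many_to_one" raises MergeError when the
# right-hand merge keys are not unique.
try:
    pd.merge(df1, df2, on=["email_address"], validate="many_to_one")
    right_keys_unique = True
except MergeError:
    right_keys_unique = False

print(right_keys_unique)  # False
```

Note that drop_duplicates() with no arguments only removes rows that are identical in every column. If df2 could hold two different addresses for the same email, you would need to decide which one to keep, for example with drop_duplicates(subset=['email_address']).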

piRSquared answered Sep 23 '22 18:09