I am currently merging two dataframes with an outer join. However, after merging, I see all the rows are duplicated even when the columns that I merged upon contain the same values.
Specifically, I have the following code.
merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')  Here are the two dataframes and the results.
df1
          email_address    name   surname 0  [email protected]    john     smith 1  [email protected]    john     smith 2       [email protected]   elvis   presley  df2
          email_address    street  city 0  [email protected]   street1    NY 1  [email protected]   street1    NY 2       [email protected]   street2    LA  merged_df
          email_address    name   surname    street  city 0  [email protected]    john     smith   street1    NY 1  [email protected]    john     smith   street1    NY 2  [email protected]    john     smith   street1    NY 3  [email protected]    john     smith   street1    NY 4       [email protected]   elvis   presley   street2    LA 5       [email protected]   elvis   presley   street2    LA  My question is, shouldn't it be like this?
This is how I would like my merged_df to be like.
          email_address    name   surname    street  city 0  [email protected]    john     smith   street1    NY 1  [email protected]    john     smith   street1    NY 2       [email protected]   elvis   presley   street2    LA  Are there any ways I can achieve this?
To concatenate DataFrames, use the concat() method, but to ignore duplicates, use the drop_duplicates() method.
Use DataFrame. drop_duplicates() to Drop Duplicate and Keep First Rows. You can use DataFrame. drop_duplicates() without any arguments to drop rows with the same values on all columns.
merge() function to join the two data frames by inner join. Now, add a suffix called 'remove' for newly joined columns that have the same name in both data frames. Use the drop() function to remove the columns with the suffix 'remove'. This will ensure that identical columns don't exist in the new dataframe.
list_2_nodups = list_2.drop_duplicates() pd.merge(list_1 , list_2_nodups , on=['email_address'])   
The duplicate rows are expected.  Each john smith in list_1 matches with each john smith in list_2.  I had to drop the duplicates in one of the lists.  I chose list_2.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With