Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas merge with duplicated key - removing duplicated rows or preventing it's creation

I have two dataframes that i want to merge, but my key column contains duplicates. Dataframes looks like this:

Name,amount,id
John,500.25,GH10
Helen,1250.00,GH11
Adam,432.54,GH11
Sarah,567.12,GH12

Category,amount,id
Food,500.25,GH10
Travel,1250.00,GH11
Food,432.54,GH11

And I'm performing on it merge with outer join to include everything in merged table:

merged_table = pd.merge(df1,df2,on="id",how='outer')

And my output is:

Name,amount_x,id,category,amount_y
John,500.25,GH10,Food,500.25
Helen,1250.00,GH11,Travel,1250.00
Helen,1250.00,GH11,Food,432.54
Adam,432.54,GH11,Travel,1250.00
Adam,432.54,GH11,Food,432.54
Sarah,567.12,GH12

However, my desired output is:

Name,amount_x,id,category,amount_y
John,500.25,GH10,Food,500.25
Helen,1250.00,GH11,Travel,1250.00
Adam,432.54,GH11,Food,432.54
Sarah,567.12,GH12

So what's happening here is that each record with duplicated key is matched with every record on other table, so the output have 4 rows instead of 2, and these two in the middle (row 2 and 3) are unwanted.

So the solutions that comes to my mind:

  1. Preventing somehow creation of duplicated rows. I can't use drop_duplicates() before merge, because then i would exclude some of the rows with doubled key. But the other column, Amount, should have the same 2 values on both tables, but there is very small possibility that they may differ.

  2. Using merge in the same way as i'm doing it, but then dropping rows 2 and 3 and keeping rows 1 and 4, if ID is duplicated, because as matching goes in way where first row in df1 is connected with first row in df2, then second row in df2, and then second row from df1 is connected with first row in df2 and then with the second, rows 1 and 4 are the one that are correct.

I'm think here of using .apply() and writing some lambda function, but i can't really wrap my head around how it should be written correctly.

like image 439
sartek Avatar asked Aug 03 '18 09:08

sartek


People also ask

How avoid duplicates in Pandas merge?

merge() function to join the two data frames by inner join. Now, add a suffix called 'remove' for newly joined columns that have the same name in both data frames. Use the drop() function to remove the columns with the suffix 'remove'. This will ensure that identical columns don't exist in the new dataframe.

Which Pandas function using to removed duplicate rows?

Pandas drop_duplicates() Function Syntax If 'first', duplicate rows except the first one is deleted. If 'last', duplicate rows except the last one is deleted.

Does Panda concat remove duplicates?

By default, when you concatenate two dataframes with duplicate records, Pandas automatically combine them together without removing the duplicate rows.

What does Pandas merge do?

“Merging” two datasets is the process of bringing two datasets together into one, and aligning the rows from each based on common attributes or columns. The words “merge” and “join” are used relatively interchangeably in Pandas and other languages.


1 Answers

I suggest create new helper column for count id values by cumcount and then merge by this values:

df1['g'] = df1.groupby('id').cumcount()
df2['g'] = df2.groupby('id').cumcount()

merged_table = pd.merge(df1,df2,on=["id", 'g'],how='outer')
print (merged_table)
    Name  amount_x    id  g Category  amount_y
0   John    500.25  GH10  0     Food    500.25
1  Helen   1250.00  GH11  0   Travel   1250.00
2   Adam    432.54  GH11  1     Food    432.54
3  Sarah    567.12  GH12  0      NaN       NaN

And last remove id:

merged_table = pd.merge(df1,df2,on=["id", 'g'],how='outer').drop('g', axis=1)
print (merged_table)
    Name  amount_x    id Category  amount_y
0   John    500.25  GH10     Food    500.25
1  Helen   1250.00  GH11   Travel   1250.00
2   Adam    432.54  GH11     Food    432.54
3  Sarah    567.12  GH12      NaN       NaN 

Detail:

print (df1)
    Name   amount    id  g
0   John   500.25  GH10  0
1  Helen  1250.00  GH11  0
2   Adam   432.54  GH11  1
3  Sarah   567.12  GH12  0

print (df2)
  Category   amount    id  g
0     Food   500.25  GH10  0
1   Travel  1250.00  GH11  0
2     Food   432.54  GH11  1
like image 110
jezrael Avatar answered Oct 27 '22 22:10

jezrael