I have two dataframes that I want to merge, but my key column contains duplicates. The first dataframe (df1) looks like this:
Name,amount,id
John,500.25,GH10
Helen,1250.00,GH11
Adam,432.54,GH11
Sarah,567.12,GH12
and the second (df2) looks like this:
Category,amount,id
Food,500.25,GH10
Travel,1250.00,GH11
Food,432.54,GH11
I'm performing a merge with an outer join on them, to include everything in the merged table:
merged_table = pd.merge(df1,df2,on="id",how='outer')
And my output is:
Name,amount_x,id,Category,amount_y
John,500.25,GH10,Food,500.25
Helen,1250.00,GH11,Travel,1250.00
Helen,1250.00,GH11,Food,432.54
Adam,432.54,GH11,Travel,1250.00
Adam,432.54,GH11,Food,432.54
Sarah,567.12,GH12
However, my desired output is:
Name,amount_x,id,Category,amount_y
John,500.25,GH10,Food,500.25
Helen,1250.00,GH11,Travel,1250.00
Adam,432.54,GH11,Food,432.54
Sarah,567.12,GH12
So what's happening here is that each record with a duplicated key is matched with every record in the other table, so the output has four GH11 rows instead of two, and the two in the middle (rows 2 and 3) are unwanted.
Two solutions come to mind:

1. Somehow prevent the duplicated rows from being created. I can't use drop_duplicates() before the merge, because that would exclude some of the rows that share a key. The amount column should hold the same two values in both tables, but there is a very small chance they may differ, so I can't deduplicate on it either.

2. Merge the way I'm doing it now, but then drop rows 2 and 3 and keep rows 1 and 4 whenever the id is duplicated. Since the matching connects the first df1 row with the first df2 row, then with the second df2 row, and then the second df1 row with the first and second df2 rows, rows 1 and 4 are the correct ones.

For the second approach I'm thinking of using .apply() with some lambda function, but I can't really wrap my head around how to write it correctly.
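For what it's worth, the second idea can be sketched without .apply() at all: within each id block of the outer-merged frame, keep only the "diagonal" pairings (first-with-first, second-with-second). This is only a rough sketch of that approach; the names `merged`, `n2`, `width`, `keep` and `result` are illustrative, not from the question:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["John", "Helen", "Adam", "Sarah"],
                    "amount": [500.25, 1250.00, 432.54, 567.12],
                    "id": ["GH10", "GH11", "GH11", "GH12"]})
df2 = pd.DataFrame({"Category": ["Food", "Travel", "Food"],
                    "amount": [500.25, 1250.00, 432.54],
                    "id": ["GH10", "GH11", "GH11"]})

# Outer merge produces the full cartesian product per duplicated id.
merged = pd.merge(df1, df2, on="id", how="outer")

n2 = df2.groupby("id").size()                        # rows per id in df2
k = merged.groupby("id").cumcount()                  # position inside each id block
width = merged["id"].map(n2).fillna(1).astype(int)   # block width; 1 for unmatched ids
# For a block of width w, row k pairs df1-row k // w with df2-row k % w;
# keeping the diagonal (k // w == k % w) keeps rows 1 and 4 of a 2x2 block.
keep = (k // width) == (k % width)
result = merged[keep].reset_index(drop=True)
print(result)
```

This keeps John, Helen/Travel, Adam/Food and the unmatched Sarah row, i.e. the desired output above.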
I suggest creating a new helper column that numbers the occurrences of each id with cumcount, and then merging on both id and this helper:
df1['g'] = df1.groupby('id').cumcount()
df2['g'] = df2.groupby('id').cumcount()
merged_table = pd.merge(df1, df2, on=['id', 'g'], how='outer')
print(merged_table)
Name amount_x id g Category amount_y
0 John 500.25 GH10 0 Food 500.25
1 Helen 1250.00 GH11 0 Travel 1250.00
2 Adam 432.54 GH11 1 Food 432.54
3 Sarah 567.12 GH12 0 NaN NaN
And last, remove the helper column g:
merged_table = pd.merge(df1, df2, on=['id', 'g'], how='outer').drop('g', axis=1)
print(merged_table)
Name amount_x id Category amount_y
0 John 500.25 GH10 Food 500.25
1 Helen 1250.00 GH11 Travel 1250.00
2 Adam 432.54 GH11 Food 432.54
3 Sarah 567.12 GH12 NaN NaN
Detail:
print(df1)
Name amount id g
0 John 500.25 GH10 0
1 Helen 1250.00 GH11 0
2 Adam 432.54 GH11 1
3 Sarah 567.12 GH12 0
print(df2)
Category amount id g
0 Food 500.25 GH10 0
1 Travel 1250.00 GH11 0
2 Food 432.54 GH11 1
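For reference, the whole answer collapses into one self-contained script; the data values are copied from the question:

```python
import pandas as pd

# Rebuild the example frames from the question.
df1 = pd.DataFrame({"Name": ["John", "Helen", "Adam", "Sarah"],
                    "amount": [500.25, 1250.00, 432.54, 567.12],
                    "id": ["GH10", "GH11", "GH11", "GH12"]})
df2 = pd.DataFrame({"Category": ["Food", "Travel", "Food"],
                    "amount": [500.25, 1250.00, 432.54],
                    "id": ["GH10", "GH11", "GH11"]})

# Number each repeated id within its group: the first GH11 gets 0, the second 1.
df1["g"] = df1.groupby("id").cumcount()
df2["g"] = df2.groupby("id").cumcount()

# Merging on ["id", "g"] pairs the n-th GH11 row of df1 with the n-th of df2,
# so no cartesian product appears; drop the helper column afterwards.
merged_table = pd.merge(df1, df2, on=["id", "g"], how="outer").drop("g", axis=1)
print(merged_table)
```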