I have two dataframes that I want to merge, but my key column contains duplicates. The first dataframe (df1) looks like this:
Name,amount,id
John,500.25,GH10
Helen,1250.00,GH11
Adam,432.54,GH11
Sarah,567.12,GH12
and the second (df2) looks like this:
Category,amount,id
Food,500.25,GH10
Travel,1250.00,GH11
Food,432.54,GH11
I'm performing a merge with an outer join on them, to include everything in the merged table:
merged_table = pd.merge(df1,df2,on="id",how='outer')
And my output is:
Name,amount_x,id,Category,amount_y
John,500.25,GH10,Food,500.25
Helen,1250.00,GH11,Travel,1250.00
Helen,1250.00,GH11,Food,432.54
Adam,432.54,GH11,Travel,1250.00
Adam,432.54,GH11,Food,432.54
Sarah,567.12,GH12
However, my desired output is:
Name,amount_x,id,Category,amount_y
John,500.25,GH10,Food,500.25
Helen,1250.00,GH11,Travel,1250.00
Adam,432.54,GH11,Food,432.54
Sarah,567.12,GH12
So what's happening here is that each record with a duplicated key is matched with every record in the other table, so the output has four GH11 rows instead of two, and the two in the middle (rows 2 and 3) are unwanted.
Two solutions come to mind:

1. Somehow prevent the duplicated rows from being created. I can't use drop_duplicates() before the merge, because that would exclude some of the rows that share a key. The amount column should hold the same two values in both tables, but there is a very small chance they may differ, so I can't deduplicate on it either.

2. Merge the way I'm doing it now, but then drop rows 2 and 3 and keep rows 1 and 4 whenever the id is duplicated. Since the matching connects the first df1 row with the first df2 row, then with the second df2 row, and then the second df1 row with the first and second df2 rows, rows 1 and 4 are the correct ones.

For the second approach I'm thinking of using .apply() with some lambda function, but I can't really wrap my head around how to write it correctly.
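For what it's worth, the second idea can be sketched without .apply() at all: within each id block of the outer-merged frame, keep only the "diagonal" pairings (first-with-first, second-with-second). This is only a rough sketch of that approach; the names `merged`, `n2`, `width`, `keep` and `result` are illustrative, not from the question:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["John", "Helen", "Adam", "Sarah"],
                    "amount": [500.25, 1250.00, 432.54, 567.12],
                    "id": ["GH10", "GH11", "GH11", "GH12"]})
df2 = pd.DataFrame({"Category": ["Food", "Travel", "Food"],
                    "amount": [500.25, 1250.00, 432.54],
                    "id": ["GH10", "GH11", "GH11"]})

# Outer merge produces the full cartesian product per duplicated id.
merged = pd.merge(df1, df2, on="id", how="outer")

n2 = df2.groupby("id").size()                        # rows per id in df2
k = merged.groupby("id").cumcount()                  # position inside each id block
width = merged["id"].map(n2).fillna(1).astype(int)   # block width; 1 for unmatched ids
# For a block of width w, row k pairs df1-row k // w with df2-row k % w;
# keeping the diagonal (k // w == k % w) keeps rows 1 and 4 of a 2x2 block.
keep = (k // width) == (k % width)
result = merged[keep].reset_index(drop=True)
print(result)
```

This keeps John, Helen/Travel, Adam/Food and the unmatched Sarah row, i.e. the desired output above.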
I suggest creating a new helper column that numbers the occurrences of each id with cumcount, and then merging on both id and this helper:
df1['g'] = df1.groupby('id').cumcount()
df2['g'] = df2.groupby('id').cumcount()
merged_table = pd.merge(df1, df2, on=['id', 'g'], how='outer')
print(merged_table)
Name amount_x id g Category amount_y
0 John 500.25 GH10 0 Food 500.25
1 Helen 1250.00 GH11 0 Travel 1250.00
2 Adam 432.54 GH11 1 Food 432.54
3 Sarah 567.12 GH12 0 NaN NaN
And last, remove the helper column g:
merged_table = pd.merge(df1, df2, on=['id', 'g'], how='outer').drop('g', axis=1)
print(merged_table)
Name amount_x id Category amount_y
0 John 500.25 GH10 Food 500.25
1 Helen 1250.00 GH11 Travel 1250.00
2 Adam 432.54 GH11 Food 432.54
3 Sarah 567.12 GH12 NaN NaN
Detail:
print(df1)
Name amount id g
0 John 500.25 GH10 0
1 Helen 1250.00 GH11 0
2 Adam 432.54 GH11 1
3 Sarah 567.12 GH12 0
print(df2)
Category amount id g
0 Food 500.25 GH10 0
1 Travel 1250.00 GH11 0
2 Food 432.54 GH11 1
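For reference, the whole answer collapses into one self-contained script; the data values are copied from the question:

```python
import pandas as pd

# Rebuild the example frames from the question.
df1 = pd.DataFrame({"Name": ["John", "Helen", "Adam", "Sarah"],
                    "amount": [500.25, 1250.00, 432.54, 567.12],
                    "id": ["GH10", "GH11", "GH11", "GH12"]})
df2 = pd.DataFrame({"Category": ["Food", "Travel", "Food"],
                    "amount": [500.25, 1250.00, 432.54],
                    "id": ["GH10", "GH11", "GH11"]})

# Number each repeated id within its group: the first GH11 gets 0, the second 1.
df1["g"] = df1.groupby("id").cumcount()
df2["g"] = df2.groupby("id").cumcount()

# Merging on ["id", "g"] pairs the n-th GH11 row of df1 with the n-th of df2,
# so no cartesian product appears; drop the helper column afterwards.
merged_table = pd.merge(df1, df2, on=["id", "g"], how="outer").drop("g", axis=1)
print(merged_table)
```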