Pandas merge creates unwanted duplicate entries

Question

I'm new to Pandas and I want to merge two datasets that have similar columns. The columns are going to each have some unique values compared to the other column, in addition to many identical values. There are some duplicates in each column that I'd like to keep. My desired output is shown below. Adding how='inner' or 'outer' does not yield the desired result.

import pandas as pd

df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})

print(pd.merge(df1,df2))

output:
   A
0  2
1  2
2  2
3  2
4  3
5  4
6  5

desired/expected output:
   A
0  2
1  2
2  3
3  4
4  5

Please let me know how/if I can achieve the desired output using merge, thank you!

EDIT To clarify why I'm confused about this behavior, if I simply add another column, it doesn't make four 2's but rather there are only two 2's, so I would expect that in my first example it would also have the two 2's. Why does the behavior seem to change, what's pandas doing?

import pandas as pd
df1 = df2 = pd.DataFrame(
    {'A': [2,2,3,4,5], 'B': ['red','orange','yellow','green','blue']}
)

print(pd.merge(df1,df2))

output:
   A       B
0  2     red
1  2  orange
2  3  yellow
3  4   green
4  5    blue

However, based on the first example I would expect:
   A       B
0  2     red
1  2  orange
2  2     red
3  2  orange
4  3  yellow
5  4   green
6  5    blue

Rene · Accepted Answer

import pandas as pd

dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}

df1 = pd.DataFrame(dict1).reset_index()
df2 = pd.DataFrame(dict2).reset_index()

df = df1.merge(df2, on = 'A')
df = pd.DataFrame(df[df.index_x==df.index_y]['A'], columns=['A']).reset_index(drop=True)

print(df)

Output:

Chris Dixon · Answer

The duplicates are caused by duplicate entries in the target table's columns you're joining on (df2['A']). We can remove duplicates while making the join without permanently altering df2:

df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})

join_cols = ['A']

merged = pd.merge(df1, df2[df2.duplicated(subset=join_cols, keep='first') == False],  on=join_cols)

Note we defined join_cols, ensuring columns being joined and columns duplicates are being removed on match.

Pandas merge creates unwanted duplicate entries

Tags:

python

merge

pandas

Nic Scozzaro

2 Answers

Rene

Chris Dixon

Recent Activity

Donate For Us

Pandas merge creates unwanted duplicate entries

Tags:

python

merge

pandas

Nic Scozzaro

2 Answers

Rene

Chris Dixon

Related questions

Recent Activity

Donate For Us