Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas merge creates unwanted duplicate entries

I'm new to Pandas and I want to merge two datasets that have similar columns. The columns are going to each have some unique values compared to the other column, in addition to many identical values. There are some duplicates in each column that I'd like to keep. My desired output is shown below. Adding how='inner' or 'outer' does not yield the desired result.

import pandas as pd

df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})

print(pd.merge(df1,df2))

output:
   A
0  2
1  2
2  2
3  2
4  3
5  4
6  5

desired/expected output:
   A
0  2
1  2
2  3
3  4
4  5

Please let me know how/if I can achieve the desired output using merge, thank you!

EDIT To clarify why I'm confused about this behavior, if I simply add another column, it doesn't make four 2's but rather there are only two 2's, so I would expect that in my first example it would also have the two 2's. Why does the behavior seem to change, what's pandas doing?

import pandas as pd
df1 = df2 = pd.DataFrame(
    {'A': [2,2,3,4,5], 'B': ['red','orange','yellow','green','blue']}
)

print(pd.merge(df1,df2))

output:
   A       B
0  2     red
1  2  orange
2  3  yellow
3  4   green
4  5    blue

However, based on the first example I would expect:
   A       B
0  2     red
1  2  orange
2  2     red
3  2  orange
4  3  yellow
5  4   green
6  5    blue
like image 725
Nic Scozzaro Avatar asked Feb 24 '17 16:02

Nic Scozzaro


2 Answers

import pandas as pd

dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}

df1 = pd.DataFrame(dict1).reset_index()
df2 = pd.DataFrame(dict2).reset_index()

df = df1.merge(df2, on = 'A')
df = pd.DataFrame(df[df.index_x==df.index_y]['A'], columns=['A']).reset_index(drop=True)

print(df)

Output:

   A
0  2
1  2
2  3
3  4
4  5
like image 168
Rene Avatar answered Sep 22 '22 15:09

Rene


The duplicates are caused by duplicate entries in the target table's columns you're joining on (df2['A']). We can remove duplicates while making the join without permanently altering df2:

df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})

join_cols = ['A']

merged = pd.merge(df1, df2[df2.duplicated(subset=join_cols, keep='first') == False],  on=join_cols)

Note we defined join_cols, ensuring columns being joined and columns duplicates are being removed on match.

like image 27
Chris Dixon Avatar answered Sep 22 '22 15:09

Chris Dixon