Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding duplicate rows in a Pandas Dataframe then Adding a column in the Dataframe that states if the row is a duplicate

I have a pandas dataframe that contains a column with possible duplicates. I would like to create a column that will produce a 1 if the row is duplicate and 0 if it is not.

So if I have:

     A|B
1    1|x
2    2|y
3    1|x
4    3|z

I would get:

     A|B|C
1    1|x|1
2    2|y|0
3    1|x|1
4    3|z|0

I tried df['C'] = np.where(df['A']==df['A'], '1', '0') but this just created a column of all 1's in C.

like image 382
Prof. Falken Avatar asked Jul 30 '17 21:07

Prof. Falken


1 Answers

You need Series.duplicated with parameter keep=False for all duplicates first, then cast boolean mask (Trues and Falses) to 1s and 0s by astype by int and if necessary then cast to str:

df['C'] = df['A'].duplicated(keep=False).astype(int).astype(str)
print (df)
   A  B  C
1  1  x  1
2  2  y  0
3  1  x  1
4  3  z  0

If need check duplicates in columns A and B together use DataFrame.duplicated:

df['C'] = df.duplicated(subset=['A','B'], keep=False).astype(int).astype(str)
print (df)
   A  B  C
1  1  x  1
2  2  y  0
3  1  x  1
4  3  z  0

And numpy.where solution:

df['C'] = np.where(df['A'].duplicated(keep=False), '1', '0')
print (df)
   A  B  C
1  1  x  1
2  2  y  0
3  1  x  1
4  3  z  0
like image 50
jezrael Avatar answered Sep 30 '22 20:09

jezrael