I have a pandas dataframe that contains a column with possible duplicates. I would like to create a column that will produce a 1 if the row is duplicate and 0 if it is not.
So if I have:
A|B
1 1|x
2 2|y
3 1|x
4 3|z
I would get:
A|B|C
1 1|x|1
2 2|y|0
3 1|x|1
4 3|z|0
I tried df['C'] = np.where(df['A']==df['A'], '1', '0')
but this just created a column of all 1's in C.
You need Series.duplicated
with parameter keep=False
for all duplicates first, then cast boolean mask (True
s and False
s) to 1
s and 0
s by astype
by int
and if necessary then cast to str
:
df['C'] = df['A'].duplicated(keep=False).astype(int).astype(str)
print (df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0
If need check duplicates in columns A
and B
together use DataFrame.duplicated
:
df['C'] = df.duplicated(subset=['A','B'], keep=False).astype(int).astype(str)
print (df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0
And numpy.where
solution:
df['C'] = np.where(df['A'].duplicated(keep=False), '1', '0')
print (df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With