I have a pandas DataFrame with duplicate values for a set of columns. For example:
df = pd.DataFrame({'Column1': {0: 1, 1: 2, 2: 3}, 'Column2': {0: 'ABC', 1: 'XYZ', 2: 'ABC'}, 'Column3': {0: 'DEF', 1: 'DEF', 2: 'DEF'}, 'Column4': {0: 10, 1: 40, 2: 10})
In [2]: df
Out[2]:
Column1 Column2 Column3 Column4 is_duplicated dup_index
0 1 ABC DEF 10 False 0
1 2 XYZ DEF 40 False 1
2 3 ABC DEF 10 True 0
Row (1) and (3) are same. Essentially, Row (3) is a duplicate of Row (1).
I am looking for the following output:
Is_Duplicate
, containing whether the row is a duplicate or not [can be accomplished by using "duplicated" method on dataframe columns (Column2, Column3 and Column4)]
Dup_Index
the original index of the duplicate row.
In [3]: df
Out[3]:
Column1 Column2 Column3 Column4 Is_Duplicate Dup_Index
0 1 ABC DEF 10 False 0
1 2 XYZ DEF 40 False 1
2 3 ABC DEF 10 True 0
The pandas. DataFrame. duplicated() method is used to find duplicate rows in a DataFrame. It returns a boolean series which identifies whether a row is duplicate or unique.
Remove All Duplicate Rows from Pandas DataFrame You can set 'keep=False' in the drop_duplicates() function to remove all the duplicate rows. For E.x, df. drop_duplicates(keep=False) .
You can use the duplicated() function to find duplicate values in a pandas DataFrame.
Pandas DataFrame. duplicated() function is used to get/find/select a list of all duplicate rows(all or selected columns) from pandas. Duplicate rows means, having multiple rows on all columns. Using this method you can get duplicate rows on selected multiple columns or all columns.
There is a DataFrame method duplicated
for the first column:
In [11]: df.duplicated(['Column2', 'Column3', 'Column4'])
Out[11]:
0 False
1 False
2 True
In [12]: df['is_duplicated'] = df.duplicated(['Column2', 'Column3', 'Column4'])
To do the second you could try something like this:
In [13]: g = df.groupby(['Column2', 'Column3', 'Column4'])
In [14]: df1 = df.set_index(['Column2', 'Column3', 'Column4'])
In [15]: df1.index.map(lambda ind: g.indices[ind][0])
Out[15]: array([0, 1, 0])
In [16]: df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][0])
In [17]: df
Out[17]:
Column1 Column2 Column3 Column4 is_duplicated dup_index
0 1 ABC DEF 10 False 0
1 2 XYZ DEF 40 False 1
2 3 ABC DEF 10 True 0
Let's say your dataframe is stored in df
.
You can use groupby to get non duplicated rows of your dataframe. Here we have to ignore Column1 that is not part of the data:
df_nodup = df.groupby(by=['Column2', 'Column3', 'Column4']).first()
you can then merge this new dataframe with the original one by using the merge function:
df = df.merge(df_nodup, left_on=['Column2', 'Column3', 'Column4'], right_index=True, suffixes=('', '_dupindex'))
You can eventually use the _dupindex column merged in the dataframe to make the simple math to add the columns needed:
df['Is_Duplicate'] = df['Column1']!=df['Column1_dupindex']
df['Dup_Index'] = None
df['Dup_Index'] = df['Dup_Index'].where(df['Column1_dupindex']==df['Column1'], df['Column1_dupindex'])
del df['Column1_dupindex']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With